Syntactic methods for topic-independent authorship attribution

Published online by Cambridge University Press:  09 August 2017

JOHANNA BJÖRKLUND
Affiliation:
Department of Computer Science, Umeå University, 901 87 Umeå, Sweden e-mails: [email protected], [email protected]
NIKLAS ZECHNER
Affiliation:
Department of Computer Science, Umeå University, 901 87 Umeå, Sweden e-mails: [email protected], [email protected]

Abstract

The efficacy of syntactic features for topic-independent authorship attribution is evaluated, taking a feature set of frequencies of words and punctuation marks as baseline. The features are ‘deep’ in the sense that they are derived by parsing the subject texts, in contrast to ‘shallow’ syntactic features for which a part-of-speech analysis is enough. The experiments are made on two corpora of online texts and one corpus of novels written around the year 1900. The classification tasks include classical closed-world authorship attribution, identification of separate texts among the works of one author, and cross-topic authorship attribution. In the first tasks, the feature sets were fairly evenly matched, but for the last task, the syntax-based feature set outperformed the baseline feature set. These results suggest that, compared to lexical features, syntactic features are more robust to changes in topic.

Type
Articles
Copyright
Copyright © Cambridge University Press 2017 
