
Classifying news versus opinions in newspapers: Linguistic features for domain independence

Published online by Cambridge University Press:  21 February 2017

K. R. KRÜGER
A. LUKOWIAK
J. SONNTAG
S. WARZECHA
M. STEDE
Affiliation (all authors):
University of Potsdam, FSP Cognitive Science, Applied Computational Linguistics, Karl-Liebknecht-Straße 24-25, 14476 Potsdam, Germany

Abstract

Newspaper text can be broadly divided into the classes ‘opinion’ (editorials, commentary, letters to the editor) and ‘neutral’ (reports). We describe a classification system for performing this separation, which uses a set of linguistically motivated features. Working with various English newspaper corpora, we demonstrate that it significantly outperforms bag-of-lemma and PoS-tag models. We conclude that the linguistic features constitute the best method for achieving robustness against a change of newspaper or domain.

Type
Articles
Copyright
Copyright © Cambridge University Press 2017 

