Automated Text Classification of News Articles: A Practical Guide

Pablo Barberá; Amber E. Boydstun; Suzanna Linn; Ryan McMahon; Jonathan Nagler

doi:10.1017/pan.2020.8

Published online by Cambridge University Press: 09 June 2020

Pablo Barberá*: Associate Professor of Political Science and International Relations, University of Southern California, Los Angeles, CA 90089, USA. Email: [email protected]
Amber E. Boydstun: Associate Professor of Political Science, University of California, Davis, CA 95616, USA. Email: [email protected]
Suzanna Linn: Liberal Arts Professor of Political Science, Department of Political Science, Penn State University, University Park, PA 16802, USA. Email: [email protected]
Ryan McMahon: PhD Graduate, Department of Political Science, Penn State University, University Park, PA 16802, USA (now at Google). Email: [email protected]
Jonathan Nagler: Professor of Politics and co-Director of the Center for Social Media and Politics, New York University, New York, NY 10012, USA. Email: [email protected]
* Email: [email protected]

Abstract

Automated text analysis methods have made possible the classification of large corpora of text by measures such as topic and tone. Here, we provide a guide to help researchers navigate the consequential decisions they need to make before any measure can be produced from the text. We consider, both theoretically and empirically, the effects of such choices using as a running example efforts to measure the tone of New York Times coverage of the economy. We show that two reasonable approaches to corpus selection yield radically different corpora and we advocate for the use of keyword searches rather than predefined subject categories provided by news archives. We demonstrate the benefits of coding using article segments instead of sentences as units of analysis. We show that, given a fixed number of codings, it is better to increase the number of unique documents coded rather than the number of coders for each document. Finally, we find that supervised machine learning algorithms outperform dictionaries on a number of criteria. Overall, we intend this guide to serve as a reminder to analysts that thoughtfulness and human validation are key to text-as-data methods, particularly in an age when it is all too easy to computationally classify texts without attending to the methodological choices therein.
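
To make the contrast between dictionaries and supervised learning concrete, the following is a minimal Python sketch under stated assumptions. The word lists, training snippets, and function names are hypothetical and are not taken from the article or its replication materials. The naive Bayes classifier is a stand-in chosen for brevity, not necessarily an algorithm the authors evaluate. The sketch shows only how the two approaches differ mechanically: a dictionary scores text against a fixed word list, while a supervised classifier learns word weights from human-coded examples.

```python
import math
from collections import Counter

# Hypothetical toy word lists; NOT the dictionaries evaluated in the article.
POSITIVE = {"growth", "gains", "recovery", "strong"}
NEGATIVE = {"recession", "losses", "layoffs", "weak"}


def tokenize(text):
    """Lowercase and split on whitespace (deliberately simplistic)."""
    return text.lower().split()


def dictionary_tone(text):
    """Return +1, -1, or 0 from the balance of positive and negative words."""
    tokens = tokenize(text)
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return (score > 0) - (score < 0)


def train_naive_bayes(docs, labels):
    """Collect per-class word counts from human-coded documents."""
    word_counts = {label: Counter() for label in set(labels)}
    class_counts = Counter(labels)
    for text, label in zip(docs, labels):
        word_counts[label].update(tokenize(text))
    vocab = set().union(*word_counts.values())
    return word_counts, class_counts, vocab


def predict_naive_bayes(model, text):
    """Pick the class with the highest log-probability (add-one smoothing)."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_logp = None, -math.inf
    for label in sorted(class_counts):
        logp = math.log(class_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in tokenize(text):
            if token in vocab:  # words never seen in training are ignored
                logp += math.log((word_counts[label][token] + 1) / denom)
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label


# Hypothetical human-coded training snippets (1 = positive tone, -1 = negative).
train_docs = [
    "hiring picked up and wages climbed",
    "exports boomed and hiring climbed",
    "foreclosures mounted and wages slumped",
    "exports slumped and foreclosures mounted",
]
train_labels = [1, 1, -1, -1]
model = train_naive_bayes(train_docs, train_labels)

new_text = "foreclosures mounted again"
print("dictionary:", dictionary_tone(new_text))
print("supervised:", predict_naive_bayes(model, new_text))
```

In this toy setup the dictionary returns a neutral score for text that contains none of its listed words, whereas the trained classifier can draw on vocabulary it has seen in coded examples. This illustrates one mechanism behind the comparison and is not evidence for the article's empirical finding.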

Keywords

statistical analysis of texts; automated content analysis; content analysis

Type: Articles
Information: Political Analysis, Volume 29, Issue 1, January 2021, pp. 19–42

DOI: https://doi.org/10.1017/pan.2020.8
Copyright: Copyright © The Author(s) 2020. Published by Cambridge University Press on behalf of the Society for Political Methodology.

Footnotes

Contributing Editor: Jeff Gill

Barberá et al. Dataset

https://doi.org/10.7910/DVN/MXKRDE
