Text Preprocessing For Unsupervised Learning: Why It Matters, When It Misleads, And What To Do About It

Matthew J. Denny; Arthur Spirling

doi:10.1017/pan.2017.44

Published online by Cambridge University Press: 19 March 2018

Matthew J. Denny: 203 Pond Lab, Pennsylvania State University, University Park, PA 16802, USA.
Arthur Spirling: Office 405, 19 West 4th St., New York University, New York, NY 10012, USA.

Abstract

Despite the popularity of unsupervised techniques for political science text-as-data research, the importance and implications of preprocessing decisions in this domain have received scant systematic attention. Yet, as we show, such decisions have profound effects on the results of real models for real data. We argue that substantive theory is typically too vague to be of use for feature selection, and that the supervised literature is not necessarily a helpful source of advice. To aid researchers working in unsupervised settings, we introduce a statistical procedure and software that examines the sensitivity of findings under alternate preprocessing regimes. This approach complements a researcher’s substantive understanding of a problem by providing a characterization of the variability that changes in preprocessing choices may induce when analyzing a particular dataset. In making scholars aware of the degree to which their results are likely to be sensitive to their preprocessing decisions, it aids replication efforts.
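
The idea of checking findings under alternate preprocessing regimes can be illustrated with a minimal Python sketch. It is not the authors' preText procedure or software. It assumes a toy corpus, four binary preprocessing steps, a hand-picked stopword list, and a simple summary (the mean absolute change in pairwise cosine distances relative to no preprocessing); all of these are illustrative choices, not taken from the article.

```python
import itertools
import math
import re
from collections import Counter

# Hypothetical toy corpus; any list of strings would do.
DOCS = [
    "The Budget of 2017 cuts taxes, and the budget raises spending.",
    "Taxes were cut in the 2017 budget; spending was raised.",
    "The committee debated immigration policy for 3 hours.",
]

# Illustrative stopword list (an assumption, not from the article).
STOPWORDS = {"the", "of", "and", "in", "was", "were", "for"}

# Binary preprocessing steps chosen for illustration only.
STEPS = ("lowercase", "remove_punctuation", "remove_numbers", "remove_stopwords")


def preprocess(text, regime):
    """Tokenize `text` under a regime: a dict mapping step name to bool."""
    if regime["lowercase"]:
        text = text.lower()
    if regime["remove_punctuation"]:
        text = re.sub(r"[^\w\s]", " ", text)
    if regime["remove_numbers"]:
        text = re.sub(r"\d+", " ", text)
    tokens = text.split()
    if regime["remove_stopwords"]:
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    return tokens


def cosine_distance(a, b):
    """Cosine distance between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 - dot / norm if norm else 1.0


def pairwise_distances(docs, regime):
    """Distances between every pair of documents under one regime."""
    bags = [Counter(preprocess(d, regime)) for d in docs]
    pairs = itertools.combinations(range(len(bags)), 2)
    return [cosine_distance(bags[i], bags[j]) for i, j in pairs]


def sensitivity(docs):
    """Map each regime (tuple of flags) to its mean absolute shift in
    pairwise distances relative to the no-preprocessing baseline."""
    baseline = pairwise_distances(docs, dict.fromkeys(STEPS, False))
    results = {}
    for flags in itertools.product([False, True], repeat=len(STEPS)):
        dists = pairwise_distances(docs, dict(zip(STEPS, flags)))
        shift = sum(abs(d - b) for d, b in zip(dists, baseline)) / len(baseline)
        results[flags] = shift
    return results


if __name__ == "__main__":
    ranked = sorted(sensitivity(DOCS).items(), key=lambda kv: kv[1])
    for flags, shift in ranked:
        label = "+".join(s for s, on in zip(STEPS, flags) if on) or "none"
        print(f"{shift:.3f}  {label}")
```

Regimes with larger shifts are those under which the documents' relative similarity, and so plausibly any downstream unsupervised result, departs most from the unprocessed text. A real analysis would use the authors' software rather than this stand-in.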

Keywords

statistical analysis of texts; unsupervised learning; descriptive statistics

Type: Articles
Information: Political Analysis, Volume 26, Issue 2, April 2018, pp. 168–189

DOI: https://doi.org/10.1017/pan.2017.44
Copyright: Copyright © The Author(s) 2018. Published by Cambridge University Press on behalf of the Society for Political Methodology.

Footnotes

Authors’ note: We thank Will Lowe, Scott de Marchi and Brandon Stewart for comments on an earlier draft, and Pablo Barbera for providing the Twitter data used in this paper. Audiences at New York University, University of California San Diego, the Political Methodology meeting (2017), Duke University, University of Michigan, and the International Methods Colloquium provided helpful comments. Suggestions from the editor of Political Analysis and two anonymous referees allowed us to improve our article considerably. This research was supported by the National Science Foundation under IGERT Grant DGE-1144860. Replication data for this paper are available via Denny and Spirling (2017). The preText software is available at github.com/matthewjdenny/preText.

Contributing Editor: R. Michael Alvarez

References

Blei, David M., Ng, Andrew Y., and Jordan, Michael I. 2003. Latent Dirichlet allocation. The Journal of Machine Learning Research 3:993–1022.

Buckland, S. T., Burnham, K. P., and Augustin, N. H. 1997. Model selection: An integral part of inference. Biometrics 53(2):603–618.

Catalinac, Amy. 2016. Pork to policy: The rise of programmatic campaigning in Japanese elections. Journal of Politics 78(1):1–18.

Chang, Jonathan, Gerrish, Sean, Wang, Chong, Boyd-Graber, Jordan L., and Blei, David M. 2009. Reading tea leaves: How humans interpret topic models. In Advances in neural information processing systems 22, ed. Bengio, Y., Schuurmans, D., Lafferty, J. D., Williams, C. K. I., and Culotta, A. Curran Associates, Inc., pp. 288–296.

Denny, Matthew, and Spirling, Arthur. 2017. “Dataverse replication data for: Text preprocessing for unsupervised learning: Why it matters, when it misleads, and what to do about it.” http://dx.doi.org/10.7910/DVN/XRR0HM.

Diermeier, Daniel, Godbout, Jean-François, Yu, Bei, and Kaufmann, Stefan. 2011. Language and ideology in Congress. British Journal of Political Science 42(01):31–55.

D’Orazio, Vito, Landis, Steven, Palmer, Glenn, and Schrodt, Philip. 2014. Separating the wheat from the chaff: Applications of automated document classification using support vector machines. Political Analysis 22(2):224–242.

Gelman, Andrew. 2013. Preregistration of studies and mock reports. Political Analysis 21(1):40–41.

Gelman, Andrew, and Loken, Eric. 2014. The statistical crisis in science. American Scientist 102(6):460–465.

Grimmer, J. 2010. A Bayesian hierarchical topic model for political texts: Measuring expressed agendas in Senate press releases. Political Analysis 18(1):1–35.

Grimmer, Justin, and Stewart, Brandon M. 2013. Text as data: The promise and pitfalls of automatic content analysis methods for political texts. Political Analysis 21(3):267–297.

Grimmer, Justin, and King, Gary. 2011. General purpose computer-assisted clustering and conceptualization. Proceedings of the National Academy of Sciences of the United States of America 108(7):2643–2650.

Handler, Abram, Denny, Matthew J., Wallach, Hanna, and O’Connor, Brendan. 2016. Bag of what? Simple noun phrase extraction for text analysis. Proceedings of the workshop on natural language processing and computational social science at the 2016 conference on empirical methods in natural language processing, https://brenocon.com/handler2016phrases.pdf.

Hopkins, Daniel, and King, Gary. 2010. A method of automated nonparametric content analysis for social science. American Journal of Political Science 54(1):229–247.

James, Gareth, Witten, Daniela, Hastie, Trevor, and Tibshirani, Robert. 2013. An introduction to statistical learning. New York: Springer.

Jensen, David D., and Cohen, Paul R. 2000. Multiple comparisons in induction algorithms. Machine Learning 38(3):309–338.

Jones, Tudor. 1996. Remaking the Labour Party: From Gaitskell to Blair. New York: Routledge.

Jurafsky, Daniel, and Martin, James H. 2008. Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition. Prentice Hall.

Justeson, John S., and Katz, Slava M. 1995. Technical terminology: Some linguistic properties and an algorithm for identification in text. Natural Language Engineering 1(01):9–27.

Kavanagh, Dennis. 1997. The reordering of British politics: Politics after Thatcher. Oxford University Press.

King, Gary, Lam, Patrick, and Roberts, Margaret E. 2017. Computer-assisted keyword and document set discovery from unstructured text. American Journal of Political Science. Preprint, https://gking.harvard.edu/files/gking/files/ajps12291_final.pdf.

Lauderdale, Benjamin, and Herzog, Alexander. 2016. Measuring political positions from legislative speech. Political Analysis 24(2):1–21.

Laver, Michael, Benoit, Kenneth, and Garry, John. 2003. Extracting policy positions from political texts using words as data. The American Political Science Review 97(2):311–331.

Lowe, Will. 2008. Understanding wordscores. Political Analysis 16(4 SPEC. ISS.):356–371.

Lowe, Will, and Benoit, Kenneth. 2013. Validating estimates of latent traits from textual data using human judgment as a benchmark. Political Analysis 21(3):298–313.

Manning, Christopher D., and Schütze, Hinrich. 1999. Foundations of statistical natural language processing. MIT Press.

Manning, Christopher D., Raghavan, Prabhakar, and Schütze, Hinrich. 2008. An introduction to information retrieval. Cambridge: Cambridge University Press.

Monroe, Burt L., Colaresi, Michael P., and Quinn, Kevin M. 2008. Fightin’ words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis 16:372–403.

Moore, Ryan, Powell, Elinor, and Reeves, Andrew. 2013. Driving support: Workers, PACs, and congressional support of the auto industry. Business and Politics 15(2):137–162.

Pang, Bo, Lee, Lillian, and Vaithyanathan, Shivakumar. 2002. Thumbs up? Sentiment classification using machine learning techniques. Proceedings of the conference on empirical methods in natural language processing (EMNLP), pp. 79–86.

Porter, M. F. 1980. An algorithm for suffix stripping. Program 14(3):130–137.

Proksch, Sven-Oliver, and Slapin, Jonathan B. 2010. Position taking in European Parliament speeches. British Journal of Political Science 40(03):587–611.

Pugh, Martin. 2011. Speak for Britain!: A new history of the Labour Party. New York: Random House.

Quinn, Kevin M., Monroe, Burt L., Colaresi, Michael, Crespin, Michael H., and Radev, Dragomir R. 2010. How to analyze political attention with minimal assumptions and costs. American Journal of Political Science 54(1):209–228.

Roberts, Margaret E., Stewart, Brandon M., Tingley, Dustin, Lucas, Christopher, Leder-Luis, Jetson, Gadarian, Shana Kushner, Albertson, Bethany, and Rand, David G. 2014. Structural topic models for open-ended survey responses. American Journal of Political Science 58(4):1064–1082.

Sebastiani, Fabrizio. 2002. Machine learning in automated text categorization. ACM Computing Surveys 34(1):1–47.

Slapin, Jonathan B., and Proksch, Sven-Oliver. 2008. A scaling model for estimating time-series party positions from texts. American Journal of Political Science 52:705–722.

Spirling, Arthur. 2012. U.S. treaty making with American Indians: Institutional change and relative power, 1784–1911. American Journal of Political Science 56(1):84–97.

Steegen, Sara, Tuerlinckx, Francis, Gelman, Andrew, and Vanpaemel, Wolf. 2016. Increasing transparency through a multiverse analysis. Perspectives on Psychological Science 11(5):702–712.

Wallach, Hanna M., Murray, Iain, Salakhutdinov, Ruslan, and Mimno, David. 2009. Evaluation methods for topic models. Proceedings of the 26th Annual International Conference on Machine Learning - ICML ’09 (4):1–8. http://portal.acm.org/citation.cfm?doid=1553374.1553515.

Yano, Tae, Smith, Noah A., and Wilkerson, John D. 2012. Textual predictors of bill survival in congressional committees. Conference of the North American chapter of the association for computational linguistics, pp. 793–802.

Denny and Spirling supplementary material 1

Online Appendix

File 185 KB
