
Active Learning Approaches for Labeling Text: Review and Assessment of the Performance of Active Learning Approaches

Published online by Cambridge University Press: 21 April 2020

Blake Miller*
Affiliation:
Department of Methodology, London School of Economics and Political Science, Columbia House, Houghton Street, London WC2A 2AE, UK. Email: [email protected]
Fridolin Linder
Affiliation:
Department of Political Science, Social Media and Political Participation Lab, New York University, 431 19 West 4th Street, New York, NY 10012, USA. Email: [email protected]
Walter R. Mebane Jr.
Affiliation:
Professor, Department of Political Science and Department of Statistics, University of Michigan, Haven Hall, Ann Arbor, MI 48109-1045, USA. Email: [email protected]

Abstract

Supervised machine learning methods are increasingly employed in political science. Such models require costly manual labeling of documents. In this paper, we introduce active learning, a framework in which data to be labeled by human coders are not chosen at random but rather targeted in such a way that the required amount of data to train a machine learning model can be minimized. We study the benefits of active learning using text data examples. We perform simulation studies that illustrate conditions where active learning can reduce the cost of labeling text data. We perform these simulations on three corpora that vary in size, document length, and domain. We find that in cases where the document class of interest is not balanced, researchers can label a fraction of the documents one would need using random sampling (or “passive” learning) to achieve equally performing classifiers. We further investigate how varying levels of intercoder reliability affect the active learning procedures and find that even with low reliability, active learning performs more efficiently than does random sampling.
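To make the contrast between targeted and random selection concrete, the following is a minimal sketch of uncertainty sampling, one common active learning query strategy, set against a random ("passive") baseline. It is an illustration under stated assumptions, not the procedure used in the paper: the function names, the example scores, and the choice of 0.5 as the point of maximum uncertainty for a binary classifier are all hypothetical.

```python
import random


def uncertainty_sample(probs, k):
    """Return indices of the k documents whose predicted probability of the
    positive class is closest to 0.5, i.e. the least certain predictions.
    (Hypothetical helper for illustration only.)"""
    ranked = sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))
    return ranked[:k]


def random_sample(n, k, seed=0):
    """Passive baseline: choose k of n document indices uniformly at random."""
    rng = random.Random(seed)
    return rng.sample(range(n), k)


# Assumed classifier scores for six unlabeled documents (made-up values).
scores = [0.02, 0.48, 0.91, 0.55, 0.10, 0.70]

# Active learning sends the two most ambiguous documents to human coders.
to_label = uncertainty_sample(scores, 2)

# Passive learning picks two documents without looking at the scores.
baseline = random_sample(len(scores), 2)
```

In a full loop, the newly labeled documents would be added to the training set, the classifier refit, and the selection step repeated until the labeling budget is exhausted.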

Type
Articles
Copyright
Copyright © The Author(s) 2020. Published by Cambridge University Press on behalf of the Society for Political Methodology.

Footnotes

Contributing Editor: Jeff Gill

Supplementary material: File

Miller et al. supplementary material
