Lost in Space: Geolocation in Event Data

Sophie J. Lee; Howard Liu; Michael D. Ward

doi:10.1017/psrm.2018.23

Published online by Cambridge University Press: 06 July 2018

Abstract

Improving geolocation accuracy in text data has long been a goal of automated text processing. We depart from the conventional method and introduce a two-stage supervised machine-learning algorithm that classifies each location mention as either correct or incorrect. We extract contextual information from texts, i.e., n-gram patterns for location words, mention frequency, and the context of sentences containing location words. We then estimate model parameters using a training data set and use this model to predict whether a location word in the test data set accurately represents the location of an event. We demonstrate these steps by constructing customized geolocation event data at the subnational level using news articles collected from around the world. The results show that the proposed algorithm outperforms existing geocoders, even in a case added post hoc to test the generality of the developed algorithm.
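The idea of scoring each location mention from its context can be illustrated with a minimal Python sketch. It is not the authors' implementation: the feature names (`freq`, `ngram=`, `ctx=`), the helper functions, the toy sentences, and their labels are all hypothetical, a simple perceptron stands in for whatever classifiers the paper actually uses, and the two-stage structure is not reproduced. The sketch also assumes single-word location names.

```python
import re

def sentences(text):
    """Split text into lowercase token lists, one list per sentence."""
    parts = re.split(r"[.!?]", text)
    return [re.findall(r"[a-z]+", p.lower()) for p in parts if p.strip()]

def mention_features(text, location):
    """Sparse features for one single-word candidate location (hypothetical names)."""
    loc = location.lower()
    feats = {"bias": 1, "freq": 0}
    for tokens in sentences(text):
        for i, tok in enumerate(tokens):
            if tok != loc:
                continue
            feats["freq"] += 1                        # mention frequency
            left = " ".join(tokens[max(0, i - 2):i])  # up to two words before the mention
            feats["ngram=" + left + " LOC"] = 1       # n-gram pattern around the mention
            for other in tokens:                      # words sharing the sentence
                if other != loc:
                    feats["ctx=" + other] = 1
    return feats

def score(weights, feats):
    return sum(weights.get(f, 0) * v for f, v in feats.items())

def predict(weights, feats):
    """Return 1 if the mention is predicted to be the event location, else 0."""
    return 1 if score(weights, feats) > 0 else 0

def train_perceptron(examples, epochs=10):
    """Stand-in classifier: update weights only on misclassified examples."""
    weights = {}
    for _ in range(epochs):
        for feats, label in examples:
            if predict(weights, feats) != label:
                step = 1 if label == 1 else -1
                for f, v in feats.items():
                    weights[f] = weights.get(f, 0) + step * v
    return weights

# Invented toy articles; label 1 means the mention is where the event happened.
DOC1 = ("Police fired on protesters in Aleppo. Residents of Aleppo fled. "
        "Officials in Damascus said talks continue.")
DOC2 = "A bomb exploded in Kano. The ministry in Abuja said it will investigate."
TRAIN = [(DOC1, "Aleppo", 1), (DOC1, "Damascus", 0),
         (DOC2, "Kano", 1), (DOC2, "Abuja", 0)]

examples = [(mention_features(text, loc), y) for text, loc, y in TRAIN]
weights = train_perceptron(examples)

TEST_DOC = ("Gunmen fired on a market in Goma. "
            "Officials in Kinshasa said troops were sent.")
predictions = {loc: predict(weights, mention_features(TEST_DOC, loc))
               for loc in ("Goma", "Kinshasa")}
print(predictions)  # {'Goma': 1, 'Kinshasa': 0}
```

On this toy data the classifier learns that words such as "officials" and "said" tend to accompany a place where the event is being reported on, not the place where it occurred. A realistic pipeline would need a named-entity recognizer to find candidate locations and a far larger labeled training set.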

Type: Original Articles
Information: Political Science Research and Methods, Volume 7, Issue 4, October 2019, pp. 871-888

DOI: https://doi.org/10.1017/psrm.2018.23
Copyright: © The European Political Science Association 2018

Footnotes

Sophie J. Lee, Ph.D., Department of Political Science, Duke University, 140 Science Drive, Durham, North Carolina 27708, USA. Howard Liu, Ph.D. Candidate, Department of Political Science, Duke University, 140 Science Drive, Durham, North Carolina 27708, USA. Michael D. Ward, Professor of Political Science, Department of Political Science, Duke University, 140 Science Drive, Durham, North Carolina 27708, USA. The authors would like to thank John Beieler, Patrick Brandt, Andrew B. Hall, Andrew Halterman, Jan H. Pierskalla, and Philip A. Schrodt, as well as members of Wardlab for their insights and comments on this project. The editors and reviewers of this journal provided helpful suggestions. M.W. acknowledges support from National Science Foundation (NSF) Award 1259266. To view supplementary material for this article, please visit https://doi.org/10.1017/psrm.2018.23.

Lee et al. supplementary material

Online Appendix

Lee et al. Dataset

Dataset

https://doi.org/10.7910/DVN/U4Q0FR
