Comparing Random Forest with Logistic Regression for Predicting Class-Imbalanced Civil War Onset Data

Published online by Cambridge University Press: 04 January 2017

David Muchlinski*
Affiliation:
School of Social and Political Science, University of Glasgow, Glasgow, UK
David Siroky
Affiliation:
Department of Political Science, Arizona State University, Tempe, AZ
Jingrui He
Affiliation:
Department of Computer Science and Engineering, Arizona State University, Tempe, AZ
Matthew Kocher
Affiliation:
Department of Political Science, Yale University, New Haven, CT
* Corresponding author

Abstract

The most commonly used statistical models of civil war onset fail to correctly predict most occurrences of this rare event in out-of-sample data. Statistical methods for the analysis of binary data, such as logistic regression, even in their rare event and regularized forms, perform poorly at prediction. We compare the performance of Random Forests with three versions of logistic regression (classic logistic regression, Firth rare events logistic regression, and L1-regularized logistic regression), and find that the algorithmic approach provides significantly more accurate predictions of civil war onset in out-of-sample data than any of the logistic regression models. The article discusses these results and the ways in which algorithmic statistical methods like Random Forests can be useful to more accurately predict rare events in conflict data.
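
The abstract's point that models can fail to predict most occurrences of a rare event is easier to see with a small worked example. The sketch below is illustrative only: the data are hypothetical (1,000 country-years with 10 onsets), the function names are ours, and nothing here reproduces the article's models or results. It shows why overall accuracy says little on class-imbalanced data, and why ranking quality (AUC) and recall at a chosen cutoff are more informative.

```python
def accuracy(y_true, y_pred):
    """Share of all cases classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred):
    """Share of actual onsets (label 1) that were predicted as onsets."""
    positives = sum(y_true)
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return hits / positives if positives else 0.0


def auc(y_true, scores):
    """Chance that a random onset is scored above a random non-onset.

    Ties count as half a win, so an uninformative score gives 0.5.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical sample: 1,000 country-years, 10 of them onsets (1%).
y_true = [1] * 10 + [0] * 990

# A "model" that never predicts an onset.
never_onset = [0] * len(y_true)
acc_never = accuracy(y_true, never_onset)
rec_never = recall(y_true, never_onset)

# Hypothetical scores that rank every onset above every non-onset,
# yet stay below the conventional 0.5 cutoff.
scores = [0.30] * 10 + [0.05] * 990
auc_scores = auc(y_true, scores)
rec_at_half = recall(y_true, [int(s >= 0.5) for s in scores])
rec_at_tenth = recall(y_true, [int(s >= 0.1) for s in scores])

print(acc_never, rec_never, auc_scores, rec_at_half, rec_at_tenth)
```

In this toy setting the never-onset rule is 99% accurate while catching no onsets, and a perfectly ranked score still misses every onset at a 0.5 cutoff. This is why comparisons on rare-event data are usually made on out-of-sample ranking and recall rather than raw accuracy.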

Type
Articles
Copyright
Copyright © The Author 2015. Published by Oxford University Press on behalf of the Society for Political Methodology 

Footnotes

Author's note: Replication data are available on the Political Analysis Dataverse at http://dx.doi.org/10.7910/DVN/KRKWK8.
