We Have to Be Discrete About This: A Non-Parametric Imputation Technique for Missing Categorical Data

Skyler J. Cranmer; Jeff Gill

doi:10.1017/S0007123412000312

Published online by Cambridge University Press: 19 July 2012

Abstract

Missing values are a frequent problem in empirical political science research. Surprisingly, the match between the measurement of the missing values and the correcting algorithms applied is seldom studied. While multiple imputation is a vast improvement over the deletion of cases with missing values, it is often unsuitable for imputing highly non-granular discrete data. We develop a simple technique for imputing missing values in such situations, which is a variant of hot deck imputation, drawing from the conditional distribution of the variable with missing values to preserve the discrete measure of the variable. This method is tested against existing techniques using Monte Carlo analysis and then applied to real data on democratization and modernization theory. Software for our imputation technique is provided in a free, easy-to-use package for the R statistical environment.
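The idea of hot deck imputation for a discrete variable can be illustrated with a minimal sketch. This is not the authors' algorithm or their R package; the function name `hot_deck_impute`, the simple match-count affinity score, and the use of `None` for missing values are all assumptions made for illustration. What it shows is the property the abstract emphasizes: every imputed value is drawn from values actually observed among similar respondents, so the discrete measurement of the variable is preserved.

```python
import random


def hot_deck_impute(rows, target, n_imputations=5, seed=0):
    """Return n_imputations completed copies of rows (hypothetical helper).

    rows: list of dicts with identical keys; a missing value is None.
    target: name of the discrete variable to impute.
    Assumption: donors are ranked by a simple affinity score, the number
    of other variables on which donor and recipient agree.
    """
    rng = random.Random(seed)
    donors = [r for r in rows if r[target] is not None]
    if not donors:
        raise ValueError("no observed values to draw from")
    covariates = [k for k in rows[0] if k != target]

    def affinity(recipient, donor):
        # Count covariates observed for the recipient that match the donor.
        return sum(
            1 for k in covariates
            if recipient[k] is not None and recipient[k] == donor[k]
        )

    completed = []
    for _ in range(n_imputations):
        data = [dict(r) for r in rows]  # copy so the input is not modified
        for r in data:
            if r[target] is None:
                best = max(affinity(r, d) for d in donors)
                pool = [d[target] for d in donors if affinity(r, d) == best]
                # Draw from the closest donors: always an observed category.
                r[target] = rng.choice(pool)
        completed.append(data)
    return completed
```

Because each draw comes from the pool of best-matching donors, repeated imputations vary only when those donors disagree, and that variation is what a multiple imputation analysis would then combine across the completed datasets.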

Type: Articles
Information: British Journal of Political Science, Volume 43, Issue 2, April 2013, pp. 425–449

DOI: https://doi.org/10.1017/S0007123412000312
Copyright: Copyright © Cambridge University Press 2012

Footnotes

Department of Political Science, University of North Carolina; and Department of Political Science, Washington University (email: [email protected]), respectively. The authors wish to thank Micah Altman, James Fowler, Katie Gan, Adam Glynn, Justin Grimmer, Dominik Hangartner, Michael Kellerman, Gary King, Ryan Moore and Randolph Siverson for valuable comments. Replication data is available at http://www.unc.edu/~skylerc/.

References

¹ The term ‘missing data’ can mean either missing values (e.g. item non-response in a survey) or missing observations such as refusal to take an entire survey. Throughout this work, we use the term exclusively to mean the first case.

² Taagepera, Rein and Shugart, Matthew Soberg, Seats and Votes: The Effects and Determinants of Electoral Systems (New Haven, Conn.: Yale University Press, 1989).

³ Peter Mair and Ingrid van Biezen, ‘Party Membership in Twenty European Democracies, 1980–2000’, Party Politics, 7 (2001), 5–21.

⁴ Palmer, Harvey D. and Whitten, Guy D., ‘The Electoral Impact of Unexpected Inflation and Economic Growth’, British Journal of Political Science, 29 (1999), 623–639.

⁵ Reiter, Dan, ‘Does Peace Nurture Democracy?’ Journal of Politics, 63 (2001), 935–948.

⁶ Tsiatis, Anastasios A., Semiparametric Theory and Missing Data (New York: Springer, 2010).

Enders, Craig K., Applied Missing Data Analysis (New York: The Guilford Press, 2010).

Tan, Ming T., Tian, Guo-Liang and Ng, Kai Wang, Bayesian Missing Data Problems: EM, Data Augmentation and Noniterative Computation (New York: Chapman & Hall/CRC, 2009).

Molenberghs, Geert and Kenward, Michael G., Missing Data in Clinical Studies (New York: Wiley, 2007).

McKnight, Patrick E., McKnight, Katherine M., Sidani, Souraya and Figueredo, Aurelio Jose, Missing Data: A Gentle Approach (New York: The Guilford Press, 2007).

⁷ Rees, Phil H. and Duke-Williams, Oliver, ‘Methods for Estimating Missing Data on Migrants in the 1991 British Census’, International Journal of Population Geography, 3 (1997), 323–368.

⁸ Rees and Duke-Williams, ‘Methods for Estimating Missing Data on Migrants in the 1991 British Census’.

⁹ Roderick J. A. Little and Donald B. Rubin, Statistical Analysis with Missing Data, 2nd edn (New York: Wiley, 2002), p. 42.

¹⁰ Allison, Paul D., Missing Data (Thousand Oaks, Calif.: Sage, 2001).

Little, Roderick J. A., ‘Regression with Missing X's: A Review’, Journal of the American Statistical Association, 87 (1992), 1227–1237.

Little, Roderick J. A., ‘Approximately Calibrated Small Sample Inference about Means from Bivariate Normal Data with Missing Values’, Computational Statistics & Data Analysis, 7 (1988), 161–178.

Rubin, Donald B., ‘Inference and Missing Data (with Discussion)’, Biometrika, 63 (1976), 581–592.

King, Gary, Honaker, James, Joseph, Anne and Scheve, Kenneth, ‘Analyzing Incomplete Political Science Data: An Alternative Algorithm for Multiple Imputation’, American Political Science Review, 95 (2001), 49–69.

¹¹ Honaker, James and King, Gary, ‘What to Do about Missing Values in Time-Series Cross-Section Data’, American Journal of Political Science, 54 (2010), 561–581.

¹² Rubin, ‘Inference and Missing Data’; King, Honaker, Joseph and Scheve, ‘Analyzing Incomplete Political Science Data’; Little and Rubin, Statistical Analysis with Missing Data.

¹³ Little and Rubin, Statistical Analysis with Missing Data, p. 12.

¹⁴ King, Honaker, Joseph and Scheve, ‘Analyzing Incomplete Political Science Data’.

¹⁵ Gelman, Andrew and Hill, Jennifer, Data Analysis Using Regression and Multilevel/Hierarchical Models (New York: Cambridge University Press, 2007).

¹⁶ Bailar, John C. III and Bailar, Barbara A., ‘Comparison of the Biases of the “Hot Deck” Imputation Procedure with an “Equal Weights” Imputation Procedure’, Symposium on Incomplete Data: Panel on Incomplete Data of the Committee on National Statistics, National Research Council (1997), 422–47.

Cox, Brenda G., ‘The Weighted Sequential Hot Deck Imputation Procedure’, Proceedings of the Section on Survey Research Methods, American Statistical Association (1980), 721–6.

Rockwell, Richard C., ‘An Investigation of Imputation and Differential Quality of Data in the 1970 Census’, Journal of the American Statistical Association, 70 (1975), 39–42.

¹⁷ Rubin, Donald B., Multiple Imputation for Nonresponse in Surveys (New York: Wiley, 2004).

¹⁸ Rubin, ‘Inference and Missing Data’.

¹⁹ Rubin, Donald B., ‘Formalizing Subjective Notions about the Effect of Nonrespondents in Sample Surveys’, Journal of the American Statistical Association, 72 (1977), 538–543.

Rubin, Donald B., ‘Multiple Imputations in Sample Surveys: A Phenomenological Bayesian Approach to Nonresponse’, Proceedings of the Survey Research Methods Section of the American Statistical Association (1978), 20–34.

Rubin, Donald B. and Schenker, Nathaniel, ‘Multiple Imputation for Interval Estimation from Simple Random Samples with Ignorable Nonresponse’, Journal of the American Statistical Association, 81 (1986), 366–374.

Rubin, Donald B., ‘Statistical Matching Using File Concatenation with Adjusted Weights and Multiple Imputations’, Journal of Business and Economic Statistics, 4 (1986), 87–94.

Rubin, Donald B., Schafer, J. L. and Schenker, Nathaniel, ‘Imputation Strategies for Missing Values in Post-Enumeration Surveys’, Survey Methodology, 14 (1988), 209–221.

Rubin, Donald B., ‘Multiple Imputation after 18+ Years’, Journal of the American Statistical Association, 91 (1996), 473–489.

²⁰ The combined $\bar{\theta}_{\boldsymbol{M}}$ is in fact an average, but the treatment of the variability of this estimate is slightly more complicated than an average since it needs to account for within imputation variation and between imputation variation. The subject of multiple estimate combination will be discussed in some detail below. See Little and Rubin, Statistical Analysis with Missing Data, for a more detailed treatment.

²¹ Kim, Jae Kwang, ‘Finite Sample Properties of Multiple Imputation Estimators’, Annals of Statistics, 32 (2004), 766–783.

Kim, Jae Kwang and Fuller, Wayne, ‘Fractional Hot Deck Imputation’, Biometrika, 91 (2004), 559–578.

Fuller, Wayne and Kim, Jae Kwang, ‘Hot Deck Imputation for the Response Model’, Statistics Canada, 31 (2005), 139–149.

²² Schafer, Joseph L., Analysis of Incomplete Multivariate Data (New York: Chapman & Hall/CRC, 1997).

²³ King, Honaker, Joseph and Scheve, ‘Analyzing Incomplete Political Science Data’; Honaker and King, ‘What to Do about Missing Values in Time-Series Cross-Section Data’.

²⁴ The articles describing the Amelia procedure have received over 330 ISI citations as of this writing.

²⁵ Reilly, Marie, ‘Data Analysis Using Hot Deck Multiple Imputation’, The Statistician, 42 (1993), 307–313.

²⁶ Kalton, Graham and Kish, Leslie, ‘Some Efficient Random Imputation Methods’, Communications in Statistics – Theory and Methods, 13 (1984), 1919–1939.

Fay, Robert E., ‘Alternative Paradigms for the Analysis of Imputed Survey Data’, Journal of the American Statistical Association, 91 (1996), 490–498.

²⁷ Reilly, ‘Data Analysis Using Hot Deck Multiple Imputation’.

²⁸ Reilly, ‘Data Analysis Using Hot Deck Multiple Imputation’.

²⁹ For linguistic parsimony, we generally use the term ‘respondent’ below, but these methods are immediately applicable to datasets where the rows reflect any other type of observation.

³⁰ Gower, J. C., ‘A General Coefficient of Similarity and Some of its Properties’, Biometrics, 27 (1971), 857–871.

³¹ Rosenbaum, Paul R. and Rubin, Donald B., ‘The Central Role of the Propensity Score in Observational Studies for Causal Effects’, Biometrika, 70 (1983), 41–55.

³² Kim, ‘Finite Sample Properties of Multiple Imputation Estimators’; Kim and Fuller, ‘Fractional Hot Deck Imputation’; Fuller and Kim, ‘Hot Deck Imputation for the Response Model’.

³³ Kim, ‘Finite Sample Properties of Multiple Imputation Estimators’.

³⁴ Little and Rubin, Statistical Analysis with Missing Data; Rubin, ‘Multiple Imputations in Sample Surveys’; Rubin, Multiple Imputation for Nonresponse in Surveys; Rubin, ‘Multiple Imputation after 18+ Years’.

³⁵ Little and Rubin, Statistical Analysis with Missing Data.

³⁶ Our software formats its output so that it can be used seamlessly with the R package Zelig; Kosuke Imai, Gary King and Olivia Lau, ‘Zelig: Everyone's Statistical Software’, Comprehensive R Archive Network (2006). This has the advantage of allowing the user to run, in a single line of code, a great variety of models on the multiply imputed datasets and have the combination handled automatically.

³⁷ King, Honaker, Joseph and Scheve, ‘Analyzing Incomplete Political Science Data’; Honaker and King, ‘What to Do about Missing Values in Time-Series Cross-Section Data’.

³⁸ Stef van Buuren, Jaap P. L. Brand, C. G. M. Groothuis-Oudshoorn and Donald B. Rubin, ‘Fully Conditional Specification in Multivariate Imputation’, Journal of Statistical Computation and Simulation, 76 (2006), 1049–1064.

Stef van Buuren, ‘Multiple Imputation of Discrete and Continuous Data by Fully Conditional Specification’, Statistical Methods in Medical Research, 16 (2007), 219–242.

³⁹ Dempster, A. P., Laird, N. M. and Rubin, D. B., ‘Maximum Likelihood from Incomplete Data via the EM Algorithm’, Journal of the Royal Statistical Society, Series B, 39 (1977), 1–38.

⁴⁰ We also ran experiments where the missing values were MCAR, but, as we would expect theoretically, no method was biased under those conditions.

⁴¹ Lipset, Seymour M., ‘Some Social Requisites of Democracy: Economic Development and Political Legitimacy’, American Political Science Review, 53 (1959), 69–105.

⁴² Cutright, Phillips, ‘National Political Development: Its Measurement and Social Correlates’, in Nelson W. Polsby, Robert A. Dentler and Paul A. Smith, eds, Politics and Social Life: An Introduction to Political Behavior (Boston, Mass.: Houghton Mifflin, 1963), 569–581.

Deutsch, Karl W., ‘Social Mobilization and Political Development’, American Political Science Review, 55 (1961), 493–510.

Dahl, Robert A., Polyarchy: Participation and Opposition (New Haven, Conn.: Yale University Press, 1971).

Burkhart, Ross E. and Lewis-Beck, Michael S., ‘The Economic Development Thesis’, American Political Science Review, 88 (1994), 903–910.

Londregan, John B. and Poole, Keith T., ‘Does High Income Promote Democracy?’ World Politics, 49 (1996), 1–30.

⁴³ Przeworski, Adam, Democracy and the Market: Political and Economic Reforms in Eastern Europe (New York: Cambridge University Press, 1991).

Przeworski, Adam and Limongi, Fernando, ‘Political Regimes and Economic Growth’, Journal of Economic Perspectives, 7 (1993), 51–69.

Przeworski, Adam, Alvarez, Michael E., Cheibub, Jose A. and Limongi, Fernando, ‘What Makes Democracies Endure?’ Journal of Democracy, 7 (1996), 39–55.

Przeworski, Adam and Limongi, Fernando, ‘Modernization: Theories and Facts’, World Politics, 49 (1997), 155–183.

Przeworski, Adam, Alvarez, Michael E., Cheibub, Jose A. and Limongi, Fernando, Democracy and Development: Political Institutions and Well-Being in the World, 1950–1990 (New York: Cambridge University Press, 2000).

⁴⁴ Przeworski, Alvarez, Cheibub and Limongi, Democracy and Development.

⁴⁵ Boix, Carles, Democracy and Redistribution (New York: Cambridge University Press, 2002).

Boix, Carles and Stokes, Susan, ‘Endogenous Democratization’, World Politics, 55 (2003), 517–549.

Epstein, David L., Bates, Robert, Goldstone, Jack, Kristensen, Ida and O'Halloran, Sharyn, ‘Democratic Transitions’, American Journal of Political Science, 50 (2006), 551–569.

⁴⁶ Przeworski, Alvarez, Cheibub and Limongi, Democracy and Development.

⁴⁷ Przeworski, Alvarez, Cheibub and Limongi, Democracy and Development.

⁴⁸ Przeworski, Alvarez, Cheibub and Limongi, Democracy and Development.

⁴⁹ The true results are true to the extent that they are the results actually obtained by analysing the complete data. They are not true in the more traditional sense of being the true population parameters an empirical analysis attempts to estimate.

⁵⁰ Przeworski, Alvarez, Cheibub and Limongi, Democracy and Development.

⁵¹ Przeworski, Alvarez, Cheibub and Limongi, Democracy and Development.

⁵² Imai, King and Lau, ‘Zelig’.
