Multiple Imputation for Continuous and Categorical Data: Comparing Joint Multivariate Normal and Conditional Approaches

Jonathan Kropko; Ben Goodrich; Andrew Gelman; Jennifer Hill

doi:10.1093/pan/mpu007

Published online by Cambridge University Press: 04 January 2017

Jonathan Kropko*: Woodrow Wilson Department of Politics, University of Virginia, 1540 Jefferson Park Avenue, Charlottesville, VA 22903
Ben Goodrich: Department of Political Science, Columbia University, 420 W. 118th St., Mail Code 3320, New York, NY 10027
Andrew Gelman: Departments of Statistics and Political Science, Columbia University, 1255 Amsterdam Avenue, Room 1016, New York, NY 10027
Jennifer Hill: Department of Humanities and Social Sciences, New York University Steinhardt, 246 Greene Street, Room 804, New York, NY 10003
*Corresponding author

Abstract

We consider the relative performance of two common approaches to multiple imputation (MI): joint multivariate normal (MVN) MI, in which the data are modeled as a sample from a joint MVN distribution; and conditional MI, in which each variable is modeled conditionally on all the others. In order to use the multivariate normal distribution, implementations of joint MVN MI typically assume that categories of discrete variables are probabilistically constructed from continuous values. We use simulations to examine the implications of these assumptions. For each approach, we assess (1) the accuracy of the imputed values; and (2) the accuracy of coefficients and fitted values from a model fit to completed data sets. These simulations consider continuous, binary, ordinal, and unordered-categorical variables. One set of simulations uses multivariate normal data, and one set uses data from the 2008 American National Election Studies. We implement a less restrictive approach than is typical when evaluating methods using simulations in the missing data literature: in each case, missing values are generated by carefully following the conditions necessary for missingness to be “missing at random” (MAR). We find that in these situations conditional MI is more accurate than joint MVN MI whenever the data include categorical variables.
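The MAR condition mentioned above can be illustrated with a minimal sketch. This is not the authors' simulation procedure: the function name `make_mar`, the logistic link, and the `strength` parameter are assumptions chosen for illustration. The idea shown is only that the probability of a value being missing depends on a variable that remains fully observed, never on the value that goes missing.

```python
import numpy as np


def make_mar(data, target_col, driver_col, rng, strength=1.0):
    """Return a copy of `data` with MAR missingness in `target_col`.

    Hypothetical helper: the chance that `target_col` is missing depends
    only on `driver_col`, which stays fully observed, so the missingness
    is MAR by construction.
    """
    out = data.astype(float).copy()
    driver = out[:, driver_col]
    z = (driver - driver.mean()) / driver.std()       # standardize the driver
    p_missing = 1.0 / (1.0 + np.exp(-strength * z))   # logistic link (assumed)
    mask = rng.random(len(out)) < p_missing           # True where value is deleted
    out[mask, target_col] = np.nan
    return out, mask


rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.6], [0.6, 1.0]])
complete = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=2000)
incomplete, mask = make_mar(complete, target_col=1, driver_col=0, rng=rng)
```

Because the retained `complete` array holds the true values, imputations produced by any method can afterwards be compared against the entries that were deleted.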

Type: Research Article
Information: Political Analysis, Volume 22, Issue 4, Autumn 2014, pp. 497–519

DOI: https://doi.org/10.1093/pan/mpu007
Copyright: Copyright © The Author 2014. Published by Oxford University Press on behalf of the Society for Political Methodology

Footnotes

Authors' note: An earlier version of this study was presented at the Annual Meeting of the Society for Political Methodology, Chapel Hill, NC, July 20, 2012. Replication code and data are available on the Political Analysis Dataverse, and the full citation to the replication material is included in the references. We thank Yu-sung Su, Yajuan Si, Sonia Torodova, Jingchen Liu, Michael Malecki, and two anonymous reviewers for their comments.

References

American National Election Studies (ANES; www.electionstudies.org). The ANES 2008 Time Series Study [data set]. Stanford University and the University of Michigan [producers].

Bernaards, Coen A., Belin, Thomas R., and Schafer, Joseph L. 2007. Robustness of a multivariate normal approximation for imputation of incomplete binary data. Statistics in Medicine 26(6): 1368–82.

Cranmer, Skyler J., and Gill, Jeff. 2013. We have to be discrete about this: A non-parametric imputation technique for missing categorical data. British Journal of Political Science 43(2): 425–49.

Cribari-Neto, Francisco, and Zeileis, Achim. 2010. Beta regression in R. Journal of Statistical Software 34(2): 1–24.

Demirtas, Hakan. 2010. A distance-based rounding strategy for post-imputation ordinal data. Journal of Applied Statistics 37(3): 489–500.

Dempster, Arthur P., Laird, Nan, and Rubin, Donald B. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological) 39(1): 1–38.

Gelman, Andrew, Jakulin, Aleks, Pittau, Maria Grazia, and Su, Yu-Sung. 2008. A weakly informative default prior distribution for logistic and other regression models. Annals of Applied Statistics 2(4): 1360–83.

Gelman, Andrew, Su, Yu-Sung, Yajima, Masanao, Hill, Jennifer, Pittau, Maria Grazia, Kerman, Jouni, and Zheng, Tian. 2012. arm: Data analysis using regression and multilevel/hierarchical models. R package version 1.5-05. http://CRAN.R-project.org/package=arm.

Goodrich, Ben, Kropko, Jonathan, Gelman, Andrew, and Hill, Jennifer. 2012. mi: Iterative multiple imputation from conditional distributions. R package version 2.15.1.

Greenland, Sander, and Finkle, William D. 1995. A critical look at methods for handling missing covariates in epidemiologic regression analyses. American Journal of Epidemiology 142(12): 1255–64.

Honaker, James, and King, Gary. 2010. What to do about missing values in time-series cross-section data. American Journal of Political Science 54(2): 561–81.

Honaker, James, King, Gary, and Blackwell, Matthew. 2011. Amelia II: A program for missing data. Journal of Statistical Software 45(7): 1–47.

Honaker, James, King, Gary, and Blackwell, Matthew. 2012. Amelia II: A program for missing data. Software documentation, version 1.6.2. http://r.iq.harvard.edu/docs/amelia/amelia.pdf.

Horton, Nicholas J., Lipsitz, Stuart R., and Parzen, Michael. 2003. A potential for bias when rounding in multiple imputation. American Statistician 57(4): 229–32.

Kropko, Jonathan, Goodrich, Ben, Gelman, Andrew, and Hill, Jennifer. 2014. Replication data for: Multiple imputation for continuous and categorical data: Comparing joint multivariate normal and conditional approaches. http://dx.doi.org/10.7910/DVN/24672. UNF:5:QuxE8nFhbW2JZT+OW9WzWw==. IQSS Dataverse Network [Distributor], V1 [Version].

Lee, Katherine J., and Carlin, John B. 2010. Multiple imputation for missing data: Fully conditional specification versus multivariate normal imputation. American Journal of Epidemiology 171(5): 624–32.

Lewandowski, Daniel, Kurowicka, Dorota, and Joe, Harry. 2010. Generating random correlation matrices based on vines and extended onion method. Journal of Multivariate Analysis 100(9): 1989–2001.

Li, Fan, Yu, Yaming, and Rubin, Donald B. 2012. Imputing missing data by fully conditional models: Some cautionary examples and guidelines. Working paper. ftp.stat.duke.edu/WorkingPapers/11-24.pdf. Accessed 7 December 2012.

Royston, Patrick. 2005. Multiple imputation of missing values: Update. Stata Journal 5(2): 188–201.

Royston, Patrick. 2007. Multiple imputation of missing values: Further update of ice, with an emphasis on interval censoring. Stata Journal 7(4): 445–74.

Royston, Patrick. 2009. Multiple imputation of missing values: Further update of ice, with an emphasis on categorical variables. Stata Journal 9(3): 466–77.

Rubin, Donald B. 1978. Multiple imputations in sample surveys. Proceedings of the Survey Research Methods Section of the American Statistical Association.

Rubin, Donald B. 1986. Statistical matching using file concatenation with adjusted weights and multiple imputations. Journal of Business and Economic Statistics 4(1): 87–94.

Rubin, Donald B. 1987. Multiple imputation for nonresponse in surveys. New York: John Wiley and Sons.

Rubin, Donald B., and Little, Roderick J. A. 2002. Statistical analysis with missing data. 2nd ed. New York: John Wiley and Sons.

Schafer, Joseph L. 1997. Analysis of incomplete multivariate data. London: Chapman & Hall.

Schafer, Joseph L., and Olsen, Maren K. 1998. Multiple imputation for multivariate missing-data problems: A data analyst's perspective. Multivariate Behavioral Research 33(4): 545–71.

StataCorp. 2013. Stata 13 base reference manual. College Station, TX: Stata Press.

Su, Yu-Sung, Gelman, Andrew, Hill, Jennifer, and Yajima, Masanao. 2011. Multiple imputation with diagnostics (mi) in R: Opening windows into the black box. Journal of Statistical Software 45(2): 1–31.

Therneau, Terry. 2012. survival: A package for survival analysis in S. R package version 2.36-14.

van Buuren, Stef. 2007. Multiple imputation of discrete and continuous data by fully conditional specification. Statistical Methods in Medical Research 16(3): 219–42.

van Buuren, Stef. 2012. Flexible imputation of missing data. Boca Raton, FL: Chapman & Hall/CRC.

van Buuren, Stef, Boshuizen, Hendriek C., and Knook, D. L. 1999. Multiple imputation of missing blood pressure covariates in survival analysis. Statistics in Medicine 18(6): 681–94.

van Buuren, Stef, and Groothuis-Oudshoorn, Karin. 2011. mice: Multivariate imputation by chained equations in R. Journal of Statistical Software 45(3): 1–67.

Venables, William N., and Ripley, Brian D. 2002. Modern applied statistics with S. 4th ed. New York: Springer.

Yu, L-M, Burton, Andrea, and Rivero-Arias, Oliver. 2007. Evaluation of software for multiple imputation of semi-continuous data. Statistical Methods in Medical Research 16(3): 243–58.

Yuan, Yang C. 2013. Multiple imputation for missing data: Concepts and new development (Version 9.0). SAS Software Technical Papers.
