Statistically Valid Inferences from Differentially Private Data Releases, with Application to the Facebook URLs Dataset

Published online by Cambridge University Press:  20 April 2022

Georgina Evans
Affiliation:
Department of Government, Harvard University, Cambridge, MA 02138, USA. E-mail: [email protected], URL: https://Georgina-Evans.com
Gary King*
Affiliation:
Institute for Quantitative Social Science, Harvard University, Cambridge, MA 02138, USA. E-mail: [email protected], URL: https://GaryKing.org
*Corresponding author: Gary King

Abstract

We offer methods to analyze the “differentially private” Facebook URLs Dataset which, at over 40 trillion cell values, is one of the largest social science research datasets ever constructed. The version of differential privacy used in the URLs dataset has specially calibrated random noise added, which provides mathematical guarantees for the privacy of individual research subjects while still making it possible to learn about aggregate patterns of interest to social scientists. Unfortunately, random noise creates measurement error which induces statistical bias—including attenuation, exaggeration, switched signs, or incorrect uncertainty estimates. We adapt methods developed to correct for naturally occurring measurement error, with special attention to computational efficiency for large datasets. The result is statistically valid linear regression estimates and descriptive statistics that can be interpreted as ordinary analyses of nonconfidential data but with appropriately larger standard errors.
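The attenuation mentioned above, and the general idea of correcting for it when the noise variance is known, can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the estimator developed in the article: it uses a single covariate, assumes zero-mean Gaussian noise with a known standard deviation added to that covariate only, and applies the textbook method-of-moments correction for classical measurement error. The function name `simulate` and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np


def simulate(n=200_000, beta=2.0, noise_sd=1.0, seed=0):
    """Compare a naive slope with a moment-corrected slope under known noise.

    Assumptions (illustrative only): one covariate with unit variance,
    Gaussian privacy noise with known standard deviation `noise_sd`
    added to the covariate, and an outcome measured without privacy noise.
    """
    rng = np.random.default_rng(seed)

    # Confidential covariate and outcome (never seen by the analyst).
    x = rng.normal(0.0, 1.0, n)
    y = 1.0 + beta * x + rng.normal(0.0, 1.0, n)

    # Released covariate: confidential value plus calibrated random noise.
    x_noisy = x + rng.normal(0.0, noise_sd, n)

    cov_xy = np.cov(x_noisy, y)[0, 1]
    var_x_noisy = np.var(x_noisy, ddof=1)

    # Naive slope: treats the noisy covariate as if it were exact.
    naive = cov_xy / var_x_noisy

    # Corrected slope: subtract the known noise variance from the
    # observed variance before dividing.
    corrected = cov_xy / (var_x_noisy - noise_sd**2)
    return naive, corrected


if __name__ == "__main__":
    naive, corrected = simulate()
    print(f"naive slope:     {naive:.3f}")
    print(f"corrected slope: {corrected:.3f}")
```

With these settings the covariate and the noise both have variance 1, so the naive slope is expected to shrink toward roughly half of `beta`, while the corrected slope should land near `beta` itself. The corrected estimate is more variable than a regression on the confidential data would be, which matches the abstract's point that valid estimates come with appropriately larger standard errors.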

Type
Article
Copyright
© The Author(s) 2022. Published by Cambridge University Press on behalf of the Society for Political Methodology

Footnotes

Edited by Jeff Gill
