Post-selection Inference in Multiverse Analysis (PIMA): An Inferential Framework Based on the Sign Flipping Score Test

Paolo Girardi; Anna Vesely; Daniël Lakens; Gianmarco Altoè; Massimiliano Pastore; Antonio Calcagnì; Livio Finos

doi:10.1007/s11336-024-09973-6

Post-selection Inference in Multiverse Analysis (PIMA): An Inferential Framework Based on the Sign Flipping Score Test

Published online by Cambridge University Press: 27 December 2024

and

Paolo Girardi*: Affiliation:
Ca’ Foscari University of Venice
Anna Vesely: Affiliation:
University of Bologna
Daniël Lakens: Affiliation:
Eindhoven University of Technology
Gianmarco Altoè: Affiliation:
University of Padova
Massimiliano Pastore: Affiliation:
University of Padova
Antonio Calcagnì: Affiliation:
University of Padova, GNCS-INdAM Research Group
Livio Finos: Affiliation:
University of Padova
*: Correspondence should bemade to Paolo Girardi, Department of Environmental Sciences, Informatics and Statistics, Ca’ Foscari University of Venice, Via Torino 155, 30172 Venezia-Mestre, VE, Italy. Email: [email protected]

Article contents

Abstract
References

Get access

Rights & Permissions

Abstract

When analyzing data, researchers make some choices that are either arbitrary, based on subjective beliefs about the data-generating process, or for which equally justifiable alternative choices could have been made. This wide range of data-analytic choices can be abused and has been one of the underlying causes of the replication crisis in several fields. Recently, the introduction of multiverse analysis provides researchers with a method to evaluate the stability of the results across reasonable choices that could be made when analyzing data. Multiverse analysis is confined to a descriptive role, lacking a proper and comprehensive inferential procedure. Recently, specification curve analysis adds an inferential procedure to multiverse analysis, but this approach is limited to simple cases related to the linear model, and only allows researchers to infer whether at least one specification rejects the null hypothesis, but not which specifications should be selected. In this paper, we present a Post-selection Inference approach to Multiverse Analysis (PIMA) which is a flexible and general inferential approach that considers for all possible models, i.e., the multiverse of reasonable analyses. The approach allows for a wide range of data specifications (i.e., preprocessing) and any generalized linear model; it allows testing the null hypothesis that a given predictor is not associated with the outcome, by combining information from all reasonable models of multiverse analysis, and provides strong control of the family-wise error rate allowing researchers to claim that the null hypothesis can be rejected for any specification that shows a significant effect. The inferential proposal is based on a conditional resampling procedure. We formally prove that the Type I error rate is controlled, and compute the statistical power of the test through a simulation study. Finally, we apply the PIMA procedure to the analysis of a real dataset on the self-reported hesitancy for the COronaVIrus Disease 2019 (COVID-19) vaccine before and after the 2020 lockdown in Italy. We conclude with practical recommendations to be considered when implementing the proposed procedure.

Keywords

multiverse analysis flipping score statistical inference testing reproducibility replicability

Type: Theory & Methods
Information: Psychometrika , Volume 89 , Issue 2 , June 2024 , pp. 542 - 568

DOI: https://doi.org/10.1007/s11336-024-09973-6 [Opens in a new window]
Copyright: Copyright © 2024 The Author(s), under exclusive licence to The Psychometric Society

Access options

Get access to the full version of this content by using one of the access options below. (Log in options will check for institutional or personal access. Content may require purchase if you do not have access.)

Article purchase

Temporarily unavailable

References

Agresti, A.. (2015). Foundations of linear and generalized linear models, Wiley.Google Scholar

Begg, C. B., Berlin, J. A.. (1988). Publication bias: A problem in interpreting medical data. Journal of the Royal Statistical Society: Series A (Statistics in Society), 151(3), 419–445.CrossRef Google Scholar

Benjamini, Y. (2020). Selective inference: The silent killer of replicability. Harvard Data Science Review, 2(4). https://hdsr.mitpress.mit.edu/pub/l39rpgyc.Google Scholar

Berger, R. L.. (1982). Multiparameter hypothesis testing and acceptance sampling. Technometrics, 24(4), 295–300.CrossRef Google Scholar

Brodeur, A., Lé, M., Sangnier, M., Zylberberg, Y.. (2016). Star wars: The empirics strike back. American Economic Journal: Applied Economics, 8(1), 1–32.Google Scholar

Caserotti, M., Girardi, P., Rubaltelli, E., Tasso, A., Lotto, L., Gavaruzzi, T.. (2021). Associations of COVID-19 risk perception with vaccine hesitancy over time for Italian residents. Social Science & Medicine, 272.CrossRef Google Scholar PubMed

De Santis, R., Goeman, J. J., Hemerik, J., & Finos, L. (2022). Inference in generalized linear models with robustness to misspecified variances.Google Scholar

Dragicevic, P., Jansen, Y., Sarma, A., Kay, M., & Chevalier, F. (2019). Increasing the transparency of research papers with explorable multiverse analyses. In Proceedings of the 2019 CHI conference on human factors in computing systems (pp. 1–15).CrossRef Google Scholar

Dwan, K., Altman, D. G., Arnaiz, J. A., Bloom, J., Chan, A.-W., Cronin, E., Decullier, E., Easterbrook, P. J., Von Elm, E., Gamble, C. et al., Systematic review of the empirical evidence of study publication bias and outcome reporting bias. PLoS ONE, (2008) 3(8.CrossRef Google Scholar PubMed

Fanelli, D.. (2012). Negative results are disappearing from most disciplines and countries. Scientometrics, 90(3), 891–904.CrossRef Google Scholar

Finner, H.. (1999). Stepwise multiple test procedures and control of directional errors. The Annals of Statistics, 27(1), 274–289.CrossRef Google Scholar

Finos, L., Hemerik, J., & Goeman, J. J. (2023). jointest: Multivariate testing through joint sign-flip scores. R package version 1.2.0.Google Scholar

Fisher, R. A.. (1925). Statistical Methods for Research Workers. Edinburgh: Oliver and Boyd.Google Scholar

Flachaire, E.. (1999). A better way to bootstrap pairs. Economics Letters, 64(3), 257–262.CrossRef Google Scholar

Freedman, D. A.. (1981). Bootstrapping regression models. The Annals of Statistics, 9(6), 1218–1228.CrossRef Google Scholar

Frey, R., Richter, D., Schupp, J., Hertwig, R., Mata, R.. (2021). Identifying robust correlates of risk preference: A systematic approach using specification curve analysis. Journal of Personality and Social Psychology, 120(2), 538.CrossRef Google Scholar PubMed

Gelman, A., Loken, E.. (2014). The statistical crisis in science data-dependent analysis—a “garden of forking paths”—explains why many statistically significant comparisons don’t hold up. American Scientist, 102(6), 460.CrossRef Google Scholar

Genovese, C. R., Wasserman, L.. (2006). Exceedance control of the false discovery proportion. Journal of the American Statistical Association, 101(476), 1408–1417.CrossRef Google Scholar

Goeman, J. J., Hemerik, J., Solari, A.. (2021). Only closed testing procedures are admissible for controlling false discovery proportions. Annals of Statistics, 49(2), 1218–1238.CrossRef Google Scholar

Goeman, J. J., Solari, A.. (2011). Multiple testing for exploratory research. Statistical Science, 26(4), 584–597.CrossRef Google Scholar

Greenwald, A. G.. (1975). Consequences of prejudice against the null hypothesis. Psychological Bulletin, 82(1), 1.CrossRef Google Scholar

Harder, J. A.. (2020). The multiverse of methods: Extending the multiverse analysis to address data-collection decisions. Perspectives on Psychological Science, 15(5), 1158–1177.CrossRef Google Scholar PubMed

Hemerik, J., Goeman, J. J.. (2018). Exact testing with random permutations. TEST, 27, 811–825.CrossRef Google Scholar PubMed

Hemerik, J., Goeman, J. J., Finos, L.. (2020). Robust testing in generalized linear models by sign flipping score contributions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82(3), 841–864.CrossRef Google Scholar

Klau, S., Hoffmann, S., Patel, C. J., Ioannidis, J. P., Boulesteix, A.-L.. (2020). Examining the robustness of observational associations to model, measurement and sampling uncertainty with the vibration of effects framework. International Journal of Epidemiology, 50(1), 266–278.CrossRef Google Scholar

Liptak, T.. (1958). On the combination of independent tests. Magyar Tud. Akad. Mat. Kutató Int. Közl., 3, 1971–1977.Google Scholar

Liu, Y., Kale, A., Althoff, T., Heer, J.. (2020). Boba: Authoring and visualizing multiverse analyses. IEEE Transactions on Visualization and Computer Graphics, 27(2), 1753–1763.CrossRef Google Scholar

Marcus, R., Peritz, E., Gabriel, K. R.. (1976). On closed testing procedures with special reference to ordered analysis of variance. Biometrika, 63(3), 655–660.CrossRef Google Scholar

Mirman, J. H., Murray, A. L., Mirman, D., Adams, S. A.. (2021). Advancing our understanding of cognitive development and motor vehicle crash risk: A multiverse representation analysis. Cortex, 138, 90–100.CrossRef Google Scholar

Modecki, K. L., Low-Choy, S., Uink, B. N., Vernon, L., Correia, H., Andrews, K.. (2020). Tuning into the real effect of smartphone use on parenting: A multiverse analysis. Journal of Child Psychology and Psychiatry, 61(8), 855–865.CrossRef Google Scholar PubMed

Nosek, B. A., Lakens, D.. (2014). A method to increase the credibility of published results. Social Psychology, 45(3), 137–141.CrossRef Google Scholar

Open Science Collaboration. (2015). Estimating the reproducibility of psychological science, vol. 349. American Association for the Advancement of Science.CrossRef Google Scholar

Pesarin, F.. (2001). Multivariate Permutation Tests: With Applications in Biostatistics. New York: Wiley.Google Scholar

R Core Team. (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing.Google Scholar

Ramdas, A., Barber, R. F., Candès, E. J., & Tibshirani, R. J. (2023). Permutation tests using arbitrary permutation distributions. Sankhya A, 1–22.CrossRef Google Scholar

Rijnhart, J. J., Twisk, J. W., Deeg, D. J., & Heymans, M. W. (2021). Assessing the robustness of mediation analysis results using multiverse analysis. Prevention Science, 1–11.Google Scholar

Shaffer, J. P.. (1980). Control of directional errors with stagewise multiple test procedures. The Annals of Statistics, 8(6), 1342–1347.CrossRef Google Scholar

Simmons, J. P., Nelson, L. D., Simonsohn, U.. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366.CrossRef Google Scholar PubMed

Simonsohn, U., Simmons, J. P., Nelson, L. D.. (2020). Specification curve analysis. Nature Human Behaviour, 4(11), 1208–1214.CrossRef Google Scholar PubMed

Steegen, S., Tuerlinckx, F., Gelman, A., Vanpaemel, W.. (2016). Increasing transparency through a multiverse analysis. Perspectives on Psychological Science, 11(5), 702–712.CrossRef Google Scholar PubMed

Sterling, T. D.. (1959). Publication decisions and their possible effects on inferences drawn from tests of significance-or vice versa. Journal of the American Statistical Association, 54(285), 30–34.Google Scholar

van der Vaart, A. W.. (1998). Asymptotic Statistics, Cambridge University Press.CrossRef Google Scholar

Vesely, A., Goeman, J. J., & Finos, L. (2022). Resampling-based multisplit inference for high-dimensional regression. arXiv:2205.12563.Google Scholar

Vesely, A., Finos, L., Goeman, J. J.. (2023). Permutation-based true discovery guarantee by sum tests. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 85(3), 664–683.CrossRef Google Scholar

Wessel, I., Albers, C. J., Zandstra, A. R. E., Heininga, V. E.. (2020). A multiverse analysis of early attempts to replicate memory suppression with the think/no-think task. Memory, 28(7), 870–887.CrossRef Google Scholar PubMed

Westfall, P. H., Young, S. S.. (1993). Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustment. New York: Wiley.Google Scholar

Article contents

Post-selection Inference in Multiverse Analysis (PIMA): An Inferential Framework Based on the Sign Flipping Score Test

Abstract

Keywords

Access options

Article purchase

Temporarily unavailable

References

Save article to Kindle

Save article to Dropbox

Save article to Google Drive

Reply to: Submit a response

Your details

You have entered the maximum number of contributors

Conflicting interests