
Multiple hypothesis testing in experimental economics

Published online by Cambridge University Press: 14 March 2025

John A. List, Azeem M. Shaikh, and Yang Xu
Department of Economics, University of Chicago, 5757 S University Ave, Chicago, IL 60637, USA

Abstract

The analysis of data from experiments in economics routinely involves testing multiple null hypotheses simultaneously. These null hypotheses arise naturally in this setting for at least three reasons: when there are multiple outcomes of interest and it is desired to determine on which of these outcomes a treatment has an effect; when the effect of a treatment may be heterogeneous in that it varies across subgroups defined by observed characteristics and it is desired to determine for which of these subgroups a treatment has an effect; and when there are multiple treatments of interest and it is desired to determine which treatments have an effect relative to the control or to each of the other treatments. In this paper, we provide a bootstrap-based procedure for testing these null hypotheses simultaneously using experimental data in which simple random sampling is used to assign treatment status to units. Using the general results in Romano and Wolf (Ann Stat 38:598–633, 2010), we show under weak assumptions that our procedure (1) asymptotically controls the familywise error rate (the probability of one or more false rejections) and (2) is asymptotically balanced in that the marginal probability of rejecting any true null hypothesis is approximately equal in large samples. Importantly, by incorporating information about dependence that is ignored by classical multiple testing procedures, such as the Bonferroni and Holm corrections, our procedure has much greater ability to detect truly false null hypotheses. In the presence of multiple treatments, we additionally show how to exploit logical restrictions across null hypotheses to further improve power. We illustrate our methodology by revisiting the study by Karlan and List (Am Econ Rev 97(5):1774–1793, 2007) of why people give to charitable causes.
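
To fix ideas, the following is a minimal illustrative sketch in Python of the key idea behind bootstrap-based multiple testing in the spirit of Westfall and Young (1993) and Romano and Wolf (2005): the bootstrap distribution of the maximum test statistic preserves the dependence across hypotheses that the Bonferroni and Holm corrections discard. This is not the authors' implementation (their Stata and Matlab code is linked in the footnote below); it covers only a single-step max-t adjustment for multiple outcomes under simple random sampling, not the paper's full stepdown, balanced procedure, and all names (studentized_diffs, bootstrap_maxT_pvalues, n, S, B) are hypothetical.

import numpy as np

rng = np.random.default_rng(0)  # seed is arbitrary

def studentized_diffs(y_treat, y_control):
    # t-statistics for each of S outcomes: difference in means / standard error
    diff = y_treat.mean(axis=0) - y_control.mean(axis=0)
    se = np.sqrt(y_treat.var(axis=0, ddof=1) / len(y_treat)
                 + y_control.var(axis=0, ddof=1) / len(y_control))
    return diff / se

def bootstrap_maxT_pvalues(y_treat, y_control, B=2000):
    # Single-step max-t adjusted p-values: resample the data, recompute all
    # S statistics, and compare each observed |t| with the bootstrap
    # distribution of the maximum, which preserves cross-outcome dependence.
    t_obs = np.abs(studentized_diffs(y_treat, y_control))
    yt_c = y_treat - y_treat.mean(axis=0)      # center each sample so that
    yc_c = y_control - y_control.mean(axis=0)  # the bootstrap imposes the null
    max_t = np.empty(B)
    for b in range(B):
        rows_t = rng.integers(0, len(yt_c), len(yt_c))
        rows_c = rng.integers(0, len(yc_c), len(yc_c))
        max_t[b] = np.abs(studentized_diffs(yt_c[rows_t], yc_c[rows_c])).max()
    return np.array([(max_t >= t).mean() for t in t_obs])

# Hypothetical example: S = 10 strongly correlated outcomes, no true effects.
n, S = 200, 10
y_control = rng.normal(size=(n, 1)) + 0.3 * rng.normal(size=(n, S))
y_treat = rng.normal(size=(n, 1)) + 0.3 * rng.normal(size=(n, S))

print(np.round(bootstrap_maxT_pvalues(y_treat, y_control), 3))
# A Bonferroni correction would multiply each raw p-value by S = 10,
# ignoring the correlation across outcomes and losing power.

When outcomes are strongly correlated, the bootstrap max-t critical value is much smaller than the Bonferroni or Holm threshold while still controlling the familywise error rate, which is the source of the power gains described in the abstract.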

Type
Original Paper
Copyright
Copyright © 2019 Economic Science Association

Footnotes

Documentation of our procedures and our Stata and Matlab code can be found at https://github.com/seidelj/mht.

References

Anderson, M. (2008). Multiple inference and gender differences in the effects of early intervention: A re-evaluation of the Abecedarian, Perry Preschool, and Early Training Projects. Journal of the American Statistical Association, 103(484), 1481–1495. https://doi.org/10.1198/016214508000000841
Bettis, R. A. (2012). The search for asterisks: Compromised statistical tests and flawed theories. Strategic Management Journal, 33(1), 108–113. https://doi.org/10.1002/smj.975
Bhattacharya, J., Shaikh, A. M., & Vytlacil, E. (2012). Treatment effect bounds: An application to Swan-Ganz catheterization. Journal of Econometrics, 168(2), 223–243. https://doi.org/10.1016/j.jeconom.2012.01.001
Bonferroni, C. E. (1935). Il calcolo delle assicurazioni su gruppi di teste. Rome: Tipografia del Senato.
Bugni, F., Canay, I., & Shaikh, A. (2015). Inference under covariate-adaptive randomization. Cemmap working paper, Centre for Microdata Methods and Practice.
Camerer, C. F., Dreber, A., Forsell, E., Ho, T.-H., Huber, J., Johannesson, M., Kirchler, M., Almenberg, J., Altmejd, A., Chan, T., et al. (2016). Evaluating replicability of laboratory experiments in economics. Science, 351(6280), 1433–1436. https://doi.org/10.1126/science.aaf0918
Fink, G., McConnell, M., & Vollmer, S. (2014). Testing for heterogeneous treatment effects in experimental data: False discovery risks and correction procedures. Journal of Development Effectiveness, 6(1), 44–57. https://doi.org/10.1080/19439342.2013.875054
Flory, J. A., Gneezy, U., Leonard, K. L., & List, J. A. (2015a). Gender, age, and competition: The disappearing gap. Unpublished manuscript.
Flory, J. A., Leibbrandt, A., & List, J. A. (2015b). Do competitive workplaces deter female workers? A large-scale natural field experiment on job-entry decisions. The Review of Economic Studies, 82(1), 122–155. https://doi.org/10.1093/restud/rdu030
Gneezy, U., Niederle, M., & Rustichini, A. (2003). Performance in competitive environments: Gender differences. The Quarterly Journal of Economics, 118(3), 1049–1074. https://doi.org/10.1162/00335530360698496
Heckman, J., Moon, S. H., Pinto, R., Savelyev, P., & Yavitz, A. (2010). Analyzing social experiments as implemented: A reexamination of the evidence from the HighScope Perry Preschool Program. Quantitative Economics, 1(1), 1–46. https://doi.org/10.3982/QE8
Heckman, J. J., Pinto, R., Shaikh, A. M., & Yavitz, A. (2011). Inference with imperfect randomization: The case of the Perry Preschool Program. National Bureau of Economic Research Working Paper w16935.
Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2), 65–70.
Hossain, T., & List, J. A. (2012). The behavioralist visits the factory: Increasing productivity using simple framing manipulations. Management Science, 58(12), 2151–2167. https://doi.org/10.1287/mnsc.1120.1544
Ioannidis, J. (2005). Why most published research findings are false. PLoS Medicine, 2(8), e124. https://doi.org/10.1371/journal.pmed.0020124
Jennions, M. D., & Møller, A. P. (2002). Publication bias in ecology and evolution: An empirical assessment using the 'trim and fill' method. Biological Reviews of the Cambridge Philosophical Society, 77(2), 211–222. https://doi.org/10.1017/S1464793101005875
Karlan, D., & List, J. A. (2007). Does price matter in charitable giving? Evidence from a large-scale natural field experiment. The American Economic Review, 97(5), 1774–1793. https://doi.org/10.1257/aer.97.5.1774
Kling, J., Liebman, J., & Katz, L. (2007). Experimental analysis of neighborhood effects. Econometrica, 75(1), 83–119. https://doi.org/10.1111/j.1468-0262.2007.00733.x
Lee, S., & Shaikh, A. M. (2014). Multiple testing and heterogeneous treatment effects: Re-evaluating the effect of PROGRESA on school enrollment. Journal of Applied Econometrics, 29(4), 612–626. https://doi.org/10.1002/jae.2327
Lehmann, E., & Romano, J. (2005). Generalizations of the familywise error rate. The Annals of Statistics, 33(3), 1138–1154. https://doi.org/10.1214/009053605000000084
Lehmann, E. L., & Romano, J. P. (2006). Testing statistical hypotheses. Berlin: Springer.
Levitt, S. D., List, J. A., Neckermann, S., & Sadoff, S. (2012). The behavioralist goes to school: Leveraging behavioral economics to improve educational performance. National Bureau of Economic Research Working Paper w18165.
List, J. A., & Samek, A. S. (2015). The behavioralist as nutritionist: Leveraging behavioral economics to improve child food choice and consumption. Journal of Health Economics, 39, 135–146. https://doi.org/10.1016/j.jhealeco.2014.11.002
Machado, C., Shaikh, A., Vytlacil, E., & Lunch, C. (2013). Instrumental variables and the sign of the average treatment effect. Unpublished manuscript, Getulio Vargas Foundation, University of Chicago, and New York University.
Maniadis, Z., Tufano, F., & List, J. A. (2014). One swallow doesn't make a summer: New evidence on anchoring effects. The American Economic Review, 104(1), 277–290. https://doi.org/10.1257/aer.104.1.277
Niederle, M., & Vesterlund, L. (2007). Do women shy away from competition? Do men compete too much? The Quarterly Journal of Economics, 122(3), 1067–1101. https://doi.org/10.1162/qjec.122.3.1067
Nosek, B. A., Spies, J. R., & Motyl, M. (2012). Scientific utopia II: Restructuring incentives and practices to promote truth over publishability. Perspectives on Psychological Science, 7(6), 615–631. https://doi.org/10.1177/1745691612459058
Romano, J. P., & Shaikh, A. M. (2006a). On stepdown control of the false discovery proportion. In Lecture Notes–Monograph Series (pp. 33–50).
Romano, J. P., & Shaikh, A. M. (2006b). Stepup procedures for control of generalizations of the familywise error rate. The Annals of Statistics, 34, 1850–1873. https://doi.org/10.1214/009053606000000461
Romano, J. P., & Shaikh, A. M. (2012). On the uniform asymptotic validity of subsampling and the bootstrap. The Annals of Statistics, 40(6), 2798–2822. https://doi.org/10.1214/12-AOS1051
Romano, J. P., Shaikh, A. M., & Wolf, M. (2008a). Control of the false discovery rate under dependence using the bootstrap and subsampling. Test, 17(3), 417–442. https://doi.org/10.1007/s11749-008-0126-6
Romano, J. P., Shaikh, A. M., & Wolf, M. (2008b). Formalized data snooping based on generalized error rates. Econometric Theory, 24(2), 404–447. https://doi.org/10.1017/S0266466608080171
Romano, J. P., & Wolf, M. (2005). Stepwise multiple testing as formalized data snooping. Econometrica, 73(4), 1237–1282. https://doi.org/10.1111/j.1468-0262.2005.00615.x
Romano, J. P., & Wolf, M. (2010). Balanced control of generalized error rates. The Annals of Statistics, 38, 598–633. https://doi.org/10.1214/09-AOS734
Sutter, M., & Glätzle-Rützler, D. (2014). Gender differences in the willingness to compete emerge early in life and persist. Management Science, 61(10), 2339–2354. https://doi.org/10.1287/mnsc.2014.1981
Westfall, P. H., & Young, S. S. (1993). Resampling-based multiple testing: Examples and methods for p-value adjustment. New York: Wiley.