Test many theories in many ways
Published online by Cambridge University Press: 05 February 2024
Abstract
Demonstrating the limitations of the one-at-a-time approach, crowd initiatives reveal the surprisingly powerful role of analytic and design choices in shaping scientific results. At the same time, cross-cultural variability in effects is far below the levels initially expected. This highlights the value of “medium” science, leveraging diverse stimulus sets and extensive robustness checks to achieve integrative tests of competing theories.
Type: Open Peer Commentary
Copyright © The Author(s), 2024. Published by Cambridge University Press
References
Baribault, B., Donkin, C., Little, D. R., Trueblood, J. S., Oravecz, Z., Van Ravenzwaaij, D., … Vandekerckhove, J. (2018). Metastudies for robust tests of theory. Proceedings of the National Academy of Sciences of the United States of America, 115(11), 2607–2612.
Botvinik-Nezer, R., Holzmeister, F., Camerer, C. F., Dreber, A., Huber, J., Johannesson, M., … Schonberg, T. (2020). Variability in the analysis of a single neuroimaging dataset by many teams. Nature, 582, 84–88.
Brescoll, V. L., & Uhlmann, E. L. (2008). Can an angry woman get ahead? Status conferral, gender, and expression of emotion in the workplace. Psychological Science, 19, 268–275.
Breznau, N., Rinke, E. M., Wuttke, A., Nguyen, H. H., Adem, M., Adriaans, J., … Van Assche, J. (2022). Observing many researchers using the same data and hypothesis reveals a hidden universe of uncertainty. Proceedings of the National Academy of Sciences of the United States of America, 119(44), e2203150119.
Delios, A., Clemente, E., Wu, T., Tan, H., Wang, Y., Gordon, M., … Uhlmann, E. L. (2022). Examining the context sensitivity of research findings from archival data. Proceedings of the National Academy of Sciences of the United States of America, 119(30), e2120377119.
Henrich, J., Heine, S. J., & Norenzayan, A. (2010). The weirdest people in the world? Behavioral and Brain Sciences, 33, 61–83.
Landy, J. F., Jia, M., Ding, I. L., Viganola, D., Tierney, W., Dreber, A., … Uhlmann, E. L. (2020). Crowdsourcing hypothesis tests: Making transparent how design choices shape research results. Psychological Bulletin, 146(5), 451–479.
Leavitt, K., Mitchell, T., & Peterson, J. (2010). Theory pruning: Strategies for reducing our dense theoretical landscape. Organizational Research Methods, 13, 644–667.
Mayo, D. G. (2018). Statistical inference as severe testing: How to get beyond the statistics wars. Cambridge University Press.
McGuire, W. J. (1973). The yin and yang of progress in social psychology: Seven koan. Journal of Personality and Social Psychology, 26(3), 446–456.
Menkveld, A. J., Dreber, A., Holzmeister, F., Huber, J., Johannesson, M., Kirchler, M., … Wu, Z.-X. (2023). Non-standard errors. The Journal of Finance. http://dx.doi.org/10.2139/ssrn.3961574
Muthukrishna, M., Bell, A. V., Henrich, J., Curtin, C. M., Gedranovich, A., McInerney, J., & Thue, B. (2020). Beyond western, educated, industrial, rich, and democratic (WEIRD) psychology: Measuring and mapping scales of cultural and psychological distance. Psychological Science, 31(6), 678–701.
Norenzayan, A., & Heine, S. J. (2005). Psychological universals: What are they and how can we know? Psychological Bulletin, 131, 763–784.
Olsson-Collentine, A., Wicherts, J. M., & van Assen, M. A. L. M. (2020). Heterogeneity in direct replications in psychology and its association with effect size. Psychological Bulletin, 146(10), 922–940.
Platt, J. R. (1964). Strong inference. Science, 146, 347–353.
Schweinsberg, M., Feldman, M., Staub, N., van den Akker, O., van Aert, R., van Assen, M., … Uhlmann, E. (2021). Radical dispersion of effect size estimates when independent scientists operationalize and test the same hypothesis with the same data. Organizational Behavior and Human Decision Processes, 165, 228–249.
Silberzahn, R., Uhlmann, E. L., Martin, D., Anselmi, P., Aust, F., Awtrey, E., … Nosek, B. A. (2018). Many analysts, one dataset: Making transparent how variations in analytical choices affect results. Advances in Methods and Practices in Psychological Science, 1, 337–356.
Steegen, S., Tuerlinckx, F., Gelman, A., & Vanpaemel, W. (2016). Increasing transparency through a multiverse analysis. Perspectives on Psychological Science, 11, 702–712.
Tierney, W., Cyrus-Lai, W., … (2023). Who respects an angry woman? A pre-registered re-examination of the relationships between gender, emotion expression, and status conferral. Unpublished manuscript.
Tierney, W., Hardy, J. H., III, Ebersole, C., Leavitt, K., Viganola, D., Clemente, E., … Uhlmann, E. (2020). Creative destruction in science. Organizational Behavior and Human Decision Processes, 161, 291–309.
Tierney, W., Hardy, J. H., III, Ebersole, C. R., Viganola, D., Clemente, E. G., Gordon, M., … Uhlmann, E. L. (2021). A creative destruction approach to replication: Implicit work and sex morality across cultures. Journal of Experimental Social Psychology, 93, 104060.
Almaatouq et al. argue that the “one-at-a-time” approach to scientific research has led to collections of atomized findings of unclear relevance to each other. They advocate for an integrative approach in which stimuli are varied systematically across theoretically important dimensions. This allows for strong inferences (Platt, 1964) regarding which theory holds the most explanatory power across diverse contexts, as well as the identification of meaningful moderators.
Our research group has addressed this challenge by examining the analytic and design choices that emerge naturally across independent investigators, as well as the implications for the empirical results (Landy et al., 2020; Schweinsberg et al., 2021; Silberzahn et al., 2018). These crowdsourced many-analysts and many-designs initiatives reveal dramatic dispersion in estimates due to researcher choices, empirically demonstrating the limitations of the one-at-a-time approach (see also Baribault et al., 2018; Botvinik-Nezer et al., 2020; Breznau et al., 2022; Menkveld et al., 2023). At the same time, we have sought to further increase the already high theoretical value of replications by leveraging them for competitive theory testing. Rather than test the original theory against the null hypothesis, we include new conditions and measures that allow us to simultaneously examine the preregistered predictions of different theoretical accounts (Tierney et al., 2020, 2021). In this manner, we can start to prune the dense theoretical landscape (Leavitt, Mitchell, & Peterson, 2010) found in areas of inquiry characterized by many atomized findings and narrow theories.
In contrast, a striking and unexpected lack of variability has emerged in the results when many laboratories collect data using the same methods. In such crowd replication initiatives, cross-site heterogeneity in estimates is far below what one would expect based on intuition and theory (Olsson-Collentine, Wicherts, & van Assen, 2020). From a perspectivist standpoint (McGuire, 1973), psychological phenomena should emerge in some contexts and be nonexistent or even reversed in others (see also Henrich, Heine, & Norenzayan, 2010). And yet, effects seem either to fail to replicate across all populations sampled or to emerge again and again (see also Delios et al., 2022).
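To make the notion of cross-site heterogeneity concrete, the sketch below estimates how much of the variability in a replication effect is attributable to differences between data collection sites, using a standard random-effects summary (tau-squared and I-squared). It is a minimal illustration in Python with entirely hypothetical per-site effect sizes and standard errors, not an analysis from any of the projects described here.

```python
# Minimal sketch: quantifying cross-site heterogeneity with a
# DerSimonian-Laird random-effects estimate (hypothetical data).
import numpy as np

def heterogeneity(effects, std_errors):
    """Return tau^2 (between-site variance) and I^2 (% of total
    variability attributable to cross-site differences)."""
    effects = np.asarray(effects, dtype=float)
    weights = 1.0 / np.asarray(std_errors, dtype=float) ** 2
    pooled = np.sum(weights * effects) / np.sum(weights)
    q = np.sum(weights * (effects - pooled) ** 2)   # Cochran's Q
    df = len(effects) - 1
    c = np.sum(weights) - np.sum(weights ** 2) / np.sum(weights)
    tau2 = max(0.0, (q - df) / c)                   # between-site variance
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return tau2, i2

# Hypothetical per-site effect sizes (e.g., Cohen's d) and standard errors
site_effects = [0.21, 0.18, 0.25, 0.19, 0.23, 0.20]
site_ses = [0.08, 0.07, 0.09, 0.08, 0.07, 0.08]
tau2, i2 = heterogeneity(site_effects, site_ses)
print(f"tau^2 = {tau2:.3f}, I^2 = {i2:.1f}%")
```

In the crowd replication initiatives discussed above, such between-site variance estimates have typically been close to zero, which is what motivates the argument that follows.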
Bringing many designs, analyses, theories, and data collection teams together, we recently completed a crowdsourced initiative that qualifies as the type of comprehensive integrative test that Almaatouq et al. envision. Tierney et al. (2023) systematically re-examined the relationships between anger expression, target gender, and status conferral. In the original research, women who displayed anger in professional settings suffered steep drops in the status and respect they were accorded by social perceivers (Brescoll & Uhlmann, 2008). The original investigations employed only a single set of videos featuring one female and one male target as stimuli, and all participants were from Connecticut. In contrast, the crowdsourced replication project featured 27 experimental designs, a multiverse capturing many defensible analytic approaches, and 68 data collection sites in 23 countries. We further tested the original prescriptive stereotype account against competing theories predicting that anger signals status similarly for women and men, that anger has vastly different status implications in Eastern and Western cultures, and that feminist messaging has successfully reduced or even reversed gender biases. As Almaatouq et al. recommend, we probed the dose–response relationship between anger and status conferral by both experimentally manipulating and measuring the extremity of emotion expressions across different designs.
Aggregating across a wide range of research approaches and populations, the crowd initiative finds that anger increases status by signaling dominance and assertiveness, while also diminishing it by projecting incompetence and unlikability. Critically, this same pattern emerged for both female and male targets, for social perceivers of different genders, and in both Eastern, harmony-oriented cultures and Western, more conflict-oriented ones. Highlighting the value of deploying diverse research approaches, six of the 27 designs found favoritism toward men in status conferral, whereas one design pointed to the opposite conclusion. Similarly, in a multiverse with 32 branches, just two specifications supported the original gender-and-anger backlash effect. Had we employed a one-at-a-time approach, we could have accidentally hit upon, or strategically chosen, narrow methods yielding nonrepresentative conclusions (e.g., of pro-female status bias or gender backlash). Overall, the intellectual returns on including many designs, many analyses, and many theories were high. In contrast, and consistent with past crowd initiatives, collecting data across many places revealed minimal cross-site heterogeneity and no interesting cultural differences.
Thus, we envision a diverse scientific ecology consisting of many “small” and “medium” projects and just a few huge international efforts. The one-at-a-time approach is an efficient means to introduce initial evidence for promising new hypotheses. However, as a theoretical space becomes increasingly cluttered, intellectual returns are maximized by sampling stimuli widely and employing many analyses to provide severe tests of competing theories (Mayo, 2018). Although this could involve a crowd of laboratories, a single team could carry out a multiverse analysis (Steegen, Tuerlinckx, Gelman, & Vanpaemel, 2016) and operationalize key variables in a variety of ways, as sketched below. A small team might sample just one or two participant populations that are easily accessible to it. Finally, a subset of findings of particularly high theoretical and practical importance should be selected for crowdsourced data collection across many nations as a systematic test of cross-cultural generalizability. When numerous sites are not available, researchers might carry out the first generalizability test in the most culturally distant population available (Muthukrishna et al., 2020). If the effect is still observed, this represents initial evidence of universality (Norenzayan & Heine, 2005).
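As a rough illustration of what such a single-team multiverse might look like, the sketch below enumerates every combination of a few defensible analytic choices (outcome operationalization, exclusion rule, and transformation) and re-estimates the effect of interest in each branch. The data, variable names, and choice points are entirely hypothetical and are meant only to convey the mechanics of the approach.

```python
# Minimal multiverse sketch (hypothetical data and choice points): every
# combination of defensible analytic choices defines one specification,
# and the effect of interest is re-estimated in each branch.
from itertools import product
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
data = pd.DataFrame({
    "condition": rng.integers(0, 2, n),        # 0 = control, 1 = anger expression
    "status_1": rng.normal(5, 1.5, n),         # two alternative status measures
    "status_2": rng.normal(5, 1.5, n),
    "attention_check": rng.random(n) > 0.1,    # pass/fail flag
    "response_time": rng.lognormal(0, 0.5, n),
})

# Choice points a single team might consider defensible
outcomes = ["status_1", "status_2"]
exclusions = {
    "none": lambda d: d,
    "failed_checks": lambda d: d[d["attention_check"]],
    "fast_responders": lambda d: d[d["response_time"] > 0.3],
}
transforms = {"raw": lambda s: s, "z_scored": lambda s: (s - s.mean()) / s.std()}

results = []
for outcome, (excl_name, excl), (tr_name, tr) in product(
        outcomes, exclusions.items(), transforms.items()):
    d = excl(data)
    y = tr(d[outcome])
    effect = y[d["condition"] == 1].mean() - y[d["condition"] == 0].mean()
    results.append({"outcome": outcome, "exclusion": excl_name,
                    "transform": tr_name, "effect": effect})

multiverse = pd.DataFrame(results)
print(multiverse.sort_values("effect"))        # 2 x 3 x 2 = 12 branches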
In sum, an ironic legacy of the movement to crowdsource behavioral research may be the demonstration that scaling science to such a massive level is neither efficient nor strictly necessary for most research findings. The sorts of integrative tests Almaatouq et al. envision can also be accomplished by a small team that actively ensures a diversity of analyses and stimuli, yet collects data locally or across a few carefully selected cultures rather than globally. In the future, our greatest intellectual returns on investment may come from “medium” science that prioritizes testing many theories in many ways.
Financial support
This research was supported by an R&D grant from INSEAD to Eric Luis Uhlmann.
Competing interest
None.