Imagine a pharmaceutical company spends years developing a new cancer treatment. Because of the expense of drug development, the company collects extensive data during human trials. In particular, researchers collect data about hundreds of health outcomes other than cancer. When the data are analyzed, researchers find that treatment is associated with a reduction in breast cancer. Here’s an instance of a more general question:
Question: Should the pharmaceutical researchers alter their methods for analyzing the cancer data because the treatment’s efficacy was assessed in so many other ways?
According to many statisticians and scientists, the answer is yes. Let multiplicity refer to the act of evaluating many statistical hypotheses simultaneously. When multiplicity occurs, many statisticians and scientists recommend “correcting” Footnote 1 $p$ -values so as to reduce the number of false-positive results. Footnote 2 Although Bayesian statisticians reject the use of $p$ -values, many likewise argue that one’s statistical methods should be adjusted for multiplicity. Footnote 3 This raises a very general question:
Central Question: Under what conditions, if any, should statistical methods be adjusted for multiplicity? In what way should they be adjusted? And why?
The central question is important because as our computational power grows, so does our ability to evaluate thousands of policy-relevant statistical hypotheses in a matter of minutes.
Although statisticians have investigated the reliability of many adjustment procedures, few have clarified the central question. What exactly is adjustment? Can “adjustment” be defined without reference to particular statistical methods? If “adjusting” means “changing reported $p$ -values,” then devout Bayesian statisticians never adjust for multiplicity, as they avoid calculating $p$ -values! Footnote 4 So is there a sense of “adjustment” that renders classical and Bayesian approaches comparable?
The normative dimensions of the central question have also yet to be clarified. In what sense “should” one adjust for multiplicity? Is adjustment rationally required to achieve certain goals? If so, which goals? Is adjustment epistemically required to respect one’s evidence? Is it scientifically required by norms of scientific inquiry? Is it ethically obligatory? If adjustment is not obligatory, is it permissible or good in any sense? Footnote 5
Finally, answers to those normative questions depend on who or what is adjusting. Researchers can adjust reported $p$ -values. But so can journal editors. Grant-giving agencies—like the National Institutes of Health (NIH)—can also adjust for multiplicity in various ways. Which, if any, of these decision-making bodies should adjust?
The main contribution of this article is to (1) distinguish two senses of adjustment, (2) investigate the prudential and epistemic goals that adjustment might achieve, and (3) formulate more precise versions of the central question. I also prove a new theorem characterizing when adjustment is impermissible. I tentatively conclude that there is a mismatch between the goals of scientists (both individually and collectively) and the guarantees of existing adjustment procedures. This article, thus, is a call for further research: We must either prove existing adjustment methods achieve goals of actual scientific interest or develop alternative procedures.
1 Basic model
To distinguish types of adjustment, I introduce a model. Suppose $N$ hypotheses are under investigation. Assume that any subset of the $N$ hypotheses might be true. Let ${\rm{\Theta }} = {\{ 0,1\} ^N}$ be the set of all binary strings/vectors of length $N$ . A vector $\theta \in {\rm{\Theta }}$ , therefore, specifies which of the $N$ hypotheses are true and which are false. Let ${H_k} = \left\{ {\theta \in {\rm{\Theta }}:{\theta _k} = 0} \right\}$ be the set of vectors that say the $k$ th hypothesis is true.
Suppose that for each hypothesis ${H_k}$ , there is some experiment ${X_k}$ that could be conducted (or observation that could be made); researchers believe ${X_k}$ could be informative about whether ${H_k}$ holds. Formally, ${X_k}$ is a random variable, and for each $\theta \in {\rm{\Theta }}$ , let ${\mathbb{P}_\theta }\left( {{X_1}, \ldots, {X_N}} \right)$ denote the probability measure that specifies the chances of various experimental outcomes.
For simplicity, assume that for all $\theta \in \Theta$, the $N$ experiments are mutually independent with respect to $\mathbb{P}_\theta$. In symbols, let $\vec X = \langle X_{i_1}, X_{i_2}, \ldots, X_{i_k}\rangle$ be a random vector representing some subset of the $N$ experiments. Then for all sequences $\vec x = (x_{i_1}, \ldots, x_{i_k})$ representing the outcome of those $k \le N$ experiments,

$$\mathbb{P}_\theta(\vec X = \vec x) = \prod_{j=1}^{k} \mathbb{P}_\theta(X_{i_j} = x_{i_j}). \tag{1}$$
Further, suppose that the truth or falsity of $H_k$ entirely determines the probabilities of the possible outcomes of the $k$th experiment; that is, for all $k \le N$ and all $r \in \{0,1\}$, there is a probability distribution $\mathbb{P}_{k,r}$ such that $\mathbb{P}_\theta(X_k = x_k) = \mathbb{P}_{k,\theta_k}(X_k = x_k)$. Together with the assumption of mutual independence, this entails that

$$\mathbb{P}_\theta(\vec X = \vec x) = \prod_{j=1}^{k} \mathbb{P}_{i_j,\,\theta_{i_j}}(X_{i_j} = x_{i_j}). \tag{2}$$
To assess whether a decision-maker should adjust for multiplicity, compare two types of situations. In the first, the decision-maker learns the outcome of a proper subset of the $N$ tests.
For simplicity, suppose that the researcher learns only the value of ${X_1}$ . In the second, she learns the values of all $N$ variables. Say that the decision-maker should adjust for multiplicity if her (1) beliefs or (2) decisions about ${H_1}$ should differ in those two situations. Let’s clarify those two senses of “adjustment.”
2 Belief
For the Bayesian, beliefs are modeled by posterior probabilities, and so a Bayesian adjusts for multiplicity if there is a value $x_1$ of $X_1$ such that

$$P(H_1 \mid X_1 = x_1) \ne P(H_1 \mid X_1 = x_1, X_2 = x_2, \ldots, X_N = x_N) \tag{3}$$
for all values ${x_2}, \ldots {x_N}$ of ${X_2}, \ldots {X_N}$ for which $P\left( {{X_1} = {x_1}, \ldots {X_N} = {x_N}} \right) \gt 0$ . One could distinguish a weaker sense of adjustment, whereby equation (3) holds for some values of ${X_2}, \ldots {X_N}$ . For critics of Bayesianism, one can replace the probability functions in equation (3) with another object representing belief. Footnote 6
Should one ever adjust for multiplicity, in the strong sense just identified? Yes. Consider a Bayesian researcher who regards the hypotheses as dependent, in that learning about one hypothesis provides evidence about another. For example, suppose our hypothetical pharmaceutical researchers consider two hypotheses: (1) The treatment is not effective in 33-year-old women, and (2) the treatment is not effective in 34-year-old women. A researcher might reasonably believe that the first hypothesis is true if and only if the second is. If so, acquiring data about 33-year-old women would provide evidence about the efficacy of the treatment for 34-year-old women. Here’s a toy model to illustrate such adjustment.
Example 1: Suppose each $X_k$ is a binary random variable that represents a test to retain or reject $H_k$. Assume there are $\alpha, \beta \in (0,1)$ such that for all $\theta \in \Theta$,

$$\mathbb{P}_\theta(X_k = 1) = \alpha \ \text{ if } \theta_k = 0, \qquad \mathbb{P}_\theta(X_k = 0) = \beta \ \text{ if } \theta_k = 1.$$
That is, each test ${X_k}$ has a Type I error of $\alpha $ and Type II error of $\beta $ .
To model a researcher who believes the hypotheses to be dependent, suppose that the researcher assigns positive probability to precisely two vectors in $\Theta$, namely, $\mathbf{0} = \langle 0, \ldots, 0\rangle$, which says each $H_k$ is true, and $\mathbf{1} = \langle 1, \ldots, 1\rangle$, which says each $H_k$ is false. If $\pi = P(\mathbf{0}) = 1 - P(\mathbf{1})$ represents the researcher’s prior degree of belief that all hypotheses are true, then her posterior probability in $H_1$ if she learns only that the first test is negative equals the following:

$$P(H_1 \mid X_1 = 0) = \frac{\pi(1-\alpha)}{\pi(1-\alpha) + (1-\pi)\beta}. \tag{4}$$
In contrast, if she learns two tests are negative, her posterior is as follows:

$$P(H_1 \mid X_1 = 0, X_2 = 0) = \frac{\pi(1-\alpha)^2}{\pi(1-\alpha)^2 + (1-\pi)\beta^2}. \tag{5}$$
Finally, if she learns the second test is positive, her posterior will be as follows:

$$P(H_1 \mid X_1 = 0, X_2 = 1) = \frac{\pi(1-\alpha)\alpha}{\pi(1-\alpha)\alpha + (1-\pi)\beta(1-\beta)}. \tag{6}$$
If $0 \lt \pi \lt 1$ , then equation (4) equals both equation (5) and equation (6) if and only if $\alpha = \left( {1 - \beta } \right)$ . If $\alpha \ne \left( {1 - \beta } \right)$ , therefore, the Bayesian researcher adjusts for multiplicity in the strong sense defined in equation (3). Footnote 7
□
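The posteriors in Example 1 are easy to check numerically. Below is a minimal Python sketch; the values $\alpha = 0.05$, $\beta = 0.2$, and $\pi = 0.5$ are illustrative choices of mine, not taken from the text. Because $\alpha \ne 1 - \beta$, the three posteriors differ, so the researcher adjusts in the strong sense.

```python
# Toy check of Example 1. Only two states get positive prior probability:
# "all hypotheses true" and "all hypotheses false." Each test X_k is
# negative (0) or positive (1), with Type I error alpha and Type II error
# beta. The numerical values are illustrative, not from the text.

alpha, beta, pi = 0.05, 0.2, 0.5

def likelihood(outcomes, all_true):
    """Probability of the observed test outcomes, given the state."""
    p = 1.0
    for x in outcomes:
        if all_true:   # H_k true: test negative with probability 1 - alpha
            p *= (1 - alpha) if x == 0 else alpha
        else:          # H_k false: test negative with probability beta
            p *= beta if x == 0 else (1 - beta)
    return p

def posterior_H1(outcomes):
    """Posterior that H_1 is true, i.e., that the state is 'all true'."""
    num = pi * likelihood(outcomes, True)
    return num / (num + (1 - pi) * likelihood(outcomes, False))

p_one = posterior_H1([0])     # only the first test, negative
p_two = posterior_H1([0, 0])  # two negative tests
p_mix = posterior_H1([0, 1])  # first negative, second positive
print(p_one, p_two, p_mix)    # three different values, so she adjusts
```

If instead $\alpha = 1 - \beta$, each test outcome is equally probable under both states, the tests carry no information, and all three values coincide, matching the condition in the text.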
Example 1 illustrates the commonsense idea that when one believes two hypotheses stand or fall together, evidence for/against one hypothesis is evidence for/against the other. Thus, a Bayesian researcher will adjust for multiplicity. Similarly, if the researcher believes that evidence for one hypothesis is evidence against another, she will adjust for multiplicity, as can be shown by analogous calculations.
In short, if a researcher believes several hypotheses are dependent, she will typically adjust her beliefs for multiplicity. Conversely, if the researcher regards the hypotheses as mutually independent, then she will not adjust for multiplicity; in that case, it is easy to check that $P({H_1}|{X_1}) = P({H_1}|{X_1}, \ldots {X_N})$ —again, assuming equation (1) holds. Footnote 8
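Conversely, the no-adjustment claim for independent hypotheses can be verified by brute-force enumeration over $\Theta$. The sketch below is my own illustration: it places a product (independent) prior over $\Theta = \{0,1\}^3$ and checks that conditioning on the other two tests leaves the posterior in $H_1$ untouched.

```python
from itertools import product

# Independent prior over Theta = {0,1}^3: H_k is true (theta_k = 0) with
# probability q[k]. Each test has Type I error alpha and Type II error
# beta, and tests are independent given theta. Numbers are illustrative.

q = [0.7, 0.4, 0.9]
alpha, beta = 0.05, 0.2
N = len(q)

def prior(theta):
    p = 1.0
    for k in range(N):
        p *= q[k] if theta[k] == 0 else 1 - q[k]
    return p

def lik(theta, obs):
    """obs maps a test index to its outcome (0 = negative, 1 = positive)."""
    p = 1.0
    for k, x in obs.items():
        if theta[k] == 0:
            p *= (1 - alpha) if x == 0 else alpha
        else:
            p *= beta if x == 0 else (1 - beta)
    return p

def posterior_H1(obs):
    states = list(product([0, 1], repeat=N))
    num = sum(prior(t) * lik(t, obs) for t in states if t[0] == 0)
    den = sum(prior(t) * lik(t, obs) for t in states)
    return num / den

p_alone = posterior_H1({0: 0})              # X_1 alone
p_all   = posterior_H1({0: 0, 1: 0, 2: 1})  # all three tests
print(p_alone, p_all)                       # identical: no adjustment
```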
On the one hand, these results about the relationship between adjustment and dependence in the toy Bayesian model are not surprising. They illustrate the intuition that a researcher who wants to know what to believe about the effects of cigar smoking (i) will typically adjust her belief if she acquires data about the effects of cigarette smoking but (ii) will not adjust her beliefs if she acquires data about implicit bias.
On the other hand, the results begin to answer the central question. In particular, they answer the objection that there is no principled way to determine when to adjust (Perneger 1998). This objection is typically leveled against classical methods—like Bonferroni’s or Benjamini–Hochberg’s—that recommend adjusting significance thresholds downward as the number of hypotheses increases. Yet the objection applies equally to a simple objective Bayesian method that I discuss later; the method adjusts for multiplicity by uniformly decreasing the prior probabilities assigned to hypotheses as the number of hypotheses grows.
According to critics, the justification of such methods implies that one should adjust/“correct” for any chosen set of hypotheses. But that’s absurd because one would be required to adjust for every statistical hypothesis that has ever been formulated. This motivates thinking that the answer to the central question is, “One should never adjust for multiplicity, and intuitions to the contrary are misleading.”
The toy results show how simple Bayesian thinking can partially answer the objection. Prior evidence or background theory may tell us that certain hypotheses are dependent, and in such cases, belief adjustment will almost certainly be necessary. Further research should investigate whether the most common classical adjustment methods (see subsection 3.2) can ever be interpreted as reflecting belief adjustment.
One might object that the aforementioned definition of adjusting “belief” is too simple to model some common statistical practices. The problem is that the same probability measure $P$ appears on both sides of equation (3). So the definition is inapplicable for assessing whether “objective” Bayesian methods require adjustment.
Recall that objective Bayesians maintain that the prior probability that one assigns to hypothesis $H$ may vary with the hypothesis space in which $H$ is embedded. For example, consider an attempt to identify which genes are associated with which heritable diseases. For each gene and disease under investigation, researchers may investigate a hypothesis ${H_{g,d}}$ of the form “Gene $g$ is associated with the disease $d$ .” In an objective Bayesian analysis, each hypothesis ${H_{g,d}}$ will typically receive lower prior probability if there are $20,000$ genes under investigation than it would receive if there were $10,000$ genes under consideration.
I will not compare the merits of objective versus subjective Bayesian analysis. Footnote 9 But simple objective Bayesian adjustment methods deserve further scrutiny. Imagine our hypothetical pharmaceutical researcher wonders about the effect of El Niño on the stock market. The mere contemplation of a new hypothesis should not automatically cause the researcher to become less confident in the efficacy of the new cancer treatment.
Yet considering additional—logically independent—hypotheses can affect an objective Bayesian’s prior probabilities if those probabilities are chosen in a mechanical fashion as a function of the number of hypotheses.
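To see why mechanical priors behave this way, consider a toy rule (my own assumption, not one the text endorses) on which the prior probability that any given null is false is $c/N$ for $N$ hypotheses under study. With the data held fixed, merely contemplating more hypotheses drags the posterior down:

```python
# Toy "objective" prior that shrinks mechanically with the number of
# hypotheses: prior probability that a given effect is real is c / N.
# Both the c / N rule and the likelihood ratio are assumptions made
# purely for illustration.

def posterior_effect(N, c=100, likelihood_ratio=20.0):
    """Posterior that the effect is real, given fixed data favoring it."""
    p0 = c / N
    return p0 * likelihood_ratio / (p0 * likelihood_ratio + (1 - p0))

p_10k = posterior_effect(10_000)  # 10,000 hypotheses contemplated
p_20k = posterior_effect(20_000)  # same data, twice as many hypotheses
print(p_10k, p_20k)               # the posterior drops with N
```

Nothing about the evidence for the original effect has changed between the two calls; only the size of the contemplated hypothesis space has.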
Objective Bayesians might respond that a prior distribution need not represent anyone’s beliefs. Footnote 10 Rather, a prior should be treated as part of a decision rule. I agree, and I consider decision-making in the next section. For now, note that it is similarly implausible that a pharmaceutical researcher should adjust her decisions about the efficacy of the cancer treatment after contemplating El Niño. Saying that the researcher’s prior need not represent her beliefs does not explain why adjustment is not necessary.
3 Decision
Scientists are rarely satisfied with an answer to the question, “What should I believe?” They also want to know, “What should I do?” For instance, an experimentalist might want to know which experiment she should conduct next.
Imagine that for each hypothesis ${H_k}$ , there is some set of acts ${A_k}$ that the researcher might take. For instance, a researcher might announce that the hypothesis ${H_k}$ has been rejected or that it’s been retained. She might collect more data about ${H_k}$ or cease an experiment. And so on.
I call elements of $A_k$ component acts, and I define a strategy to be a set $S$ of component acts such that for all $k$, $S \cap A_k$ is either a singleton or empty. That is, at most one act can be taken with respect to each hypothesis. A decision rule $d$ maps subsets of (values of) the observable variables $X_1, \ldots, X_N$ to strategies. I require that $d(X_{k_1} = x_{k_1}, \ldots, X_{k_m} = x_{k_m})$ contain precisely one element from each of the sets $A_{k_1}, \ldots, A_{k_m}$. That requirement says that a decision rule specifies actions only with respect to hypotheses for which the researcher has collected data, and that if the researcher observes $X_k$, then she must take some action in $A_k$.
I say that a decision rule $d$ adjusts for multiplicity if there is some $x_1$ such that

$$d(X_1 = x_1) \cap A_1 \ne d(X_1 = x_1, X_2 = x_2, \ldots, X_N = x_N) \cap A_1 \tag{7}$$

for all values $x_2, \ldots, x_N$ of $X_2, \ldots, X_N$.
Do any plausible decision rules require adjusting? Again, yes. For a Bayesian, reporting one’s posterior probabilities is a decision. So belief adjustment is a special case of decision adjustment. A better question is, “Can there be decision adjustment without belief adjustment, and what goals, if any, does decision adjustment achieve?”
Before discussing the standard approach for evaluating testing procedures (in terms of the family-wise error rate or false-discovery rate), I begin with the most naive, decision-theoretic approach for answering these questions. The naive approach is worth sketching because (1) it is, I think, the correct approach when it can be employed, Footnote 11 and (2) it helps one identify the oddness of the goals that are presumed in standard discussions of adjustment.
3.1 A naive approach
Suppose a researcher assigns a utility $u(S,\theta)$ to each strategy $S$ and vector $\theta \in \Theta$ specifying which of the $N$ hypotheses are true. If we fix a vector $\theta \in \Theta$, then the researcher’s expected utility (with respect to $\mathbb{P}_\theta$) can be defined straightforwardly, whether she decides to observe one variable or all $N$ variables: Footnote 12

$$\mathbb{E}_\theta^1[d] = \sum_{x_1 \in \mathcal{X}_1} u(d(X_1 = x_1), \theta)\, \mathbb{P}_\theta(X_1 = x_1), \qquad \mathbb{E}_\theta^N[d] = \sum_{\vec x \in \mathcal{X}} u(d(\vec X = \vec x), \theta)\, \mathbb{P}_\theta(\vec X = \vec x).$$
Here, ${{\cal X}_1}$ is the range of ${X_1}$ , and ${\cal X}$ is the range of the random vector $\vec X = \left( {{X_1}, \ldots, {X_N}} \right)$ . One can now apply standard decision-theoretic terms to identify different senses in which a decision rule is good or bad.
For instance, a researcher might desire a maximin decision rule, that is, a rule $d$ such that $\min_{\theta \in \Theta} \mathbb{E}_\theta^j[d] \ge \min_{\theta \in \Theta} \mathbb{E}_\theta^j[e]$ for all decision rules $e$, where $j = 1$ or $j = N$. Alternatively, she might be a Bayesian; that is, she might always select a (subjective) expected-utility-maximizing strategy with respect to her posterior. Recall that the subjective expected utility of a strategy $S$ with respect to a measure $P$ is given by the following:

$$\mathbb{E}_P[S] = \sum_{\theta \in \Theta} u(S, \theta)\, P(\theta). \tag{8}$$
Thus, there is a Bayesian who will adjust for multiplicity if there is a probability measure $P$ , utility function $u$ , and experimental outcomes $\vec x = \left( {{x_1}, \ldots, {x_N}} \right) \in {\cal X}$ such that three conditions hold:
1. $P(\vec X = \vec x) \gt 0$;

2. $a_1$ maximizes $\mathbb{E}_{P(\cdot \mid X_1 = x_1)}[a]$ over all $a \in A_1$; and

3. $a_1 \notin S$ for some $S$ that maximizes $\mathbb{E}_{P(\cdot \mid \vec X = \vec x)}[T]$, where $T$ ranges over strategies containing a component act in every $A_k$.
We can now make the central question more precise in a second way. For which utility functions do standard nonprobabilistic decision rules like maximin adjust for multiplicity in the sense of equation (7)? Similarly, for which priors and utility functions does an expected-utility maximizer adjust for multiplicity?
For simplicity, assume that a decision-maker’s utilities are separable across component acts in the following sense. Footnote 13 Assume that, for each hypothesis $H_k$, there is a “component” utility function $u_k: A_k \times \{0,1\} \to \mathbb{R}$ that specifies the utilities $u_k(a,0)$ and $u_k(a,1)$ of taking action $a \in A_k$ when $H_k$ is true and false, respectively. Further, suppose that the utility of a strategy $u(S,\theta)$ in state $\theta$ is the sum of the utilities of its component acts, that is:

$$u(S, \theta) = \sum_{k\,:\,S \cap A_k = \{a_k\}} u_k(a_k, \theta_k). \tag{9}$$
Utilities are separable when (a) the decision-maker can take component acts in parallel, and (b) payoffs for taking different component acts do not interact. Such assumptions are most plausible when two conditions are met. First, acts are cheap, or the decision-maker has plentiful resources (and so pursuing multiple projects in parallel is not prohibitively costly). Second, the hypotheses concern unrelated phenomena (so that the important theoretical consequences of a conjunction of hypotheses are the union of the theoretical consequences of the conjuncts). If the decision-maker is a grant-making institution like the National Science Foundation (NSF) or NIH, then utilities associated with projects in different scientific fields are plausibly separable. The size of the institution makes funding projects in parallel possible, and it is rare to find results in two disparate scientific fields that, when taken together, yield important insights that neither result yields by itself.
The next theorem suggests that when utilities are separable, adjustment is never obligatory, and it is sometimes impermissible. Footnote 14
Theorem 1 Suppose utilities are separable in the sense of equation (9). Then there are maximin rules that do not adjust for multiplicity. If in addition, the hypotheses of ${\rm{\Theta }}$ are mutually independent with respect to $P$ , then one can maximize (subjective) expected utility with respect to $P$ without adjusting. It follows that if the maximin rule is unique, then no decision rule that adjusts is maximin. Similar remarks apply to expected-utility maximization.
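The expected-utility half of theorem 1 can be illustrated numerically. Under the separability assumption of equation (9) and a prior that makes the hypotheses independent, brute-force maximization over whole strategies agrees with maximizing each component act on its own, so no adjustment occurs. All numbers below are toy values of my own.

```python
from itertools import product

# Three hypotheses, two component acts each. Utilities are separable:
# u(S, theta) = sum_k u_k(a_k, theta_k). The prior makes the hypotheses
# independent: P(theta_k = 0) = q[k]. All values are toy numbers.

q = [0.6, 0.3, 0.8]
U = {("retain", 0): 1.0, ("retain", 1): -2.0,
     ("reject", 0): -5.0, ("reject", 1): 3.0}
ACTS = ["retain", "reject"]

def expected_component(k, act):
    """Expected component utility of taking act on hypothesis k."""
    return q[k] * U[(act, 0)] + (1 - q[k]) * U[(act, 1)]

def expected_strategy(strategy):
    """By separability and independence, the expectation decomposes."""
    return sum(expected_component(k, a) for k, a in enumerate(strategy))

best_joint = max(product(ACTS, repeat=3), key=expected_strategy)
best_componentwise = tuple(
    max(ACTS, key=lambda a, k=k: expected_component(k, a)) for k in range(3))
print(best_joint, best_componentwise)  # the two coincide
```

Joint and componentwise optimization pick the same strategy; nothing about the optimal act for $H_1$ depends on what the other tests say.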
One might object that individual scientists will rarely have separable utilities for the reasons identified earlier. Component acts are often costly: Pursuing one project typically comes at the expense of pursuing another. And even if the component acts are cheap (e.g., making an announcement), it is rare that scientists investigate hypotheses that are so unrelated that if the conjunction were true, no further important insights would follow. Scientists are highly specialized, and thus they typically study hypotheses that are related.
However, I have not identified necessary conditions for separability; utility functions might be (approximately) separable for other reasons. More importantly, theorem 1 yields sufficient conditions for nonadjustment, not necessary ones. So a suspicion that theorem 1 is rarely applicable does not justify decision adjustment for individual researchers. The theorem shifts the burden onto those who would adjust: They must provide a positive argument for doing so.
The reader might speculate that given the extensive research on multiplicity, statisticians have (i) identified utility functions that plausibly represent the interests of scientists and (ii) shown that common adjustment procedures are uniquely maximin, or expected-utility maximizing with respect to those utility functions. Unfortunately, that’s not the case. Some classical procedures for multiple testing are, in fact, inadmissible (i.e., weakly dominated) for plausible utility/loss functions. Footnote 15 Thus, the criteria used to justify standard classical testing procedures are more complex than they might initially seem; I turn to those criteria now.
3.2 Family-wise error rates and false-discovery rates
Classical approaches to multiple testing typically aim to control either the family-wise error rate (FWER), which is the probability that a series of tests yields at least one false positive, or the false-discovery rate (FDR), which is the expected proportion of rejected null hypotheses that are true.
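For concreteness, here are minimal sketches of the two best-known procedures: Bonferroni, which controls the FWER, and Benjamini–Hochberg, which controls the FDR for independent tests. The $p$-values are made up for illustration.

```python
# Minimal sketches of two standard multiple-testing procedures.
# Bonferroni controls the FWER; Benjamini-Hochberg controls the FDR
# (under independence). The p-values below are made up for illustration.

def bonferroni(pvals, alpha=0.05):
    """Reject H_i iff p_i <= alpha / m, where m is the number of tests."""
    m = len(pvals)
    return [i for i, p in enumerate(pvals) if p <= alpha / m]

def benjamini_hochberg(pvals, q=0.05):
    """Reject the k smallest p-values, where k is the largest rank with
    p_(k) <= k * q / m."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * q / m:
            k = rank
    return sorted(order[:k])

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.2, 0.9]
print(bonferroni(pvals))          # only the smallest p-value survives
print(benjamini_hochberg(pvals))  # the FDR criterion is laxer here
```

Both procedures grow more conservative as the number of tests $m$ increases, which is precisely the behavior the Perneger-style objection targets.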
Statisticians routinely say that the FWER is rarely of interest. I agree. The FWER is almost always maximized when all null hypotheses are true. But in many applications, researchers know that at least one null hypothesis is false, so the worst case that FWER control guards against cannot obtain. Consider again genome-wide association studies that investigate the associations between thousands of genes and multiple heritable diseases. If at least one disease is known to be heritable and genes are the mechanism for inheritance, then there must be at least one gene that is associated with at least one disease!
Thus, some researchers now insist that multiple-testing regimes should control the FDR. If the FDR is identical to one’s loss function, are existing regimes maximin? Do they ever minimize subjective expected loss? The answer to both questions is clearly no. One minimizes the FDR (or FWER) by retaining all null hypotheses. Thus, as is standard in classical hypothesis testing, existing multiple-testing procedures typically (i) fix a threshold for FDR and (ii) attempt to maximize power (i.e., the probability of correctly rejecting false null hypotheses) subject to the constraint that the FDR is below the threshold. Assuming utility is identified with (some kind of) power, statisticians have identified testing regimes that are maximin among the set of procedures that maintain FDR and/or FWER below a threshold. Footnote 16
I will not rehearse standard objections to maximin reasoning, Footnote 17 nor to the bizarre two-step procedure in which one first culls testing procedures using FDR and then applies maximin. Instead, I emphasize that the decision criteria just described (1) treat all null hypotheses equally, (2) treat null hypotheses differently from alternatives, and (3) ignore effect sizes. However, there are virtually no circumstances in which such equal treatment and dismissal of effect size reflects either scientific or public interest.
Consider a recent influential genome-wide study in which researchers tested roughly 14,000 genes for associations with seven common diseases, which included bipolar disorder and Crohn’s disease (Wellcome Trust Case Control Consortium 2007). Although the authors of the study reported adjusted $p$ -values, they also laudably applied many statistical techniques, incorporated background genetic knowledge, and avoided making policy recommendations based solely on adjusted $p$ -values. Why did they not simply apply a testing procedure with good power subject to the control of FDR?
All seven diseases they considered are serious, but the incidence of each varies widely, as do the cost and efficacy of available treatments. From a public health perspective, therefore, it would be inappropriate to treat every hypothesis of the form “Gene $g$ is associated with disease $d$ ” equally and to ignore the strength of such associations.
One might object that the severity of the diseases does not affect the evidence for the various hypotheses. Does adjustment somehow reflect one’s evidence?
Answering that question is beyond the scope of this article; I lack the space to explore the relationship among evidence, belief, and decision. Footnote 18 But I am skeptical of both (a) the importance of the question and (b) an answer that involves classical procedures that control FDR or FWER.
Concerning (a), philosophers and scientists alike should be wary of directives to ignore the suffering caused by diseases and instead coldly evaluate only the evidence for empirical hypotheses. I admit that a subjective expected-utility analysis of genome-wide studies seems daunting. I have no idea how to define a prior over a roughly 100,000-dimensional (i.e., approximately $7 \cdot 14,000$ ) parameter space that incorporates expert knowledge. Nor do I have any idea how to define a utility function that balances considerations of the severity and incidence of different diseases. But I stress that mechanical use of multiple-testing procedures amounts to a refusal to engage with questions of ethical importance, not an answer.
Concerning (b), like many classical procedures, decision criteria that first cull tests by FWER or FDR treat null hypotheses differently from the alternatives. But if evidential strength is divorced from pragmatic and ethical considerations, it is hard to see how the asymmetric treatment of null and alternative hypotheses could reflect anything evidential: What could distinguish a hypothesis $H$ from its negation $\neg H$ , evidentially speaking?
4 Conclusions
The goals of scientists and of the public may be misaligned with the decision criteria used to evaluate multiple-testing regimes. Thus, I urge two broad projects for future research.
First, in scientific contexts in which large numbers of statistical hypotheses can be tested, scientists and philosophers must study the interests of the affected parties. The differential funding provided for medical research—in comparison to academic philosophy, for instance—is typically justified by its social importance. Scientists should make good on that promise to advance collective interests. Footnote 19
Second, statisticians must prove that existing testing procedures advance the interests of affected parties, or they must develop alternative procedures altogether. Otherwise, we all stand to be bamboozled by Bonferroni.
Supplementary material
For supplementary material accompanying this paper visit https://doi.org/10.1017/psa.2024.13