1 Introduction
When researchers design an experiment, they must choose among many different experiments that could test the same general claim. Clifford, Leeper, and Rainey (2024) and Clifford and Rainey (2024) refer to these experiments as varying in their “topic,” but “topic” can refer to substantive or ancillary details.Footnote 1 It is reasonable to suppose that each of these possible experiments has a different (average) treatment effect ${\delta}_j = E\left({y}_{i\left[j\right]}\mid {T}_{i\left[j\right]} = 1\right)-E\left({y}_{i\left[j\right]}\mid {T}_{i\left[j\right]} = 0\right)$, where ${y}_{i\left[j\right]}$ represents the potential outcome for respondent $i$ under topic $j$ and ${T}_{i\left[j\right]}$ represents the treatment indicator. However, researchers might want to generalize beyond any single topic that they might use in their experiment.Footnote 2
Chong and Druckman (2013, 14) directly discuss the tension between their inferences about a single topic and inferences to their larger theoretical population:
Our results are potentially circumscribed by our focus on a single issue and a single approach to operationalizing attitude strength. However, we believe our theory should apply to any issue, including hotly debated issues on which most people hold strong prior opinions; attempts to frame public opinion on such issues will be more difficult or may fail outright.
This quote highlights a weakness of single-topic studies: the variation in treatment effects across topics might be large relative to the sampling variability in the estimate for the single topic. That is, the standard error from a single-topic study reflects the sampling error in estimating the treatment effect for that topic. The standard error does not include the error from choosing an unrepresentative topic, and this topic-level error might be large. Experimenters are sensitive to this issue. The quote from Chong and Druckman (2013) is not unrepresentative.Footnote 3 Thus, our goal with this paper is not to make experimentalists aware of this general issue—they are clearly well-aware and mindful of the problem. Instead, our goal is to use a formal framework to make the limitations of single-topic studies concrete and precise.
To generalize beyond a single topic, Clifford, Leeper, and Rainey (2024) and Clifford and Rainey (2024) suggest marginalizing across “topics” or across the collection of possible experiments that the researcher might use to test the same conceptual claim. They define a “typical” treatment effect $\overline{\delta}$ as the average of the treatment effects ${\delta}_j$ across topics. To estimate the typical treatment effect across topics, they suggest topic sampling: (1) take a random sample of topics from the larger population, (2) run parallel arms of the experiment for each topic, and (3) use a hierarchical model to pool the estimates and summarize the heterogeneity. Clifford, Leeper, and Rainey (2024) and Tappin (2023) offer recent examples of this basic approach, and Clifford and Rainey (2024) describe the estimators in detail.Footnote 4
The precise statement of the quantity of interest—the “typical” treatment effect across topics—allows a precise description of the slippage between studies of single topics and inferences about the larger collection. We explore this disconnect by focusing on the properties of confidence intervals (CIs) from single-topic studies. We ask: how often does the 95% CI from a single-topic study capture the typical treatment effect?Footnote 5 We focus on two scenarios: (1) when the researcher chooses a topic at random and (2) when the researcher intentionally chooses an unrepresentative topic with a large treatment effect. By thinking carefully about the coverage of these intervals, we can reason precisely about the limits and strengths of single-topic studies.
2 A Randomly Chosen Topic
First, we consider the CIs for a randomly chosen topic. That is, in each repetition of the experiment, the researcher uses a new topic selected at random from the population. When there is any variation in the treatment effects across topics, the CIs from the single-topic study are too narrow to cover the typical treatment effect at the nominal rate.Footnote 6 As the variation across topics becomes large and/or the sampling variability in the single-topic estimate becomes small (e.g., large $N$), the coverage (slowly) approaches 0%.Footnote 7
Consider a stylized single-topic design. In this stylized design, the researcher selects $N$ subjects and assigns half to treatment and half to control, then computes the difference in the two means to estimate the treatment effect. Assuming equal variance ${\sigma}_y^2$ , this difference has a standard error ${\mathrm{SE}}_{\delta_j} = \sqrt{\frac{2{\sigma}_y^2}{\frac{1}{2}N}} = \sqrt{\frac{4{\sigma}_y^2}{N}} = \frac{2{\sigma}_y}{\sqrt{N}}$ . Importantly, this standard error models the error in the estimated treatment effect due to randomly assigning respondents to treatment and control. This standard error does not include the error from choosing an unrepresentative topic.
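To make this concrete, the following minimal sketch (in Python; not the authors’ replication code, and the helper name `single_topic_se` is ours) computes the single-topic standard error for an even split of $N$ subjects:

```python
# Minimal sketch: standard error of the difference in means for a single-topic
# experiment with N subjects split evenly between treatment and control,
# assuming a common outcome standard deviation sigma_y.
import math

def single_topic_se(sigma_y: float, n: int) -> float:
    """sqrt(4 * sigma_y^2 / N) = 2 * sigma_y / sqrt(N)."""
    return 2 * sigma_y / math.sqrt(n)

# With the values used later in the note (sigma_y = 1.87, N = 1,000),
# this gives about 0.118.
print(single_topic_se(1.87, 1000))
```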
Although it seems obvious that this standard error is for the estimate of the treatment effect for the particular topic, researchers might be tempted to expand their conclusions about the specific topic to the broader collection of possible topics, which might better represent the theoretical claim. For example, Bakker, Lelkes, and Malka (2020, 1073) use food irradiation and farm subsidies to determine which motive—bounded rationality or expressive utility—is “more salient for partisan cue-taking.” They specifically note the limitations on generalizability, but their broader claim is more general than food irradiation and farm subsidies. Similarly, Nicholson (2012, 64) studies immigration and foreclosure policy and concludes that “out-party leaders play a more potent role in shaping partisan opinion.” The author again addresses limitations on generalizability, but aims to generalize beyond immigration and foreclosure. Thus, using specific topics to test a broader theoretical claim makes it tempting to generalize beyond the specific topics. So, we ask: “Under what conditions is it ‘particularly bad’ to generalize beyond the specific topic(s) of the experiment?” To answer this question, we use the framework from Clifford and Rainey (2024) to examine how often the single-topic CI captures the typical treatment effect in the broader population of topics.
Suppose that the particular treatment effect differs from the typical treatment effect by ${\psi}_j$ and that ${\psi}_j$ is a random variable with variance ${\sigma}_{\psi}^2$ .Footnote 8 The parameter ${\sigma}_{\psi }$ captures the similarity in treatment effects across topics. Treating ${\psi}_j$ as random (e.g., the single topic is selected at random), the standard error of the difference-in-means becomes $\sqrt{{\mathrm{SE}}_{\delta_j}^2+{\sigma}_{\psi}^2}$ . This implies that the standard error for the single-topic estimate needs to be $\sqrt{\frac{\sigma_{\psi}^2}{{\mathrm{SE}}_{\delta_j}^2}+1} = \sqrt{\frac{\sigma_{\psi}^2}{\frac{4{\sigma}_y^2}{N}}+1}$ times larger to capture the typical effect across topics at the nominal rate.Footnote 9 Alternatively, the standard error for the single-topic study is only $\left(1/\sqrt{\frac{\sigma_{\psi}^2}{\frac{4{\sigma}_y^2}{N}}+1}\;\right)\times 100\%$ as wide as the standard error for the typical treatment effect.
The party cue experiment in Clifford, Leeper, and Rainey (2024) allows us to select reasonable values of ${\sigma}_y$ and ${\sigma}_{\psi}$. We use their data to estimate these parameters and obtain ${\sigma}_y = 1.87$ and ${\sigma}_{\psi} = 0.18$.Footnote 10 If we set $N = \mathrm{1,000}$, the deflation factor is $1/\sqrt{\frac{\sigma_{\psi}^2}{\frac{4{\sigma}_y^2}{N}}+1}\approx 0.55$. Therefore, with a sample size of 1,000, the CI from a single-topic study is only about 55% as wide as it needs to be to capture the typical effect at the nominal rate.Footnote 11 For sample sizes ranging from 100 to 3,000, Panel A of Figure 1 shows that the deflation factor ranges from 90% to 35%.
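The deflation factor is easy to compute directly. The following sketch (Python; the function name `deflation_factor` is ours, and the parameter values are the estimates reported above) reproduces the numbers in the text and in Panel A of Figure 1:

```python
# Sketch of the deflation factor: how much narrower the single-topic CI is than
# an interval that would capture the typical effect at the nominal rate.
import math

def deflation_factor(sigma_y: float, sigma_psi: float, n: int) -> float:
    se_sq = 4 * sigma_y**2 / n                    # SE_{delta_j}^2 for the single topic
    return 1 / math.sqrt(sigma_psi**2 / se_sq + 1)

print(round(deflation_factor(1.87, 0.18, 1000), 2))  # 0.55
for n in (100, 3000):                                 # roughly 0.90 and 0.35
    print(n, round(deflation_factor(1.87, 0.18, n), 2))
```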
Alternatively, we can compute the coverage of these intervals. How often does the CI from the single-topic study capture the typical treatment effect?Footnote 12 For the single-topic study with ${\sigma}_y = 1.87$ , ${\sigma}_{\psi} = 0.18$ , and $N = \mathrm{1,000}$ , the 95% CI captures the typical treatment effect 72% of the time. This might seem surprisingly high. After all, the interval is only about half as wide as it needs to be to capture the typical effect 95% of the time. However, this interval fails to capture the typical treatment effect 28% of the time. If the researcher uses the single-topic study to infer the direction of the typical effect, the size of the test is 14% rather than the nominal 2.5% (one-tailed)—increasing the error rate by about 460%. The bottom panel of Figure 1 shows that the coverage ranges from 92% to 51% as the sample size ranges from 100 to 3,000.
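Under the assumption that the topic-level deviation and the sampling error are independent and approximately normal (as in the framework above), the capture rate has a closed form, which the following sketch (Python with SciPy; the function name `coverage_random_topic` is ours) reproduces:

```python
# Sketch of the coverage calculation for a randomly chosen topic: the 95% CI
# captures the typical effect when the estimate falls within 1.96 * SE of it,
# and (estimate - typical effect) has SD sqrt(SE^2 + sigma_psi^2).
from scipy.stats import norm

def coverage_random_topic(sigma_y: float, sigma_psi: float, n: int, z: float = 1.96) -> float:
    se = 2 * sigma_y / n**0.5                   # single-topic SE
    total_sd = (se**2 + sigma_psi**2) ** 0.5    # SD of (estimate - typical effect)
    return 2 * norm.cdf(z * se / total_sd) - 1

cov = coverage_random_topic(1.87, 0.18, 1000)
print(round(cov, 2))            # about 0.72
print(round((1 - cov) / 2, 2))  # one-tailed size, about 0.14
```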
To solidify the intuition for this process, Panels C and D of Figure 1 show many 95% CIs for single-topic studies with $N = \mathrm{1,000}$ and an alternative interval with nominal coverage. While the point estimates in Panel C equal the typical effect on average, the CIs are much too narrow to consistently capture the typical effect. In Panel D, we increase the width of the CIs (multiplying by the inflation factor $c = \sqrt{\frac{\sigma_{\psi}^2}{{\mathrm{SE}}_{\delta_j}^2}+1}$ described above) to obtain the nominal coverage rate of 95%. These panels clearly show that the interval from the single-topic study is much too narrow. Many 95% CIs miss the typical effect and some miss badly.
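A small Monte Carlo sketch (Python with NumPy; the variable names and seed are ours, and the parameter values are the illustrative ones above) conveys the same intuition as Panels C and D: the usual 95% CI captures the typical effect far less than 95% of the time, while the interval widened by $c$ recovers nominal coverage.

```python
# Simulate many single-topic studies with a randomly drawn topic and compare
# the usual 95% CI to one widened by the inflation factor c.
import numpy as np

rng = np.random.default_rng(42)
sigma_y, sigma_psi, n, reps = 1.87, 0.18, 1000, 100_000
typical_effect = 0.0                               # location is arbitrary
se = 2 * sigma_y / np.sqrt(n)
c = np.sqrt(sigma_psi**2 / se**2 + 1)              # inflation factor from above

topic_effects = rng.normal(typical_effect, sigma_psi, reps)  # delta_j for each study
estimates = rng.normal(topic_effects, se)                    # difference-in-means estimates

naive = np.abs(estimates - typical_effect) <= 1.96 * se        # single-topic 95% CI
widened = np.abs(estimates - typical_effect) <= 1.96 * c * se  # widened CI

print(naive.mean(), widened.mean())  # roughly 0.72 and 0.95
```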
Perhaps ironically, as estimates from a single-topic study become more precise, they become less likely to capture the typical treatment effect. One possible incorrect conclusion is that researchers should therefore prefer smaller sample sizes and wider CIs because they are more likely to capture the typical effect; that is not correct. To study the typical effect, researchers must build on a careful collection of single-topic studies and generalize using appropriate tools (e.g., Clifford, Leeper, and Rainey 2024). So long as researchers remain mindful of the limitations, more precise estimates from single-topic studies are helpful in this effort to generalize.
3 An Intentionally Unrepresentative Topic
In addition to a randomly chosen topic, we consider the behavior of the 95% CI when using an intentionally unrepresentative topic. When initially testing an idea, researchers might use a topic that they expect to produce especially large treatment effects. For example, in a study of partisan cues, Levendusky (2010, 119) selected issues for which respondents would have “weak prior beliefs” such that the “experimental manipulation—and not the respondent’s pre-existing opinion—is the key source of information about the issue.”Footnote 13 This can be a useful tool—it allows the researcher to obtain high statistical power without a large, expensive sample. But how often will the 95% CI from this single-topic study capture the typical treatment effect?
To analyze this situation, we assume that the researcher chooses a topic with a treatment effect that falls one standard deviation above average in the population of topics (or at the 84th percentile). Using the same values of ${\sigma}_y = 1.87$ , ${\sigma}_{\psi} = 0.18$ , and $N = \mathrm{1,000}$ , only about 67% of the 95% CIs will capture the typical effect (almost all misses are above the typical effect). Perhaps more starkly, about one in three 95% CIs will not include the typical treatment effect. This means that the size of the test is 33%. If the researcher uses the single-topic study to infer that the typical effect is positive, the error rate under the null will be about 33% rather than the nominal 2.5% (one-tailed)—increasing the error rate by about 1,220% above the nominal rate. Panel A of Figure 2 shows that the capture rate ranges from 92% to 25% for sample sizes from 100 to 3,000. Panel B shows several simulated 95% CIs from this single-topic study to clarify the behavior of the CIs. These intervals are not too narrow, they are simply shifted above the typical effect (by design).
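The same closed-form approach gives the capture rate when the chosen topic’s effect sits one standard deviation ($\sigma_{\psi}$) above the typical effect. The sketch below (Python with SciPy; the function name `coverage_shifted_topic` is ours) reproduces the figures in the text and in Panel A of Figure 2:

```python
# Sketch of the capture rate when the selected topic's effect is one standard
# deviation (sigma_psi) above the typical effect, assuming normal sampling error.
from scipy.stats import norm

def coverage_shifted_topic(sigma_y: float, sigma_psi: float, n: int, z: float = 1.96) -> float:
    se = 2 * sigma_y / n**0.5
    shift = sigma_psi / se    # how far (in SEs) the topic effect sits above the typical effect
    return norm.cdf(z - shift) - norm.cdf(-z - shift)

print(round(coverage_shifted_topic(1.87, 0.18, 1000), 2))  # about 0.67
for n in (100, 3000):                                       # roughly 0.92 and 0.25
    print(n, round(coverage_shifted_topic(1.87, 0.18, n), 2))
```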
4 Conclusion
In this research note, we consider the generalizability of single-topic studies. Researchers and readers must remember that single-topic studies are studies of particular topics, and the estimates for those topics do not allow the researcher or reader to generalize to other topics without assumptions about the similarity of the topics. Experimentalists certainly appreciate that findings from one topic might not generalize to others, but we provide a concrete, statistical description of the problem that allows researchers and readers to properly limit their inferences and understand the risks of over-generalization.
We focus on how often the CIs from single-topic studies capture the typical treatment effect and show that coverage is much lower than the nominal rate. Using hypothetical but plausible parameters, we show that for a study with 1,000 respondents (1) the 95% CI for a randomly selected topic might capture the typical treatment effect about 72% of the time (missing high and low) and (2) the 95% CI for an intentionally chosen topic with a large treatment effect might capture the typical treatment effect about 67% of the time (missing high). These are only illustrations for a plausible scenario, and the properties will vary across contexts. However, our plausible scenario clearly illustrates the slippage between studies of single topics and inferences about other possible experiments.
We emphasize three conclusions. First, single-topic studies have limits, and we encourage researchers to remain mindful of those limits. Researchers should (continue to) take care when drawing inferences about a larger population of topics from a single-topic study, particularly regarding the size of an effect.
Second, there is, of course, a use for single-topic studies. They can be useful early in a research program to convincingly establish a particular effect because they allow researchers to obtain high statistical power without huge sample sizes. For many research problems, it seems reasonable to assume that treatment effects across topics will tend to have the same direction. If researchers are comfortable with this working assumption, then single-topic studies allow researchers to establish the plausibility of the theory relatively efficiently: they can select a topic with an especially large treatment effect and thus dramatically increase their statistical power. A common concern is pretreated respondents (e.g., Slothuus 2016), but careful selection of a single topic can allow researchers to avoid small effects due to pretreatment.Footnote 14 Of course, the generalizability of the treatment effect of the hand-picked topic—even the direction of the effect—remains an assumption in the absence of studies of other topics. Thus, single-topic studies are an important first step in the research process but should not be the last. With several careful studies of single topics to draw upon, researchers can improve generalizability by investing resources into systematically studying multiple topics (e.g., Clifford, Leeper, and Rainey 2024; Clifford and Rainey 2024) rather than pouring all their resources into a careful study of a single topic. Nonetheless, a collection of careful studies of single topics remains a valuable foundation—perhaps an essential foundation—for more generalizable work.
Third, this discussion highlights the statistical and substantive importance of the similarity of treatment effects across topics. When one conducts a single-topic study, it is common and natural to speculate about how the estimated effects might generalize to other topics. This leads researchers to think about the variability of effects across topics and leads naturally to the topic-sampling framework (Clifford, Leeper, and Rainey 2024; Clifford and Rainey 2024). This motivation highlights that knowledge of the variability across topics is critical for drawing inferences about the typical effect. But estimating the variability is not only statistically helpful; it is also substantively meaningful. The variability in treatment effects across topics is a substantive quantity of interest that researchers should consider further. The framework from Clifford and Rainey (2024) allows us to describe the impact of variability in treatment effects across topics on inferences. But, more importantly, their framework also allows researchers to easily design experiments to estimate the similarity and the typical treatment effect—what we assume are (or at least might sometimes be) the ultimate quantities of interest.
Data Availability Statement
All code to reproduce our results is available on Dataverse at https://doi.org/10.7910/DVN/QX0BSK (Rainey 2024).