de Vries et al. (2016) argue that discussion of the 5-HTTLPR–stress gene–environment interaction (G × E) (Caspi et al. 2003) is more positive than merited because authors often cast negative results as positive in abstracts, and because negative papers with positive focus are differentially cited. These bold claims deserve careful scrutiny. We highlight four methodological choices that bias their primary results; moreover, the vast majority of papers disclose mixed and negative results in their abstracts (Table 1). Further, even if positive focus were prevalent, it could not bias meta-analytic results. The field can best move forward by improving environmental measurement.
Table 1 notes. Abbreviations: 5-HTTLPR, serotonin transporter gene; RD, risk difference; CI, confidence interval; BDNF, brain-derived neurotrophic factor; OR, odds ratio; SNP, single nucleotide polymorphism; MDD, major depressive disorder; G × E, gene–environment interaction.
a The primary focus of the paper was something other than the 5-HTTLPR G × E for depression.
b We debated whether to expect papers with a focus other than the 5-HTTLPR G × E (but which included it as an ancillary test) to report on this G × E in their abstracts. These include tests of G × G × E effects and one additive (G + G) × E test. To be conservative, we report results both ways. In each noted case, a paper tests a more complex effect but does not fully characterize the ancillary 5-HTTLPR G × E in the abstract.
Methodological concerns
de Vries et al. (2016) coded papers’ full results sections as positive or negative, then compared this coding with the positivity of abstracts’ conclusion sentences. Four choices that lead to errant conclusions stand in contrast to decisions reflecting care not to bias results, such as selecting the smallest p value when both traditional and triallelic results were available, and likewise when both adjusted and unadjusted results were available. Similarly, sensitivity analyses using the lowest p value should address several of these issues, but they still yield ‘positive focus’ results that contradict the disclosures we extracted from abstracts. We focus our comments on their primary approach, which informs their conclusions.
Averaging p values
When papers included multiple G × E p values, the authors averaged them in their primary analyses, an approach biased toward negative conclusions. For a hypothetical paper with three findings at the p = 0.001 level and one finding at the p = 0.300 level, the average of the four is non-significant by traditional standards, p = 0.076. But who would conclude such a paper was negative overall? Although the most inclusive 5-HTTLPR and life stress G × E meta-analysis took a similar approach (Sharpley et al. 2014), a bias for negative conclusions could be entirely appropriate for a meta-analysis that ultimately has positive conclusions. However, the negative bias favors the perspective of de Vries et al. (2016).
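To make the arithmetic concrete, here is a minimal sketch using the hypothetical p values above; the averaging mirrors the primary approach described, and the α = 0.05 threshold is the conventional assumption:

```python
# Averaging p values from a hypothetical paper with three significant
# findings (p = 0.001) and one non-significant finding (p = 0.300).
p_values = [0.001, 0.001, 0.001, 0.300]

mean_p = sum(p_values) / len(p_values)
print(f"average p = {mean_p:.3f}")  # 0.076: 'negative' once dichotomized

# Yet three of the four tests are individually significant at alpha = 0.05.
n_sig = sum(p < 0.05 for p in p_values)
print(f"{n_sig} of {len(p_values)} tests individually significant")
```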
Dichotomizing averaged p values
The authors imposed a false negative/positive dichotomy on averaged p values. For example, Jenness et al. (2011) reported a significant interaction for 5-HTTLPR with family chronic stress (p = 0.02) but not with recent stressful life events (p = 0.88), leading de Vries et al. (2016) to classify the paper as negative (average p = 0.46). Despite Jenness et al.’s disclosure of mixed findings in their abstract (Table 1), de Vries et al. (2016) labeled their paper as having partially positive focus relative to its ‘negative’ findings. An alternative, if imperfect, approach is to classify papers across at least three categories (positive, negative and mixed), then evaluate abstracts for fidelity to actual findings.
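As a minimal sketch of this three-category alternative (the coding rule and the α = 0.05 threshold are our illustrative assumptions, not a validated scheme):

```python
# Classify a paper's overall findings from its individual G x E p values,
# rather than dichotomizing their average.
ALPHA = 0.05  # assumed conventional threshold

def classify_findings(p_values):
    significant = [p < ALPHA for p in p_values]
    if all(significant):
        return "positive"
    if not any(significant):
        return "negative"
    return "mixed"

# Jenness et al. (2011): family chronic stress (p = 0.02) v.
# recent stressful life events (p = 0.88).
print(classify_findings([0.02, 0.88]))  # 'mixed', not 'negative'
```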
Unbiased or atheoretical?
The primary approach assumes that each of the averaged p values is equally valid, an assumption that runs roughshod over theory. Several papers specifically hypothesized that one of their tests was more valid than another, accordingly presented the results of both approaches, and found support for their hypothesis – sensitivity testing that refines G × E research and ought to be highly cited. Uher et al. (2011) found support for Brown & Harris’s (2008) hypothesis that the childhood adversity G × E predicts persisting depression, p = 0.003, but not single-episode depression, p = 0.231 (a finding replicated elsewhere; Brown et al. 2013). These results transparently appear in their abstract (Table 1), yet the faulty assumption that these tests are equally valid leads de Vries et al. (2016) to classify the papers of Uher et al. (2011) and Brown et al. (2013) as negative with positive focus. Although sensitivity analyses selecting the lowest p value ought to allow theory to favor a particular test, we identified abstract sentences disclosing results for more papers than these analyses suggest.
Evaluation of abstract conclusion sentences, not full abstracts
To determine whether abstracts had overly positive focus, the authors rated the conclusion sentence(s), not the full abstract. Such a selective approach disregards an abstract's ‘gestalt’ without any rationale for doing so. Where is the evidence that researchers cite papers based on abstract conclusion sentences? In contrast to the authors’ assertions, we were able to identify very clear acknowledgement of mixed results in all but seven of the 22 abstracts characterized as having (partially) positive focus (Table 1).
Results of alternative rating approach
To estimate the impact of these decisions on the positive focus ratings of de Vries et al. (2016), we rated the 38 ‘negative’ papers. Two raters examined results, assigning negative or mixed classifications, and examined the full abstract to determine whether negative or mixed results went undisclosed (ratings appear in online Supplementary Table S1). We extracted sentences demonstrating disclosure (Table 1). We deemed it unfair to expect papers with a primary focus other than the 5-HTTLPR G × E (e.g. a focus on a G × G × E), but which included it as an ancillary test, to report G × E results in their abstracts; to be conservative, we present results both ways. Non-matching ratings were adjudicated by group discussion. We characterized these 38 ‘negative’ studies as 58% negative (n = 22) and 42% mixed (n = 16). We assigned (partially) positive focus ratings to four to seven of the 22 articles that de Vries et al. (2016) characterized as having (partially) positive focus (depending on the treatment of papers with a focus other than the 5-HTTLPR G × E). We conclude that the ratings of de Vries et al. (2016), which form the basis for their evaluation of citation bias, are fundamentally flawed.
Sensitivity analyses using the lowest p value still do not square with evidence that authors disclosed results (Table 1): these analyses indicate that 12 papers have (partially) positive focus, relative to our four to seven. Moreover, although the authors suggested that sensitivity analyses did not markedly influence their findings (for citation bias), their estimate of (partially) positive focus drops by 26% relative to their negative ratings (from 22/38 to 12/28) and by 45% relative to the population of 73 studies. Their procedures thus have a marked impact on the estimated prevalence of positive focus. We observe that this is not reported in their abstract.
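For transparency, a minimal sketch of this arithmetic as we read it; the counts come from the text above, and the relative-drop computation is ours:

```python
# Relative drop in the prevalence of (partially) positive focus when the
# lowest-p sensitivity analysis replaces the primary averaging approach.
def relative_drop(before, after):
    return (before - after) / before

# Among 'negative' papers: 22/38 (primary) v. 12/28 (sensitivity).
print(f"{relative_drop(22 / 38, 12 / 28):.0%}")  # 26%

# Relative to the full population of 73 studies: 22/73 v. 12/73.
print(f"{relative_drop(22 / 73, 12 / 73):.0%}")  # 45%
```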
Biased conclusions
A conclusion the authors draw in their own abstract is noteworthy: ‘discussion of the 5-HTTLPR–stress interaction is more positive than warranted’. How positive should the discussion be? Clearly, this is controversial. On the one hand, two negative meta-analyses included small numbers of reports selected for homogeneous designs (k = 5 and 14, respectively; Munafò et al. 2009; Risch et al. 2009), many G × E investigations are under-powered (Duncan & Keller, 2011), and we observed some questionable research practices as we read. On the other hand, inclusive meta-analyses from Karg et al. (2011) (k = 54) and Sharpley et al. (2014) (k = 81) both reach positive conclusions, with Sharpley et al. (2014) showing that the meta-analytic effect emerges across four separate design subtypes. Karg et al. (2011) show that differences between the negative meta-analyses and theirs are due to paper selection, not meta-analytic technique. Papers selected for their statistically homogeneous designs tend to have methodological flaws, including retrospective lifetime assessment of stress and depression (Moffitt & Caspi, 2014), leading to confounding (Uher & McGuffin, 2010). Moreover, Karg et al. (2011) show that reports with more robust measures of stress (interview and objective measures) yield a more robust meta-analytic effect, so much so that others observe an almost 1:1 relationship between stress measurement quality and the likelihood of at least partial G × E effect replication (Uher & McGuffin, 2010). Neither positive focus nor citation bias influences this evidence. There is at least a reasonable basis for concluding that this is a legitimate G × E effect. Thus, when papers characterize the results of the 5-HTTLPR G × E literature positively and cite positive studies, how is this ‘more positive than warranted’?
Where to go from here?
There is a much larger problem – and opportunity for progress – in G × E depression research. The unique environment contributes roughly 60% of the risk for depression (Sullivan et al. 2000), yet in G × E research we often fail to invest in environmental measurement. Many G × E researchers measure the environment with insufficiently valid measures (for discussion, see Monroe & Reid, 2008; Uher & McGuffin, 2010; Karg et al. 2011; Sharpley et al. 2014). But in addition, we must all more carefully conceptualize the ‘candidate environment’.
Recent work indicates that chronic stress and major-severity interpersonal stress were consistent unique predictors of depressive episode onset across two samples of emerging adults, whereas minor stressors were never unique predictors and non-interpersonal stressors rarely were (Vrshek-Schallhorn et al. 2015). Early evidence indicates that these distinctions matter for G × E tests: whereas no G × E effect emerged for minor events, consistent with expectations, an overall G × E effect between 5-HTTLPR and major events was accounted for exclusively by major interpersonal events and not by non-interpersonal ones (Vrshek-Schallhorn et al. 2014). All forms and severities of stress are not created equal. As G × E research moves beyond 5-HTTLPR, we hope the field will work toward large-scale G × E research with valid, thoughtfully conceptualized environmental measures.
Conclusions
Although positive focus sometimes occurs in G × E research, as we expect it unfortunately does throughout science, de Vries et al. (2016) exemplify bias through their own methodological choices. Four choices, including classifying abstracts by their conclusion sentences alone, bias their primary results, and sensitivity tests do not overcome these issues. Ultimately, the depression–genetics research enterprise aims to enhance prediction of and intervention for depression. It is time we all renewed our ‘positive focus’ on that goal.
Supplementary material
The supplementary material for this article can be found at http://dx.doi.org/10.1017/S0033291716002178
Acknowledgements
The authors thank Drs Paul Silvia and Thomas Kwapil who provided comments on an earlier version.
Declaration of Interest
None.