Introduction
There are few concepts more vital to the integrity of a scientific discipline than replication. Put simply, replication involves re-testing a hypothesis to corroborate a scientific result (Schmidt, 2009). Replication functions as the final arbiter of scientific knowledge – forcing scientists to refine (or discard) flawed theories that cannot precisely predict the outcome of successive experiments (Francis, 2012).
Replication is especially important in the clinical sciences, where failure to reproduce scientific results can lead to the dissemination of ineffective clinical practices (Prasad et al., 2013). Reproducibility has been investigated in medicine and psychiatry (Ioannidis, 2005; Tajika et al., 2015) but, unfortunately, not in psychotherapy research (Tackett et al., 2019).
Based on thousands of clinical trials and hundreds of reviews of those trials, it is incontrovertible that psychotherapy is an effective intervention across a wide range of mental health problems (Lambert, 2013; Wampold and Imel, 2015; Munder et al., 2019; Cuijpers et al., 2019a), and replications to address the question of absolute efficacy would merely provide redundant information. However, there are ongoing questions about the superiority of particular psychotherapies (Wampold, 2005; Tolin, 2014; Wampold et al., 2017; Cuijpers et al., 2019b).
There is little scientific consensus about what constitutes a meaningful superiority result in psychotherapy research. Occasionally, however, a single psychotherapy comparison or a few of these comparisons produce large effects (Cohen, 1988). These trials can influence meta-analytic conclusions regarding treatment superiority, and in turn, treatment guidelines (Cottraux et al., 2000; Clark et al., 2006; Wampold et al., 2017). The replicability of these trials is important because large effects might well be false-positive results.
The purpose of the present study was to examine the replications of psychotherapy trials demonstrating large treatment differences. We focused on trials for depression and anxiety-related disorders because they are among the most prevalent (Kessler et al., 2005a, 2005b), and many psychotherapies for these disorders are classified as having strong empirical support (see https://www.div12.org/psychological-treatments/frequently-asked-questions/).
Ingredients of replication
Schmidt (2009) distinguished two broad categories of scientific replication: direct and conceptual. Generally, a direct replication involves procedural duplication of an earlier experiment, whereas a conceptual replication involves testing the central hypothesis of the earlier study through alternative experimental arrangements (Schmidt, 2009). As the ‘gold standard’ of experimental designs, randomised clinical trials (RCTs) are well suited for direct or conceptual replication, but a direct replication offers the strongest corroboratory evidence because it most closely resembles the original trial (Schmidt, 2009). Alternatively, conceptual replication may well elucidate other aspects of the phenomenon, such as generalizability.
The experimental logic of a clinical trial is standard across scientific disciplines. In the case of psychotherapy comparisons, patients suffering from a particular disorder are recruited, randomly assigned to two or more treatment conditions, given the respective treatments, and evaluated at termination to compare outcomes between treatment groups (Heppner et al., 2008). Better outcomes for the patients in one treatment condition vis-à-vis the outcomes for the other treatment, beyond what is expected by chance, are evidence for the superiority of that treatment (Heppner et al., 2008).
Despite this sound logic, inferences regarding treatment superiority (i.e. Treatment A is superior to Treatment B) in psychotherapy trials are more complicated. Blinding is not possible in psychotherapy, and researcher allegiance to a particular treatment can influence outcomes in direct comparisons (McLeod, 2009; Munder et al., 2011, 2013). Moreover, unlike medications, psychotherapies cannot be completely standardised across patients or administrations. In psychotherapy, the therapist delivering the treatment can make a difference, regardless of the degree to which the treatment is standardised; that is, therapists are not interchangeable (Baldwin and Imel, 2013). Consequently, therapist characteristics (training, experience, etc.) and representativeness should be considerations for replication.
Despite these complexities, enough methodological features can be controlled to distinguish direct from conceptual replication of a psychotherapy trial. In a direct replication, nearly all trial features must be the same in the original trial and the replication attempt (denoted by ‘+’ in Table 1). These features broadly include duplication of treatment delivery, therapist expertise, characteristics of the patient sample and outcome assessment.
Note: ‘+’ = a feature that must remain the same in the original trial and the replication; ‘−’ = a feature that may differ in the replication.
Some trial features, however, can be altered so that a clinical trial qualifies as a conceptual replication when the researcher is interested in examining parametric effects or generalizability across contexts. The number of psychotherapy sessions delivered (treatment dose), the format of treatment (i.e. individual or group; see Table 1) and the context (e.g. university speciality clinic v. community clinic) are examples of features that may vary between the original trial and the replication attempt in these situations. A defining characteristic of a conceptual replication is that specific features of the original study are intentionally varied to examine a particular conjecture. A case could be made, however, that to accept any superiority result as an established scientific finding, at least one direct replication is required (Schmidt, 2009).
Method
This replication investigation employed the following steps. First, meta-analyses of comparative psychotherapy treatments for depression and anxiety were identified. Second, trials with large effects (d ≥ 0.8) were culled from those meta-analyses. Third, all citations to the identified trials were examined to determine whether they qualified as direct or conceptual replications according to the components listed in Table 1. These processes differ from canonical reviews, and therefore various reporting standards (e.g. PRISMA) were not applicable. Although these procedures were determined a priori, the research was not registered.
Identification of meta-analyses
Meta-analyses were only included in this study if they met the following criteria: (a) they included RCTs that directly compared at least two active psychosocial treatments, (b) patients were adults who suffered from a diagnosable anxiety or depressive psychiatric condition as defined by any edition of the Diagnostic and Statistical Manual of Mental Disorders or other accepted diagnostic nosologies, and (c) they were published in a peer-reviewed journal.
The search for meta-analytic reviews comprised an exhaustive search of the PsycINFO database via the ProQuest search engine by two of the authors (BLIND). Primary search terms included ‘Depression’, ‘Anxiety’ and ‘Meta-analysis’. Filtering options for adult population and peer-reviewed journal were applied to limit search results. Potential meta-analyses were evaluated by all the authors to determine whether they met the eligibility criteria until full consensus was reached.
Large effect trials
A clinical trial database was developed that included all the comparative trials contained in the meta-analyses. Some meta-analyses included only comparative trials; for other meta-analyses, comparative trials formed a subset of trials. In the latter situation, only the comparative trials were included in this study. Some trials were contained in more than one meta-analysis and these duplicate trials were noted and then excluded so that a trial only appeared once in the trial database.
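As a concrete illustration of this step, the following is a minimal sketch, in Python, of how duplicate trials can be removed so each appears only once in the trial database; the table layout and trial identifiers are hypothetical.

```python
# Minimal sketch of the deduplication step; column names and trial
# identifiers are hypothetical illustrations, not the study's data.
import pandas as pd

# One row per treatment comparison extracted from each meta-analysis.
comparisons = pd.DataFrame({
    "meta_analysis": ["MA1", "MA1", "MA2", "MA2"],
    "trial_id": ["Shapiro1994", "Shaw1977", "Shapiro1994", "Arntz2003"],
    "effect_size_d": [1.44, 1.13, 1.44, 0.14],
})

# Keep the first occurrence of each trial so that a trial appearing in
# several meta-analyses enters the trial database only once.
trial_db = comparisons.drop_duplicates(subset="trial_id", keep="first")
print(trial_db)
```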
Effect size conventions (Cohen, 1988) stipulate that small effect sizes are in the range of 0.2, medium effect sizes in the range of 0.5 and large effect sizes in the range of 0.8. To identify the largest effect trials from our database, we used a cut-off value of d = 0.80. Trials that exhibited absolute effect sizes equal to or greater than this cut-off were included in a new dataset that contained only large effect trials. Only targeted outcomes that measured diagnosis-specific symptomatology were considered in this investigation, as reported in the published meta-analysis or supplied by the meta-analytic authors upon our request in two cases (Tolin, 2010; Cuijpers et al., 2016).
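For reference, the effect size in question is the standardised mean difference; a minimal sketch of the usual two-arm formula, assuming equal-variance pooling, is

$$
d = \frac{\bar{X}_1 - \bar{X}_2}{s_p}, \qquad
s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}},
$$

where $\bar{X}_j$, $s_j$ and $n_j$ are the post-treatment mean, standard deviation and sample size of treatment arm $j$; comparisons with $|d| \geq 0.80$ were retained.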
Finally, trials were excluded that were not ‘bona fide’ psychotherapy comparisons (Wampold et al., 1997). The bona fide criteria were interpreted liberally; we eliminated only comparisons to treatment-as-usual conditions and comparisons to ‘non-active’ control conditions, as identified by the trials' authors. The large effect trials were graded on their overall methodological quality by two independent raters (two of the authors) using the Psychotherapy Quality Rating Scale (PQRS) (Kocsis et al., 2010). This scale was developed for RCTs of psychotherapy and has demonstrated good internal consistency in validation studies (α = 0.87). The scale includes 24 items, each scored from 0 to 2, examining six trial quality domains: definition and delivery of treatment, patient description, outcome assessment, appropriateness of data analysis, treatment assignment and overall quality. A final item (item 25) ranges from 1 to 7, where a score of 1 indicates an exceptionally poor trial and 7 an exceptionally good trial. Thus, the PQRS total ranges from 1 to 55 (a maximum of 48 from the first 24 items plus 7 from the final item), with higher scores indicating a higher quality trial.
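A minimal sketch of the total-score arithmetic, assuming the item structure described above (items 1–24 scored 0–2, item 25 scored 1–7); the helper function is hypothetical.

```python
# Hypothetical helper illustrating the PQRS total-score arithmetic:
# items 1-24 are scored 0-2 and item 25 is scored 1-7, so the total
# ranges from 0 + 1 = 1 to 48 + 7 = 55.
def pqrs_total(items_1_to_24: list[int], item_25: int) -> int:
    assert len(items_1_to_24) == 24
    assert all(0 <= score <= 2 for score in items_1_to_24)
    assert 1 <= item_25 <= 7
    return sum(items_1_to_24) + item_25

print(pqrs_total([2] * 24, 7))  # 55, the maximum possible score
```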
Replication searches
The first step in locating replications was based on the premise that a replication would cite the trial it was replicating. Accordingly, we used the Web of Science to search all citations of each large effect trial in our database. The number of citations for each trial is shown in Table 3. Presumably, a replication would use the word replication in the title or abstract, which we coded. However, partly because only one citation to any of the trials used the term replication in this way, we did not limit the identification of replications to this standard. Rather, each citing study was coded by two independent raters (two of the authors), who applied the criteria in Table 1 to determine whether it was a replication of a previous trial. If either rater indicated that a trial was a possible replication, we included that trial in our results.
After completing this process, we were concerned that we may have missed replication attempts of trials in our database. Thus, in a post-hoc search, we used two key words (Psychotherapy, Replication) to search the Web of Science and PsycINFO databases for any replication attempt of a psychotherapy clinical trial as identified by its title. Results were filtered only by ‘clinical trial methodology’ from the list of methodologies. An advantage of this secondary search was that it allowed us to assess the extent of replication efforts in the psychotherapy literature writ large.
Results
Meta-analyses
Search results for the meta-analyses returned 648 ‘hits’ from PsycINFO. In total, 638 meta-analyses were excluded for the lack of direct comparisons, the inclusion of non-RCT research designs or the lack of patient samples with a defined disorder (see Fig. 1). Ultimately, ten meta-analyses met all inclusion criteria and composed the meta-analysis database (Table 2).
CBT, cognitive behavioural therapy; PDT, psychodynamic therapy; NDST, non-directive supportive therapy; IPT, interpersonal therapy; EDT, experiential dynamic therapy.
Note: all effect sizes are reported for targeted symptom measures.
Of the ten meta-analyses, four examined populations suffering from depression, five focused on both depression and anxiety-related disorders, and one examined anxiety. Aggregate effect size estimates ranged from 0.02 to 0.46 in these meta-analyses, and five of the ten reported finding evidence for treatment superiority (see Table 2). An important caveat concerns Tolin's (2014) meta-analysis, which found cognitive-behaviour therapy (CBT) superior to other psychotherapies for anxiety disorders; this conclusion was corrected in a subsequent corrigendum owing to a calculation error in the earlier analysis (Tolin, 2015), and we used the corrected effect sizes in this study.
Clinical trials
Clinical trials drawn from these meta-analyses spanned the years 1972–2016, totalling 137 unique trials with 157 treatment comparisons – involving CBT, dynamic therapies, behavioural activation, cognitive therapy (CT), interpersonal therapy and emotion-focused therapies. Effect sizes from all trials ranged from d = 0.0 to d = 1.56 on targeted outcomes, with a mean effect of d = 0.41 (s.d. = 0.32). Approximately 57% (k = 88) of the comparisons yielded small-to-medium effects, 34% (k = 56) yielded medium-to-large effects, and 9% (k = 14) exhibited large effects equal to or greater than the d = 0.8 cut-off used in this study. These large effect sizes came from 12 distinct clinical trials (see Table 3). It is important to note that five of the large effect trials appeared in more than one of our meta-analyses.
CBT, cognitive behaviour therapy; CT, cognitive therapy; DYN, dynamic therapy; REM, reminiscence psychotherapy; EFT, emotion-focused therapy; TRN, transdiagnostic therapy; SUP, supportive therapy; BAT, behavioural activation therapy; RLX, relaxation therapy; GFT, goal-focused therapy; AR, applied relaxation; GAD, generalised anxiety disorder; PD, panic disorder. Quality ratings were averaged between two independent raters.
a Manual not available.
*Trial unavailable in English. Quality = item 25 from the PQRS.
Large effect trials
Treatment comparisons from the large effect trials included CBT, interpersonal therapy (IPT), psychodynamic psychotherapies (DYN), transdiagnostic therapies (TRN), manualised supportive psychotherapies (SUP) and behavioural activation (BAT). The average number of patients per treatment condition was 21.8 and the average number of psychotherapy sessions delivered for each treatment condition was 13.6. There were no systematic differences between superior and inferior treatments in terms of sample size or treatment dose in these trials. Effect sizes at post-treatment on targeted outcomes (e.g. Beck Depression Inventory (BDI) for depression) ranged from d = 0.8 to d = 1.56, with a mean effect of d = 1.1 (s.d. = 0.25).
Interrater agreement for the trial quality ratings was good (ICC = 0.83) (Borenstein et al., 2009), and the final PQRS scores were the average of the two raters' scores (Table 3). Ratings on the PQRS ranged from 16 to 43 out of a possible 55. No large effect trial was rated as exceptionally good, but two were rated as very good; over half were rated as exceptionally poor, very poor or moderately poor. Treatment comparisons in the lowest quality trials found CBT superior to behavioural activation (d = 0.89) for depressed adults (Taylor and Marshall, 1977); CT superior to behaviour therapy and supportive therapy (d = 1.13 and d = 0.90, respectively) for depressed college students (Shaw, 1977); and goal-focused group psychotherapy superior to group reminiscence therapy (d = 1.14) for late-life depressed adults (Klausner et al., 1998). The highest quality trials found dynamic therapy superior to applied relaxation (d = 0.89) for adults with panic disorder (Milrod et al., 2007) and CBT superior to dynamic therapy (d = 1.44) (Shapiro et al., 1994). Trial characteristics that attenuated quality ratings included poor definition or execution of treatment delivery and unclear description of patients and diagnostic classification.
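As an illustration, interrater agreement of this kind can be computed with a standard intraclass-correlation routine; a minimal sketch using the pingouin package follows, with hypothetical ratings rather than the study's data.

```python
# Minimal sketch of computing interrater agreement (ICC) for two
# raters' PQRS totals; the scores below are hypothetical.
import pandas as pd
import pingouin as pg

ratings = pd.DataFrame({
    "trial": ["t1", "t1", "t2", "t2", "t3", "t3", "t4", "t4"],
    "rater": ["r1", "r2"] * 4,
    "pqrs":  [16, 18, 42, 44, 30, 29, 25, 27],
})

# Report all ICC variants; the appropriate one depends on the design.
icc = pg.intraclass_corr(data=ratings, targets="trial",
                         raters="rater", ratings="pqrs")
print(icc[["Type", "Description", "ICC"]])

# Final quality score per trial = mean of the two raters, as in the text.
print(ratings.groupby("trial")["pqrs"].mean())
```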
Replication search results
The large effect trials were cited 1035 times in the Web of Science (see Table 3), with an average of 85 citations per trial. The least cited trial (Taylor and Marshall, 1977) was not cited at all in the Web of Science, and the most cited (Shapiro et al., 1994) was cited 268 times. Of all the recorded citations, 126 (12%) were clinical trials.
Applying the criteria from Table 1 to each citing clinical trial, we found no trials that could be deemed a direct replication attempt in terms of their methodology. That is, no trial methodologically duplicated an original trial by comparing the same treatments (dose, format, manual), in the same population (diagnosis, patient characteristics), with therapists of the same training and experience, using an approximately similar measure of clinical outcome. In fact, only 15 of all the citing trials screened included at least one general treatment from the original trial in an active comparison (Table 3). The absence of identical treatment comparisons made these trials eligible only as conceptual replication attempts (see online Supplementary Table S4).
Among the 15 trials that could possibly be considered conceptual replications, treatment format and dose varied from the original trial, which is acceptable in a conceptual replication whose purpose is generalizability. However, only six of these possible replications (Wilson et al., 1983; Barkham et al., 1996, 1999; Mohr et al., 2001; Arntz, 2003; Gallagher-Thompson et al., 2003) used the same treatment manual as the original trial. The number of potential replication attempts was further reduced because only four of these trials included an approximately similar diagnostic population of patients (Wilson et al., 1983; Barkham et al., 1996; Mohr et al., 2001; Arntz, 2003). Of these four trials, only one was explicitly identified as a replication attempt by its authors (Barkham et al., 1996). Characteristics of the replication attempts can be seen in online Supplementary Table S4.
Are any of these trials ‘successful’ replication attempts? Conclusions from two of the four quasi-valid conceptual replication attempts loosely corroborated the original trial. Mohr et al. (2001) compared 16 weeks of CBT to dynamic therapy and found a significant difference at post-treatment on the BDI in favour of CBT for depressed patients with multiple sclerosis; however, this effect was notably smaller (d = 0.55 v. d = 1.27) than the post-treatment difference found in the original trial. Barkham et al. (1996) intentionally replicated Shapiro et al.'s (1994) trial and found general equivalence between CBT and dynamic therapy at post-treatment; the significant differences favouring CBT observed on the BDI (d = 1.44) in the original trial were not replicated. The remaining two replication attempts reached conclusions divergent from the original trials they cited. In direct contrast to Durham et al. (1994), Arntz (2003) found applied relaxation and CT to be equivalent after 12 sessions of treatment on composite measures of generalised anxiety disorder (d = 0.14). Finally, Wilson et al. (1983) found small differences (d = 0.25) between CT and behavioural treatment following eight sessions of counselling, whereas Shaw (1977) had found CT superior to behaviour therapy (d = 1.13). As a whole, these conceptual replications produced weaker or contradictory results compared with the original trials they cited.
The general replication search returned 38 hits for our search terms Replication and Psychotherapy. Examining each hit revealed no direct or conceptual replication attempts of a psychotherapy comparison for adult depression or anxiety. More specifically, six studies were excluded because they examined a child population, ten did not include direct psychotherapy comparisons, and four were not clinical trials at all. The remaining 18 studies were narrative reviews or comparisons to treatment-as-usual or waitlist conditions. One study was identified as a replication attempt, but not of a comparative psychotherapy result (Chambless et al., 2017).
Discussion
We examined replication attempts of trials that exhibited large effects in published meta-analyses. The essential finding is that replications are scarce: there were no direct replication attempts and only a few possible conceptual replication attempts, many of which failed to corroborate the conclusions of the original trial. Several large effect trials had poor quality ratings; however, even the two trials rated as ‘very good’ quality fared no better. One of these (Milrod et al., 2007) had no conceptual replication attempts at all. The other (Shapiro et al., 1994) had two possible conceptual replication attempts, but one used a sub-clinical population and the other did not corroborate the original result, particularly the large effect favouring CBT vis-à-vis dynamic therapy (Barkham et al., 1996, 1999).
The non-replication of trials producing large effects is problematic for meta-analytic conclusions. Several such trials appeared in multiple meta-analyses from our database. Meta-analyses with more than two large effect trials reported the strongest evidence for treatment superiority. For example, Tolin (2010) found CBT superior to other treatments at post-treatment on measures of depression or anxiety (d = 0.22), and this analysis included four trials with large effects. Similarly, Cuijpers et al. (2016) reported evidence for the superiority of CBT (d = 0.29) on measures of depression at post-treatment and included four trials with large effects. By contrast, the meta-analyses in our database with only one or no large effect trials reported more modest findings, even for the same class of treatments (e.g. CBT) (Cuijpers et al., 2006, 2010b).
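To see mechanically why a handful of large-effect trials can dominate a pooled estimate, recall the standard inverse-variance weighting used in meta-analysis; a minimal sketch of the fixed-effect form (random-effects weighting adds a between-trial variance term) is

$$
\hat{d} = \frac{\sum_i w_i d_i}{\sum_i w_i}, \qquad w_i = \frac{1}{v_i},
$$

where $d_i$ is the effect and $v_i$ the sampling variance of trial $i$. A few unreplicated trials with large $d_i$ can pull the pooled estimate $\hat{d}$ well above the value implied by the remaining trials.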
Before discussing the broader implications of these findings, we examine the most valid replication attempt from our results. The original trial, the Second Sheffield Psychotherapy Project (SPP2) conducted by Shapiro et al. (1994), compared cognitive-behavioural psychotherapy to psychodynamic-interpersonal psychotherapy at two durations (eight or 16 sessions) for the treatment of depression. In this study, the patients were treated in a research clinic at the University of Sheffield.
Barkham et al. (1996) sought to conceptually replicate the results of SPP2 in an applied setting, as opposed to the special arrangements that existed in a speciality research clinic in a university setting, intentionally varying one aspect of the experimental arrangement. Thus, the intention and procedure align with the idea of a conceptual replication (Schmidt, 2009). The title of the Barkham trial clearly denoted this intention: ‘Outcomes of time-limited psychotherapy in applied settings: Replicating the Second Sheffield Psychotherapy Project’ (emphasis added). Moreover, Barkham et al. explicitly compared the results of the original study and the replication, as summarised in online Supplementary Table S5. As the only trial in our investigation that closely hewed to the idea of a conceptual replication, it could serve as a model for researchers.
Recent reviews have expressed the need for an increased focus on replication in the clinical social sciences (Tackett et al., 2019). We add to these efforts by highlighting how non-replicated findings influence superiority conclusions in psychotherapy. The trials with large effects in our review tended to have small samples, so it cannot be ruled out that their effects were due to chance. At the same time, our unrestricted search suggests that replication attempts are rare regardless of trial effect or sample size. A recent investigation (Sakaluk et al., 2019) supports this conclusion, reporting ‘weak’ evidence for replicability across empirically supported psychotherapies, although a limitation of that study was its classification system, which grouped heterogeneous treatments and treatment comparisons.
Moving forward, replication should be a collective priority in psychotherapy research. One challenge is trial quality. High-quality trials appear to be the exception, not the rule, in psychotherapy research (Cuijpers et al., 2010a). The case could be made that methodologically flawed trials should not be replicated; from this perspective, the dearth of replications of poor trials is expected, even desirable. However, if a low-quality trial is unworthy of a replication attempt, should it be admissible as scientific evidence in the first place? Low-quality trials should not simultaneously serve as scientific evidence in meta-analyses and be exempt from replication standards. For example, Shaw (1977) and Shear et al. (2001) were two of the low-quality trials in our review, yet they appeared in multiple meta-analyses from our database (see Table 3). Unrepresentative therapists (a single therapist in Shaw) and an unavailable treatment manual (Shear, personal communication, 7 March 2015) make these trials impossible to replicate.
Aside from trial quality, other challenges exist. Failure to replicate findings is often attributed to a lack of adherence to the treatment protocol (Laska et al., 2014). For example, when advocates of a particular treatment are confronted with a trial that fails to support the treatment, they often claim that the treatment was not delivered with adequate adherence to the manual (Laska et al., 2014; Wampold and Imel, 2015), an argument that attributes importance to the precise nature of treatments. According to this view – and one that we adopted for this investigation – the particular manual used to guide treatment delivery in a given trial must also be used in its replication. Incremental steps towards addressing these challenges include (1) the development of authoritative replication criteria for clinical trials and (2) the stipulation of quality benchmarks to aid decisions about which trials to replicate.
The results of our investigation should be interpreted with caution. The replication criteria developed in Table 1 are the most logically consistent with the concepts of direct and conceptual replication in science writ large, but these criteria are not universally accepted for psychotherapy trials. In addition, our cut-off threshold (d = 0.80) for large effect trials is arbitrary, but by the same token, any cut-off value would be similarly arbitrary. Nevertheless, the conclusion that there are few replications of comparative psychotherapy studies seems robust (i.e. it does not depend on the cut-off chosen).
In conclusion, attempts to identify the most effective treatment for a particular disorder have not been successfully reproduced, even among the trials showing the strongest evidence. It seems wise to further investigate the replicability of psychotherapy trials, especially if we are to have confidence in treatment guidelines premised on the assumption of treatment superiority.
Supplementary material
The supplementary material for this article can be found at https://doi.org/10.1017/S2045796020000402
Data
Clinical trial and meta-analytic databases for this review can be found at https://data.mendeley.com/drafts/.
Acknowledgements
None.
Financial support
None.
Conflict of interest
The authors have no conflicts of interest to declare.