To the editor
We read with great interest the article by Ying et al. (2022), who reported on a well-conducted randomized controlled trial of cognitive-behavioral therapy (CBT) in alleviating depressive symptoms among Chinese patients with subthreshold depression. The results indicated that the internet-based CBT (ICBT) was significantly superior not only to the waiting list but also to face-to-face CBT. In interpreting the results from clinical trials, effect sizes are critically informative. However, we have some concerns about the effect sizes reported in this study.
In general, the standardized mean difference (s.m.d.) is widely used in clinical trials when the outcomes are continuous. The s.m.d. is standardized by dividing the mean difference (m.d.) by the standard deviation (s.d.), and allows comparison between studies that use different measuring instruments. However, there are various methods to calculate the s.m.d.: the s.m.d. can be calculated from different m.d.s (e.g. m.d. of endpoint scores, m.d. of change scores from baseline, m.d. from a model where the baseline score is adjusted) and s.d.s (e.g. pooled s.d. of endpoint scores, pooled s.d. of change scores, pooled s.d. of baseline scores, or the s.d. converted from model statistics). The s.m.d.s estimated by these methods are substantially different from one another, raising potential problems regarding reproducibility, selective reporting, and proper interpretation of how large the effect is (Luo et al., 2022). For instance, without prespecifying the calculation method, researchers may compute s.m.d.s using different methods, and select the largest one to report. Additionally, Cohen's rule of thumb, often used as a reference to interpret the effect size in clinical research, could be hard to apply if different calculation methods produce different s.m.d. values for the same outcome of the study.
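As a rough illustration of this point, the following sketch divides one and the same m.d. by three different s.d.s. All numbers are hypothetical and chosen only for illustration; they are not taken from any trial. The sketch assumes equal baseline means between arms (so that the endpoint and change-score m.d.s coincide) and derives the change-score s.d. from the standard variance-of-a-difference formula with an assumed pre-post correlation.

```python
import math

# HYPOTHETICAL summary statistics, for illustration only (not from any trial).
mean_difference = 2.0        # between-group m.d. (assumed equal for endpoint and change scores)
sd_baseline = 5.0            # pooled s.d. of baseline scores
sd_endpoint = 6.0            # pooled s.d. of endpoint scores
pre_post_correlation = 0.7   # assumed correlation between baseline and endpoint scores


def change_score_sd(sd_pre, sd_post, r):
    """s.d. of (post - pre) from the variance of a difference of correlated variables."""
    return math.sqrt(sd_pre**2 + sd_post**2 - 2 * r * sd_pre * sd_post)


sd_change = change_score_sd(sd_baseline, sd_endpoint, pre_post_correlation)

# Same m.d., three different denominators, three different s.m.d.s.
smd_by_denominator = {
    "baseline s.d.": mean_difference / sd_baseline,
    "endpoint s.d.": mean_difference / sd_endpoint,
    "change-score s.d.": mean_difference / sd_change,
}

for label, value in smd_by_denominator.items():
    print(f"{label}: s.m.d. = {value:.2f}")
```

Under these assumed inputs the three estimates are about 0.40, 0.33, and 0.46, so the choice of denominator alone shifts the reported effect size.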
In Ying et al.'s study, the method to calculate the effect size was explained in the Method section as, ‘Within-group and between-group effect size (Cohen's d) were based on the method suggested for mixed model analysis (Feingold, 2009; Morris & DeShon, 2002; Thorsell et al., 2011)’. However, it is still unclear which m.d. and which s.d. they used to calculate the effect sizes. Both Feingold's and Morris & DeShon's papers suggest using the pooled baseline s.d. for the m.d. estimated from a mixed model (Feingold, 2009; Morris & DeShon, 2002). On the other hand, in Thorsell et al.'s study, the square root of the variance estimate from the mixed model was used to calculate the effect size (Thorsell et al., 2011).
Ying et al.'s reported s.m.d.s do not seem to follow these methods. Take, for example, the between-group effect size of CES-D at post-intervention for ICBT v. face-to-face CBT, which was reported to be 0.06 (95% confidence interval: 0.02–0.09) in their Table 4. Using the values reported in Table 3 (the m.d. at endpoint was 1.6, the pooled baseline s.d. was 3.75, and the pooled endpoint s.d. was 3.80), the s.m.d. would be calculated as 0.43 using the baseline s.d. or 0.42 using the endpoint s.d. The reported s.m.d. could have been calculated from other s.d.s that were not reported in the paper, but for an m.d. of 1.6 to generate an s.m.d. of 0.06, the s.d. would need to be approximately 27. It is very difficult to imagine a population with such large variability in CES-D scores, which range from 0 to 60. There are similar discrepancies for the other between-group effect sizes reported in their Table 4.
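The arithmetic behind this comparison can be reproduced with a few lines of Python. The inputs are only the values quoted above from Ying et al.'s Tables 3 and 4; the variable and function names are ours, and the sketch simply divides the m.d. by each s.d. rather than reproducing any mixed-model analysis.

```python
# Values quoted in this letter from Ying et al.'s Tables 3 and 4 (CES-D, ICBT v. face-to-face CBT).
md_endpoint = 1.6            # between-group m.d. at endpoint
sd_baseline_pooled = 3.75    # pooled baseline s.d.
sd_endpoint_pooled = 3.80    # pooled endpoint s.d.
reported_smd = 0.06          # s.m.d. reported in their Table 4


def smd(mean_difference, standard_deviation):
    """Standardized mean difference: m.d. divided by the chosen s.d."""
    return mean_difference / standard_deviation


smd_baseline = smd(md_endpoint, sd_baseline_pooled)
smd_endpoint = smd(md_endpoint, sd_endpoint_pooled)

# s.d. that would be needed for an m.d. of 1.6 to yield the reported s.m.d.
implied_sd = md_endpoint / reported_smd

print(f"s.m.d. using baseline s.d.: {smd_baseline:.2f}")  # 0.43
print(f"s.m.d. using endpoint s.d.: {smd_endpoint:.2f}")  # 0.42
print(f"s.d. implied by reported s.m.d.: {implied_sd:.1f}")  # 26.7
```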
Because the s.m.d. values that we calculated differ so much from those reported by the authors, the size of the intervention effect could be interpreted very differently in clinical terms. The authors stated in the article that Cohen's rule of thumb was used as an aid for interpretation. Applying this rule, an s.m.d. of 0.4 would be moderate, while an s.m.d. of 0.06 would be less than a small effect. We wonder how the effects of ICBT and CBT should be properly interpreted, and whether the effect sizes estimated in this article can be appropriately compared with those of previous studies, which might have used different s.m.d. calculation methods.
In summary, because s.m.d.s can be calculated from different m.d.s and s.d.s, and the resulting estimates can vary substantially, researchers should be careful in reporting them and readers should be mindful of how the reported s.m.d.s were calculated. As it is still hard to recommend a single calculation method for universal use (Luo et al., 2022), it is desirable for researchers to report their calculation methods in detail to increase transparency and reproducibility.
Prespecifying the calculation method may help to avoid selective reporting bias. Meanwhile, future methodological studies are warranted to elucidate which s.m.d. calculation methods should be recommended.
Conflict of interest
None.