
Historical trends in reporting effect sizes in clinical neuropsychology journals: A call to venture beyond the results section

Published online by Cambridge University Press:  10 February 2023

Steven Paul Woods*, Andrea Mustafa, Ilex Beltran-Najera, Anastasia Matchanova, Jennifer L. Thompson, and Natalie C. Ridgely

Department of Psychology, University of Houston, Houston, TX 77204, USA

*Corresponding author: Steven Paul Woods, email: [email protected]

Abstract

Objective:

For decades, quantitative psychologists have recommended that authors report effect sizes to convey the magnitude and potential clinical relevance of statistical associations. However, fewer than one-third of neuropsychology articles published in the early 2000s reported effect sizes. This study re-examines the frequency and extent of effect size reporting in neuropsychology journal articles by manuscript section and over time.

Methods:

A sample of 326 empirical articles was drawn from 36 randomly selected issues of six neuropsychology journals at 5-year intervals between 1995 and 2020. Four raters used a novel, reliable coding system to quantify the extent to which effect sizes were included in the major sections of all 326 articles.

Results:

Findings showed medium-to-large increases in effect size reporting in the Methods and Results sections of neuropsychology journal articles that plateaued in recent years; however, there were only very small and nonsignificant changes in effect size reporting in the Abstract, Introduction, and Discussion sections.

Conclusions:

Authors in neuropsychology journals have markedly improved their effect size reporting in the core Methods and Results sections, but are still unlikely to consider these valuable metrics when motivating their study hypotheses and interpreting the conceptual and clinical implications of their findings. Recommendations are provided to encourage more widespread integration of effect sizes in neuropsychological research.

Research Article

Copyright © INS. Published by Cambridge University Press, 2023

Introduction

Null hypothesis significance testing (NHST) is a nearly ubiquitous approach to statistical analysis that asks a simple question: “Given that H0 is true, what is the probability of these (or more extreme) data?” (Cohen, 1994). For decades, however, methodologists and statisticians in the psychological sciences have expressed frustration about the field’s dependence on NHST and p-values (Yates, 1951). The dissatisfaction with NHST and p-values is well documented and includes critiques of their reliance on binary, arbitrary cut points to determine the “significance” of an analysis, the widespread misunderstanding of their underlying logic, and their inability to convey information about the size of an effect (Cohen, 1994). The discontent with NHST has sparked numerous recommendations for alternative (e.g., Bayesian) and complementary (e.g., visualization, confidence intervals) approaches to statistical analysis, including consideration of effect sizes (e.g., Cicchetti, 1998).

An effect size is a quantitative indicator of the magnitude of a statistical association; in other words, effect sizes provide the reader with information about the strength of a finding. Effect sizes come in a variety of forms, ranging from Pearson’s r values that describe the strength of association between two continuous variables, to Cohen’s d values that describe the size of a score difference between two groups, to more complex multivariable (e.g., partial eta-squared or R²) and classification accuracy (e.g., odds ratios; Woods et al., 2003) metrics. Reporting effect sizes is important for a variety of practical and interpretive reasons (see Table 1). In reviewing the literature to motivate a study hypothesis, effect sizes help us to better understand inconsistencies across studies (e.g., meta-analysis; Demakis, 2006) and to estimate the potential clinical relevance of a particular finding. In planning a study, effect sizes directly inform power analyses that aid in determining the sample size required to detect a hypothesized association (Woods et al., 2006). In selecting measures for a study, effect sizes inform us about the extent to which various tools show evidence of reliability and validity. In interpreting a result, effect sizes can guide our thinking about its consistency with the relevant literature and its theoretical and clinical importance. For example, it is usually not sufficient to know that a neuropsychological test score is “significantly different” between two samples; rather, its clinical value will depend more on the magnitude of that group difference (e.g., Cohen’s d, percentage sample overlap; Zakzanis, 2001) and its classification accuracy at various cut points (e.g., Bieliauskas et al., 1997).

Table 1. Examples of the ways in which effect sizes can be integrated into different sections of a standard article

Note. The American Psychological Association’s Journal Article Reporting Standards (Appelbaum et al., 2018) informed the content of this table, and quotes were adapted from studies published by our group.
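For readers who want the arithmetic behind the most commonly reported metrics mentioned above, the standard textbook definitions of three of them are sketched below; these formulas are conventional and are not reproduced from the articles reviewed here.

```latex
% Cohen's d: standardized difference between two group means
d = \frac{\bar{X}_1 - \bar{X}_2}{s_{\text{pooled}}},
\qquad
s_{\text{pooled}} = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}

% Partial eta-squared: proportion of variance attributable to an effect
\eta_p^2 = \frac{SS_{\text{effect}}}{SS_{\text{effect}} + SS_{\text{error}}}

% Odds ratio from a 2 x 2 classification table with cell counts a, b, c, d
OR = \frac{a/b}{c/d} = \frac{ad}{bc}
```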

Quantitative psychologists have been calling for the reporting and integration of effect sizes for nearly 100 years (Fisher, 1925). In 1996, the American Psychological Association (APA) formally recommended that researchers report measures of effect size (see also Wilkinson & the Task Force on Statistical Inference, 1999). Prior to the APA recommendations, effect size reporting was fairly uncommon in psychological research. Peng et al. (2013) estimated that approximately 29% (range = 0, 77%) of psychology journal articles included effect sizes in the modern era. Encouragingly, there has been a modest increase in effect size reporting frequency in psychology journal articles (range = 35, 60%) in the decades after the APA recommendations first appeared (e.g., Peng et al., 2013). Nevertheless, there remains a wide range of compliance in the reporting of effect sizes both within and across journals and subdisciplines of psychology (Fritz et al., 2012; Kirk, 1996; Sun et al., 2010).

Editors, authors, and reviewers for neuropsychology journals have historically shown awareness of the importance of effect sizes in facilitating our understanding of research on brain–behavior relationships. To our knowledge, the first overt mention of “effect sizes” in a neuropsychology journal article occurred in a 1989 neuroimaging study of reading by Huettner et al. (1989). Even so, it is not uncommon to encounter effect sizes (e.g., correlations, classification accuracy) when reading articles published decades earlier, when neuropsychology was still in its infancy (Reitan, 1955). A more formal call for integration of effect sizes in neuropsychological research came in a 1998 editorial on NHST by Domenic Cicchetti in the Journal of Clinical and Experimental Neuropsychology (JCEN). Similarly inclined articles about NHST and effect sizes appeared in prominent neuropsychology journals at around the same time, including papers by Clark (1999), Donders (2000), Zakzanis (2001), and Millis (2003).

Three studies published in the early 2000s told a sobering story about the frequency with which effect sizes were being reported in neuropsychology journals. Bezeau & Graves (2001) found that only 6 of 66 articles (9.1%) published in the 1998–1999 volumes of Neuropsychology, JCEN, and the Journal of the International Neuropsychological Society (JINS) reported effect sizes. Similarly, Schatz et al. (2005) found that only 4 of 62 articles (4.3%) published in Archives of Clinical Neuropsychology (ACN) from 1990–1992 reported effect sizes. Encouragingly, Schatz et al. observed a small-to-medium increase in effect size reporting frequency in ACN that corresponded to the release of the initial APA statistical guidelines: The frequency of effect size reporting in ACN was 22.4% in publication years 1996–2000 and rose to 29.4% for publication years 2001–2004. Finally, Woods et al. (2003) observed that only 58 of 200 relevant articles (29%) published in ACN, JCEN, JINS, Neuropsychology, and The Clinical Neuropsychologist (TCN) in 2000–2001 included classification accuracy statistics. The frequency of classification accuracy statistics reported in neuropsychology journals was moderately lower than for similar studies published in clinical neurology journals (52%).

The current study extends the literature on effect size reporting in neuropsychology journals in three important ways. First, we examine whether the increases in effect size reporting in neuropsychology journals in the early 2000s (Schatz et al., 2005) have been lost, maintained, or accelerated. Second, we extend the depth of measuring effect size reporting beyond binary indicators of their presence or absence. The interpretive limitation of such binary indicators is that an article reporting a single effect size receives the same score as an article that accompanies every inferential test with an effect size. A coding system that captures a broader range of effect size reporting and integration might therefore be informative (Kirk, 1996). Third, we extend the breadth of measuring effect size reporting in neuropsychology by examining usage beyond the Results section of an article. Although it is appropriate to report effect sizes in the Results, their interpretive value extends well beyond that technical section of a manuscript (Appelbaum et al., 2018; Table 1). Indeed, the spirit of effect size reporting is to help the reader understand the magnitude of an observed effect (APA, 1996). Yet it is our impression that effect size reporting often begins and ends in the Results section of neuropsychology journal articles. Therefore, this study aimed to estimate the frequency and extent of effect size reporting across article sections in a sample drawn from six neuropsychology journals published between 1995 and 2020.

Methods

The study used publicly available data derived from published scientific articles and therefore did not require approval from a human subjects review board. The research was completed in compliance with the Helsinki Declaration.

Article effect size reporting coding system

We developed a novel coding system to quantify the frequency of effect size reporting and integration in scientific articles (see Supplementary materials). In brief, the effect size coding system provided guidelines for trained raters to assign values of 0 to 3 for each major section of a scientific article (i.e., Abstract, Introduction, Methods, Results, and Discussion). Higher values indicate greater frequency and use of effect sizes in that section of the article. Examples of the qualities of sections at each rating level are displayed below:

  0 = Absent (i.e., no reporting or mention of effect sizes)

  1 = Minimal (e.g., a brief mention of the magnitude of an effect in the Introduction or Discussion; a few reliability values in the Methods; or a few correlation coefficients in the Results or Abstract)

  2 = Moderate (e.g., effect sizes are mentioned multiple times or given some interpretive weight in the Introduction or Discussion; several reliability values in the Methods; or several NHST analyses in the Results or Abstract are accompanied by estimates of effect size)

  3 = Extensive (e.g., a primary focus of the motivation for the study and interpretation of the findings is rooted in effect sizes; extensive reporting of reliability and validity effect sizes in the Methods; a majority of the NHST analyses in the Results or Abstract are accompanied by an estimate of effect size)

We adopted an inclusive approach to defining effect sizes because there is variability in design, measurement, and analysis across neuropsychological studies. Thus, we included a broad array of statistical values that allow readers to estimate the magnitude of: (1) an association between variables (e.g., Pearson’s r, Spearman’s rho, beta coefficients); (2) the amount of variance explained in a regression or related procedure (e.g., R², ΔR²); (3) an analysis of variance or related procedure (e.g., Cohen’s d or ηp²); (4) an association between categorical variables (e.g., Cramer’s V); and (5) classification accuracy (e.g., sensitivity, odds ratios). In all cases, the actual value of the effect size of interest must have been reported to be coded as present (e.g., a p-value reported for a Pearson’s r analysis without the accompanying r value was not counted).

The ratings were generated by four individuals with graduate-level statistics training and master’s degrees in psychology. The four raters underwent an initial orientation with the senior author to discuss the scientific foundation of effect sizes and the proposed coding system. Next, the four raters used the coding system to independently score a single article selected from a 2021 issue of a major neuropsychology journal. The scores generated for this article were then discussed as a group to co-register the raters and further calibrate the coding system. All raters then independently scored 10 articles that were randomly selected from 2021 issues of six widely read clinical neuropsychology journals (Sweet et al., 2011), namely Archives of Clinical Neuropsychology (ACN), Child Neuropsychology (CN), Journal of Clinical and Experimental Neuropsychology (JCEN), Journal of the International Neuropsychological Society (JINS), Neuropsychology (NP), and The Clinical Neuropsychologist (TCN). The intraclass correlation coefficient (ICC) for these effect size ratings was 0.935 (95% confidence interval = 0.832, 0.982), suggesting excellent inter-rater reliability.
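For illustration, this kind of inter-rater reliability check can be reproduced with standard tools. The sketch below assumes the calibration ratings are arranged in long format with hypothetical column names and uses the pingouin library; the authors do not state which software or ICC variant they used.

```python
# Minimal sketch of an inter-rater reliability check on total effect size
# scores. The data, column names, and choice of ICC variant are illustrative
# assumptions, not taken from the paper.
import pandas as pd
import pingouin as pg

ratings = pd.DataFrame({
    "article": [1] * 4 + [2] * 4 + [3] * 4,        # article identifier (targets)
    "rater":   ["A", "B", "C", "D"] * 3,           # rater identifier
    "total":   [6, 7, 6, 6, 2, 3, 2, 2, 10, 9, 10, 11],  # total score (0-15)
})

# Returns a table of ICC estimates (ICC1, ICC2, ICC3 and their averaged forms)
icc = pg.intraclass_corr(data=ratings, targets="article",
                         raters="rater", ratings="total")
print(icc[["Type", "ICC", "CI95%"]])
```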

Article sampling and coding procedure

Google® random number generator was used to select one issue from each of ACN, CN, JCEN, JINS, NP, and TCN at 5-year intervals between 1995 and 2020. Thus, there were 36 journal issues that contained a total of 395 articles. We excluded 69 nonempirical articles, such as editorials, conference proceedings, errata, corrigenda, presidential addresses, case studies, qualitative literature reviews, and book reviews. The remaining 326 articles comprised the final sample and included papers from all six journals (ACN n = 50, CN n = 37, JCEN n = 52, JINS n = 65, NP n = 80, and TCN n = 42) for the years 1995 (n = 39), 2000 (n = 66), 2005 (n = 57), 2010 (n = 58), 2015 (n = 59), and 2020 (n = 47).

Each of the four raters was randomly assigned nine of these 36 journal issues. The raters used the effect size coding system to generate scores of 0 to 3 for the Abstract, Introduction, Methods, Results, and Discussion sections of every eligible article in their assigned issues. A total effect size score was calculated by summing the section ratings for each article. The total effect size score had adequate internal consistency (Cronbach’s alpha = 0.734). The range of possible total effect size scores was 0 to 15, with higher values reflecting greater use of effect sizes (sample range = 0, 13). The senior author double-scored a randomly selected article from each of the 36 issues for quality assurance purposes; the modal value of changes to the total score was 0 (range = 0, 2).
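As a concrete illustration of this scoring step, the snippet below computes total effect size scores and Cronbach’s alpha from a small set of hypothetical section ratings; the data and code are ours, not the authors’.

```python
# Illustrative computation of total effect size scores and Cronbach's alpha
# from the five section ratings (0-3 each). Values are hypothetical; the
# alpha formula is the standard item-variance form.
import numpy as np

# rows = articles, columns = [Abstract, Introduction, Methods, Results, Discussion]
ratings = np.array([
    [0, 0, 1, 2, 1],
    [1, 0, 1, 3, 2],
    [0, 1, 2, 3, 1],
    [0, 0, 0, 1, 1],
], dtype=float)

total_scores = ratings.sum(axis=1)            # possible range: 0-15

k = ratings.shape[1]                          # number of items (sections)
item_vars = ratings.var(axis=0, ddof=1)       # variance of each section rating
total_var = total_scores.var(ddof=1)          # variance of the summed score
alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)

print(total_scores, round(alpha, 3))
```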

Data analysis

A multivariate analysis of variance was conducted to examine possible differences in the continuous effect size ratings (i.e., the five section ratings, a within-article variable) by publication year (i.e., a six-level categorical between-group variable). We also included journal (i.e., a six-level categorical between-group variable) and rater (i.e., a four-level categorical between-group variable) in this model. Given the large sample size and the exploratory nature of the analyses, the critical alpha was set at .01. Partial eta-squared (ηp²) effect sizes were generated, with values of 0.01, 0.06, and 0.14 interpreted as small, medium, and large, respectively (Cohen, 1988). A Bonferroni adjustment or Tukey–Kramer HSD was used for post hoc tests, which were accompanied by Cohen’s d estimates of effect size. Analyses were conducted in SPSS (version 26).
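As a worked illustration of the effect size conventions used in these analyses, the hypothetical helper below recovers partial eta-squared from an F test, using the standard relation ηp² = (F × df_effect) / (F × df_effect + df_error), and applies Cohen’s (1988) benchmarks. It is not the authors’ SPSS syntax, and the example values are invented.

```python
# Hypothetical helper: partial eta-squared from an F statistic, labeled with
# Cohen's (1988) benchmarks of .01 (small), .06 (medium), and .14 (large).
def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """eta_p^2 = (F * df_effect) / (F * df_effect + df_error)."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)

def label_eta(eta_p2: float) -> str:
    if eta_p2 >= 0.14:
        return "large"
    if eta_p2 >= 0.06:
        return "medium"
    if eta_p2 >= 0.01:
        return "small"
    return "negligible"

# Example with made-up values: F(2, 100) = 4.0
eta = partial_eta_squared(f_value=4.0, df_effect=2, df_error=100)
print(round(eta, 3), label_eta(eta))   # -> 0.074 "medium"
```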

Results

The primary findings are displayed graphically in Figure 1. Results showed that the effect size ratings were not significantly related to Journal (F[1,318] = 2.1, p = .145, ηp² = .007) or Rater (F[1,318] = 4.4, p = .037, ηp² = .014), and the associated effect sizes were very small. More importantly, we observed a significant, medium within-factor effect of article section (F[3.4, 1091.9] = 32.3, p < .0001, ηp² = .092). Bonferroni-adjusted paired-sample t-tests (alpha = .006) showed the highest effect size ratings in the Results (ps < .0001, median d = 1.51), followed by the Discussion (ps < .0001, median d = .67) and the Methods (ps < .0001, median d = .34). The Abstract did not differ from the Introduction, and the associated effect size was small (p = .015, d = .14). There was also a significant, medium-sized between-group effect of publication year (F[5,318] = 8.1, p < .001, ηp² = .077). Tukey–Kramer HSD tests showed that effect size values for 1995 were lower than all other years (ps < .05, median d = .79), except for 2000 (p = .146, d = .46).

Figure 1. Line graph displaying the interaction between publication year and article section on the reporting and interpretation of effect sizes. The standard error of the mean (SEM) values for the effect size ratings were .03 (Abstract), .04 (Introduction), .04 (Method), .06 (Results), and .05 (Discussion).

Interpretation of these main effects was tempered by a significant, medium-sized interaction between article section and publication year (F[20,1248] = 3.4, p < .0001, ηp² = .092). Bonferroni-adjusted independent-samples t-tests (alpha = .008) showed small and nonsignificant associations between publication year and the Abstract (p = .271, ηp² = .020), Introduction (p = .341, ηp² = .017), and Discussion (p = .061, ηp² = .032) sections. However, there was a significant, medium-sized difference in the effect size ratings for the Methods section across publication years (p = .0002, ηp² = .073), such that 1995 and 2000 were both lower than 2015 and 2020 (Tukey–Kramer HSD ps < .05, median d = .66). No other publication years differed in their effect size ratings for the Methods section (ps > .05, median d = .26). There was also a significant and large difference in the effect size ratings for the Results section across publication years (p < .0001, ηp² = .144). Post hoc tests showed that effect size ratings for 1995 and 2000 were considerably lower than those for 2010, 2015, and 2020 (Tukey–Kramer HSD ps < .05, median d = .93). Publication year 1995 also differed from 2005 (p = .0003, d = .82). No other publication years differed in their effect size ratings for the Results section (ps > .05, median d = .21).

Finally, we calculated the simple frequency of effect size reporting in articles published in 2020. Figure 2 shows that 93.6% of articles published in 2020 included some indicator of effect size (operationalized as a rating > 0). At the descriptive level, the highest frequency of effect size reporting was in the Results and Discussion sections (93.6%) and the lowest was in the Abstract (34%).

Figure 2. Bar graph showing the frequency with which any effect sizes were mentioned in different sections of 47 articles published across six neuropsychology journals in 2020.

Discussion

Effect sizes can help authors, reviewers, and readers of neuropsychological articles better understand the magnitude and potential clinical relevance of statistical associations. Historically, however, fewer than one-third of the articles published in neuropsychology journals included even a single measure of effect size (Bezeau & Graves, 2001). The current study used a novel, reliable coding system to estimate the frequency and extent of effect size reporting in six widely read neuropsychology journals at 5-year intervals from 1995 to 2020. Taken together, the results of this bibliographic study give neuropsychology journal articles a mixed report card for their effect size reporting practices. On the positive side, there has been noticeable progress in the reporting of effect sizes in neuropsychology journal articles in the past 20 years (see Figure 3). Over 90% of the neuropsychology articles sampled from publication year 2020 included at least one measure of effect size in the Results section. Neuropsychology’s current effect size reporting frequency is approximately three times higher than it was 20 years ago (Bezeau & Graves, 2001; Schatz et al., 2005; Woods et al., 2003) and exceeds the reporting practices in the broader field of psychology (Peng et al., 2013). Effect size reporting was fairly minimal in the 1995 issues of neuropsychology journals, improved considerably by 2005, and was maintained at that level through 2020. This finding extends the earlier work by Schatz et al. (2005), who noted that articles in ACN improved their effect size reporting frequency after the publication of the APA journal article reporting standards in 1996.

Figure 3. Bar graph showing the frequency with which any effect sizes were reported in neuropsychological articles in the current study as compared to samples derived from Bezeau & Graves (2001; Journal of Clinical and Experimental Neuropsychology, Journal of the International Neuropsychological Society, and Neuropsychology), Schatz et al. (2005; Archives of Clinical Neuropsychology), and Woods et al. (2003; Archives of Clinical Neuropsychology, The Clinical Neuropsychologist, Journal of Clinical and Experimental Neuropsychology, Journal of the International Neuropsychological Society, and Neuropsychology).

A more complex picture emerges when we consider the depth and breadth of effect size reporting across different sections of neuropsychology journal articles. The highest levels of effect size reporting across all years were found in the Results section. Moreover, the effect size reporting improvements over the years were driven by large increases in the Results section. The biggest increase in effect size reporting in the Results section occurred around the time of the formal APA statistical report (Wilkinson & the Task Force on Statistical Inference, 1999), and these reporting rates were maintained over the next 15 years. Of course, the Results section is the traditional and expected location for reporting data on effect sizes, per the Journal Article Reporting Standards (JARS; Appelbaum et al., 2018) and the APA publication manual (2020). So in many ways, this finding makes good sense. Yet effect sizes are relevant and important beyond the Results section; they help the reader understand the prior literature, evaluate the study methods, and digest the conceptual and clinical relevance of the results (see Table 1).

Encouragingly, we also observed a medium-sized increase in effect size reporting in the Methods section for publication years 2015 and 2020. This is a promising, recent trend that perhaps reflects increased attention to describing the magnitude of important psychometric variables (e.g., internal consistency). Nevertheless, the frequency and magnitude of effect size reporting in the Method section of neuropsychological articles remains fairly modest. In 2020, fewer than two-thirds of the Method sections in published articles sampled from neuropsychology journals included effect sizes. Of those 2020 Method sections with any mention of an effect size, the median rating was 1 (range 1 to 3), suggesting that there is fairly minimal integration. Thus, there is still much room for improvement in the quality and quantity of effect size reporting in the Method section. Authors are encouraged to report exact effect size values, associated confidence intervals, and categorical interpretation (e.g., small, medium, large) of the reliability of their measures and procedures. Easily adopted examples include reporting sample-based (and literature-derived) indicators of internal consistency, test-retest reliability, and test validity (see Table 1). Effect sizes are also a welcome addition to the descriptive data and p-values that are often included in tables displaying the demographic and clinical features of the study groups.
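To make this recommendation concrete, the sketch below reports a Cohen’s d with an approximate 95% confidence interval and a conventional magnitude label for a hypothetical two-group comparison; the group statistics are invented, and the interval uses the common large-sample approximation rather than the exact noncentral-t method.

```python
# Hypothetical example of reporting an exact effect size, an approximate 95%
# confidence interval, and a categorical magnitude label, as recommended above.
import math

def cohens_d_with_ci(m1, s1, n1, m2, s2, n2):
    """Cohen's d with an approximate (large-sample) 95% confidence interval."""
    sp = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    d = (m1 - m2) / sp
    se = math.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
    return d, (d - 1.96 * se, d + 1.96 * se)

def label_d(d: float) -> str:
    a = abs(d)
    return "large" if a >= 0.8 else "medium" if a >= 0.5 else "small" if a >= 0.2 else "negligible"

# e.g., a hypothetical clinical group scoring lower than controls on a memory test
d, (lo, hi) = cohens_d_with_ci(m1=45.0, s1=10.0, n1=50, m2=52.0, s2=11.0, n2=50)
print(f"d = {d:.2f}, 95% CI [{lo:.2f}, {hi:.2f}] ({label_d(d)})")
```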

In contrast to the improvements in effect size reporting in the Method and Results sections of neuropsychology journal articles, there were no meaningful changes in effect size reporting in the Abstract, Introduction, and Discussion sections. Indeed, these three article sections showed only small and statistically null changes in effect size reporting between 1995 and 2020. For the Abstract and Introduction sections, the mean ratings were <1 across all years, and fewer than one-third of papers published in 2020 included effect sizes. These article sections are therefore clearly in need of better reporting and integration of effect sizes. In the Introduction, for example, integrating effect sizes from the extant literature can enhance the perceived rigor and persuasiveness of the study hypotheses. In our lab, this integrative process starts at the literature review stage, where we extract (and often calculate) effect sizes from the results of studies that are most directly related to our hypotheses. Consideration of effect sizes at this stage helps us to better understand the inevitable inconsistencies in results across the literature, which are common given the variability in design, sample size, and measurement within and across the relevant studies. For example, small effect sizes that can be detected in large, adequately powered samples are commonly dismissed as nonsignificant in smaller samples. Focusing on effect sizes can help minimize that noise, increase our understanding of the signal in the extant literature, and inform our sample size determination (Woods et al., 2006). It also allows us to more accurately and effectively articulate and interpret the potential clinical or conceptual relevance of our study.

Interpretation of the data on effect size reporting trends in the Discussion section of neuropsychology journal articles is a bit more nuanced. On one hand, over 90% of articles published in 2020 include at least some mention of effect sizes. On the other hand, the mean effect size ratings for the Discussion section have remained around 1 since 1995, indicating nominal reporting and minimal integration. Only 81 of 326 (24.8%) Discussion sections in this sample of articles were assigned a coding system score >1 (i.e., effect sizes are “…mentioned multiple times or given some interpretive weight.”) Oftentimes, articles that include effect sizes in a Results section will include those metrics in the Discussion, but only in a cursory manner. In our experience, it is not uncommon to encounter an opening paragraph in the Discussion that recaps the primary findings and the magnitude of the observed effects, but then abandons all further consideration of effect sizes. Therefore, there is much room for growth in the full integration of effect sizes in Discussion sections of neuropsychology journal articles. Drawing from the Introduction, authors can frame the observed effect sizes in the context of the effect sizes that were observed in the prior literature.

Effect sizes can also guide the digestion of the theoretical and clinical relevance of a set of findings in the Discussion (see Table 1). For instance, they help us better interpret both statistically significant findings (e.g., what is the clinical value of a significant effect that was observed in a very large sample at risk of Type I error?) and statistically null findings (e.g., perhaps a null finding from a small sample is actually quite comparable in magnitude to a significant finding from a much larger sample). In terms of clinical utility, describing the magnitude of effect sizes can help authors and readers understand the real-world implications of the observed findings. Oftentimes this will involve weighing the magnitude of an observed effect (e.g., small normative sex differences on a clinical task) against such factors as costs (e.g., time or other resources) and psychometrics (e.g., measurement error) in determining its incremental clinical value (e.g., enhanced classification accuracy; Holdnack et al., 2013). An illustration of such interpretive complexities can be found in our own prior work: In 2011, we reported an approximately 10% increase in the prevalence of impairment on traditionally “cortical” cognitive domains across two different treatment eras in very large samples of persons with HIV disease (N = 1,794), accompanied by a correspondingly small decrease in impairment on traditionally “subcortical” cognitive domains (Heaton et al., 2011). The publication of this numerically small change in cognitive profile sparked considerable interest in the idea that perhaps the neuropathogenesis of HIV-associated neurocognitive disorders had shifted in the modern treatment era (e.g., Tierney et al., 2018); however, the very modest nature of the changes in the profile distribution of impairment across multifactorial cognitive domains was ostensibly of only limited immediate diagnostic relevance to most practitioners.

Strengths of the current study include the use of a detailed and reliable effect size coding system, the breadth of sampling across time and neuropsychology journals, and the random selection of articles. This study also has several limitations that should be considered in interpreting the findings. First, these data represent a sample of the neuropsychological literature and should be treated as such; that is, not all issues from these six journals were included and not all neuropsychological journals were sampled (e.g., Neuropsychology Review, Applied Neuropsychology). In addition, neuropsychologists publish in other specialty journals (e.g., neurology, psychiatry, and disease-specific subspecialty journals). Second, we acknowledge that not all statistical inference tests require effect sizes and not all sections of all articles would be expected to receive the highest score on our effect size rating system. Third, we cannot determine whether the effect sizes that were reported in these articles were appropriate or sufficient for the statistical inference tests that were conducted. Relatedly, we cannot comment on the accuracy or reliability of the interpretation of the observed effect sizes across the literature, which is a complex, nuanced process that depends on the particulars of the study question, sample, methods, data, and analyses.

Despite these limitations, the findings from the current study may serve as a call to neuropsychology journal editors, reviewers, and authors to expand their reporting and integration of effect sizes beyond the Results sections of our manuscripts. Given the effectiveness of the 1999 APA guidelines in improving the simple reporting of effect sizes, the field might consider developing a more detailed, unified standard for reporting and interpreting effect sizes in all sections of neuropsychological articles. If adopted, such guidelines could facilitate the process of reviewing and digesting complex literatures for clinicians and researchers (e.g., by reducing reliance on less precise effect sizes that are computed from inferential statistics in meta-analyses). It will be important to consider the diversity of possible effect sizes, the inclusion of confidence intervals, and the potential value of detailed labeling standards (e.g., Funder & Ozer, 2019). Such efforts might go beyond broadly binned effect size labels (e.g., small, medium, and large) in favor of more context-driven interpretations (Zakzanis, 2001). For instance, traditionally binned effect size metrics might be quite relevant in the context of research that aims to inform theory; however, effect size reporting in applied clinical studies might instead rely on metrics that are more directly useful to individual patients, such as sample overlap (Zakzanis, 2001) or classification accuracy (Woods et al., 2003) values. If ultimately adopted by journal editors and reviewers, a unified standard could enhance the rigor and sophistication of neuropsychological research by increasing the frequency with which authors integrate effect sizes into the framing and interpretation of study findings. In particular, the systematic reporting and consideration of effect sizes in the Abstract could be especially powerful in establishing this practice as a cultural norm in the field, given that the Abstract is the most commonly read section of published manuscripts. Indeed, the JARS (Appelbaum et al., 2018) explicitly recommends including estimates of effect sizes in the Abstract. At the simplest level, one might report the exact effect size statistic associated with a primary finding (e.g., a between-group difference) in the Results section of a structured Abstract; a more extensive approach might involve leveraging effect sizes from foundational studies to help frame the current study in the Objective section and/or adding a magnitude descriptor of the observed effect (e.g., small) to the Discussion/Conclusions section of the Abstract.

We conclude by encouraging neuropsychological researchers and clinicians to think creatively about novel ways to generate and use effect sizes to improve our understanding of brain–behavior relationships and enhance our patients’ quality of life. As the data processing power of our field grows, so too does the importance of considering the conceptual and practical value of our findings. Such efforts might draw from well-established statistical approaches in general psychology that have had limited uptake in neuropsychological research; for example, Bayesian methods provide an alternative to NHST that might be used to quantify the effect sizes that underlie a given set of data vis-à-vis a particular base rate (e.g., Huygelier, Gillebert, & Moors, 2021). Our field has also had some success in adapting powerful statistical approaches from clinical psychology (e.g., reliable change indices; Chelune et al., 1993) and medicine (e.g., odds ratios; Bieliauskas et al., 1997) to help better understand the magnitude and clinical value of our findings. Indeed, framing our effect size refinement and development through the lens of evidence-based decision-making (e.g., Chelune, 2010; Smith, 2002) and precision health (Centers for Disease Control and Prevention, 2022) may be particularly fruitful. No doubt there are other innovative ways to measure the magnitude and value of neuropsychological data that are currently circulating among our colleagues in the allied health (e.g., pharmacy, nursing) and social (e.g., public health, economics) sciences and await our discovery.

Supplementary material

The supplementary material for this article can be found at https://doi.org/10.1017/S1355617723000127

Acknowledgments

The authors have no conflicts of interest to declare. The Department of Psychology at the University of Houston provided the core infrastructure support for this study. The authors thank Troy A. Webber, Ph.D., ABPP, for his helpful comments on this manuscript.

Funding statement

None.

Conflicts of interest

None.

References

American Psychological Association. (1996). Task force on statistical inference. https://www.apa.org/science/leadership/bsa/statistical/tfsi-initial-report.pdf
American Psychological Association. (2020). Publication manual (7th ed.). Washington, DC: American Psychological Association.
Appelbaum, M., Cooper, H., Kline, R. B., Mayo-Wilson, E., Nezu, A. M., & Rao, S. M. (2018). Journal article reporting standards for quantitative research in psychology: The APA Publications and Communications Board task force report. The American Psychologist, 73, 3–25. https://doi.org/10.1037/amp0000191
Bezeau, S., & Graves, R. (2001). Statistical power and effect sizes of clinical neuropsychology research. Journal of Clinical and Experimental Neuropsychology, 23, 399–406. https://doi.org/10.1076/jcen.23.3.399.1181
Bieliauskas, L. A., Fastenau, P. S., Lacy, M. A., & Roper, B. L. (1997). Use of the odds ratio to translate neuropsychological test scores into real-world outcomes: From statistical significance to clinical significance. Journal of Clinical and Experimental Neuropsychology, 19, 889–896. https://doi.org/10.1080/01688639708403769
Centers for Disease Control and Prevention (2022). Precision health: Improving health for each of us and all of us. https://www.cdc.gov/genomics/about/precision_med.htm
Chelune, G. J. (2010). Evidence-based research and practice in clinical neuropsychology. The Clinical Neuropsychologist, 24, 454–467. https://doi.org/10.1080/13854040802360574
Chelune, G. J., Naugle, R. I., Lüders, H., Sedlak, J., & Awad, I. A. (1993). Individual change after epilepsy surgery: Practice effects and base-rate information. Neuropsychology, 7, 41–52. https://doi.org/10.1037/0894-4105.7.1.41
Cicchetti, D. V. (1998). Role of null hypothesis significance testing (NHST) in the design of neuropsychologic research. Journal of Clinical and Experimental Neuropsychology, 20, 293–295. https://doi.org/10.1076/jcen.20.2.293.1165
Clark, C. M. (1999). Further considerations of null hypothesis testing. Journal of Clinical and Experimental Neuropsychology, 21, 283–284.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.
Cohen, J. (1994). The Earth is round (p < .05). American Psychologist, 49, 997–1003.
Demakis, G. J. (2006). Meta-analysis in neuropsychology: Basic approaches, findings, and applications. The Clinical Neuropsychologist, 20, 10–26. https://doi.org/10.1080/13854040500203282
Donders, J. (2000). From null hypothesis to clinical significance. Journal of Clinical and Experimental Neuropsychology, 22, 265–266. https://doi.org/10.1076/1380-3395(200004)22:2;1-1;FT265
Fisher, R. A. (1925). Statistical methods for research workers. London: Oliver & Boyd.
Fritz, A., Scherndl, T., & Kühberger, A. (2012). A comprehensive review of reporting practices in psychological journals: Are effect sizes really enough? Theory & Psychology, 23, 98–122. https://doi.org/10.1177/0959354312436870
Funder, D. C., & Ozer, D. J. (2019). Evaluating effect size in psychological research: Sense and nonsense. Advances in Methods and Practices in Psychological Science, 2, 156–168. https://doi.org/10.1177/2515245919847202
Heaton, R. K., Franklin, D. R., Ellis, R. J., McCutchan, J. A., Letendre, S. L., Leblanc, S., Corkran, S. H., Duarte, N. A., Clifford, D. B., Woods, S. P., Collier, A. C., Marra, C. M., Morgello, S., Mindt, M. R., Taylor, M. J., Marcotte, T. D., Atkinson, J. H., Wolfson, T., Gelman, B. B., McArthur, J. C., & The HNRC Group (2011). HIV-associated neurocognitive disorders before and during the era of combination antiretroviral therapy: Differences in rates, nature, and predictors. Journal of Neurovirology, 17, 3–16. https://doi.org/10.1007/s13365-010-0006-1
Holdnack, J. A., Drozdick, L. W., Weiss, L. G., & Iverson, G. L. (2013). WAIS-IV, WMS-IV, and ACS: Advanced clinical interpretation. Elsevier.
Huettner, M. I., Rosenthal, B. L., & Hynd, G. W. (1989). Regional cerebral blood flow (rCBF) in normal readers: Bilateral activation with narrative text. Archives of Clinical Neuropsychology, 4, 71–78. https://doi.org/10.1093/arclin/4.1.71
Huygelier, H., Gillebert, C. R., & Moors, P. (2021). The value of Bayesian methods for accurate and efficient neuropsychological assessment. Journal of the International Neuropsychological Society. https://doi.org/10.1017/S1355617721001120
Kirk, R. E. (1996). Practical significance: A concept whose time has come. Educational and Psychological Measurement, 56, 746–759. https://doi.org/10.1177/0013164496056005002
Millis, S. (2003). Statistical practices: The seven deadly sins. Child Neuropsychology, 9, 221–233. https://doi.org/10.1076/chin.9.3.221.16455
Peng, C. Y. J., Chen, L. T., Chiang, H. M., & Chiang, Y. C. (2013). The impact of APA and AERA guidelines on effect size reporting. Educational Psychology Review, 25, 157–209. https://doi.org/10.1007/s10648-013-9218-2
Reitan, R. M. (1955). The relation of the trail making test to organic brain damage. Journal of Consulting Psychology, 19, 393–394. https://doi.org/10.1037/h0044509
Schatz, P., Jay, K. A., McComb, J., & McLaughlin, J. R. (2005). Misuse of statistical tests in Archives of Clinical Neuropsychology publications. Archives of Clinical Neuropsychology, 20, 1053–1059. https://doi.org/10.1016/j.acn.2005.06.006
Smith, G. E. (2002). What is the outcome we seek? A commentary on Keith et al. (2002). Neuropsychology, 16, 432–433. https://doi.org/10.1037/0894-4105.16.3.432
Sun, S., Pan, W., & Wang, L. L. (2010). A comprehensive review of effect size reporting and interpreting practices in academic journals in education and psychology. Journal of Educational Psychology, 102, 989–1004. https://doi.org/10.1037/a0019507
Sweet, J. J., Meyer, D. G., Nelson, N. W., & Moberg, P. J. (2011). The TCN/AACN 2010 “salary survey”: Professional practices, beliefs, and incomes of US neuropsychologists. The Clinical Neuropsychologist, 25, 12–61. https://doi.org/10.1080/13854046.2010.544165
Tierney, S., Woods, S. P., Verduzco, M., Beltran, J., Massman, P. J., & Hasbun, R. (2018). Semantic memory in HIV-associated neurocognitive disorders: An evaluation of the “cortical” versus “subcortical” hypothesis. Archives of Clinical Neuropsychology, 33, 406–416. https://doi.org/10.1093/arclin/acx083
Wilkinson, L., & The Task Force on Statistical Inference (1999). Statistical methods in psychology journals: Guidelines and explanations. The American Psychologist, 54, 594–604. https://doi.org/10.1037/0003-066X.54.8.594
Woods, S. P., Rippeth, J. D., Conover, E., Carey, C. L., Parsons, T. D., & Tröster, A. I. (2006). Statistical power of studies examining the cognitive effects of subthalamic nucleus deep brain stimulation in Parkinson’s disease. The Clinical Neuropsychologist, 20, 27–38. https://doi.org/10.1080/13854040500203290
Woods, S. P., Weinborn, M., & Lovejoy, D. W. (2003). Are classification accuracy statistics underused in neuropsychological research? Journal of Clinical and Experimental Neuropsychology, 25, 431–439. https://doi.org/10.1076/jcen.25.3.431.13800
Yates, F. (1951). The influence of “Statistical methods for research workers” on the development of the science of statistics. Journal of the American Statistical Association, 46, 19–34. https://doi.org/10.1080/01621459.1951.10500764
Zakzanis, K. K. (2001). Statistics to tell the truth, the whole truth, and nothing but the truth: Formulae, illustrative numerical examples, and heuristic interpretation of effect size analyses for neuropsychological researchers. Archives of Clinical Neuropsychology, 16, 653–667. https://doi.org/10.1093/arclin/16.7.653
