In this paper, we focus on the lowered validity estimates for general mental ability (GMA) tests by presenting: (a) a history of the range restriction correction controversy; (b) a review of validity evidence using various criteria; and (c) multiple paradoxes that arise if the lower GMA validity estimate is accepted.
How did we get here? History of the range restriction correction controversy
Sackett et al. (2022, 2023) revisit an old issue with the General Aptitude Test Battery (GATB) studies underlying Schmidt and Hunter’s (1998) .51 validity estimate. The GATB was a career guidance tool used by state unemployment offices to refer the unemployed to employers. Unlike typical selection settings, employers were not recruiting and selecting applicants based on GATB scores, and there were no job-specific local applicant pools. Use of a normative SD from the entire workforce was tenable for the GATB because unemployed jobseekers, who were recently laid off, likely represented the U.S. workforce. When career counselors used the GATB to help people choose jobs, the “applicant pool” was the U.S. workforce and people wishing to enter it.
Sackett et al. (2022) critique Hunter’s (1983a) use of (a) a national workforce SD instead of local, job-specific applicant SDs and (b) the corrections applied to predictive versus concurrent studies. They state these are “previously unnoticed flaws,” but a National Academy of Sciences report covered the SD issue (Hartigan & Wigdor, 1989, pp. 166–167). Later studies show that using job-specific applicant pools rather than national norms only slightly lowers applicant pool SDs (Lang et al., 2010; Sackett & Ostgaard, 1994). Schmidt et al. (2007) discussed this and cited Pearlman et al.’s (1980) finding of similar observed validities for predictive (r = .23) and concurrent (r = .21) studies, suggesting similar range restriction for both. Sackett et al. (2022) also asserted that Hunter’s (1983a) u_x value of .67 was implausibly low, but no solid explanation (e.g., educational requirements or self-sorting) exists for why a u_x of .67 occurred. We reviewed the jobs from the GATB studies and concluded that most were entry-level; only 4% of the studies involved jobs requiring a college or advanced degree. Because people cannot estimate their GMA well (Freund & Kasten, 2012), it is unlikely they would self-sort into jobs based on GMA.
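To make the stakes of the u_x dispute concrete, the sketch below applies the standard univariate (Thorndike Case II) formula for direct range restriction at several u values. The observed validity of .25 and the comparison u values are illustrative assumptions only; this is not Hunter’s full artifact-distribution procedure, nor are these figures from any particular GATB study.

```python
import math

def case_ii_correction(r_obs: float, u: float) -> float:
    """Thorndike Case II correction for direct range restriction.

    r_obs: observed (range-restricted) predictor-criterion correlation.
    u:     SD of the restricted group divided by SD of the reference population.
    """
    return r_obs / math.sqrt(u**2 + r_obs**2 * (1.0 - u**2))

# Illustrative values only: the corrected validity rises as u shrinks,
# which is why the choice of reference-population SD matters so much.
r_obs = 0.25
for u in (1.00, 0.90, 0.80, 0.67):
    print(f"u = {u:.2f} -> corrected r = {case_ii_correction(r_obs, u):.3f}")
```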
Overlooked validity evidence from other job performance criteria
Sackett et al.’s (2022, 2023) validity estimates focus on supervisory ratings, which may be deficient (SIOP Principles, 2018) because they may not reflect employees’ job knowledge. One cannot dismiss GMA as a predictor based on “low” correlations with supervisory ratings without considering the criterion deficiency of not capturing the effect of GMA on the acquisition and transfer of job knowledge. Hunter (1983b) demonstrated that job knowledge was the best predictor of job performance. Huang et al.’s (2015) meta-analysis showed that GMA is the best predictor of transfer of training, specifically for maximal performance. This raises a question that Sackett et al. (2023) do not address: what is the construct validity of supervisor ratings as a criterion? Fortunately, data on objective criterion measures are available in validity studies. In Table 1, we show Hayes et al.’s (2003) meta-analytic correlations indicating that supervisory ratings are somewhat distinct from work simulations and training performance. Hayes et al. (2003; Hayes & Reilly, 2002) also meta-analyzed the criterion-related validity of reasoning tests. As shown in Table 2, their validities for supervisory ratings were similar to Sackett et al.’s (2022, 2023). However, validities were much higher when work simulations (i.e., low-fidelity simulations of task performance and procedural knowledge) were used as criteria.
Note. Workplace criterion measures include scores in training, scores on work simulations, and supervisor ratings. Values along the diagonal are generally accepted default reliability estimates for each workplace measure. Values below the reliability diagonal are uncorrected zero-order correlations between the workplace measures. Correlations above the reliability diagonal are corrected for range restriction (u = .8072) and measurement error. For the observed correlation between training performance and work simulation performance, k = 4, n = 1,743, SD = .0799; for the observed correlation between training performance and supervisor ratings, k = 5, n = 2,003, SD = .0813; and for the observed correlation between work simulation performance and supervisor ratings, k = 6, n = 2,991, SD = .0689.
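As a rough illustration of the kind of double correction described in this note, the sketch below first disattenuates an observed correlation for measurement error in both measures and then applies a univariate range restriction correction. The observed correlation and reliability values are placeholders rather than the Table 1 figures, and the exact correction sequence used for Table 1 may differ.

```python
import math

def disattenuate(r_obs: float, rel_x: float, rel_y: float) -> float:
    """Correct an observed correlation for measurement error in both measures."""
    return r_obs / math.sqrt(rel_x * rel_y)

def case_ii(r: float, u: float) -> float:
    """Univariate range restriction correction (u = restricted SD / unrestricted SD)."""
    return r / math.sqrt(u**2 + r**2 * (1.0 - u**2))

# Placeholder inputs, not the Table 1 values: an observed criterion
# intercorrelation of .40, reliabilities of .80 and .60, and u = .8072.
r_obs, rel_x, rel_y, u = 0.40, 0.80, 0.60, 0.8072
r_corrected = case_ii(disattenuate(r_obs, rel_x, rel_y), u)
print(f"corrected r = {r_corrected:.3f}")
```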
Note. All predictors were measures of reasoning (e.g., nonverbal reasoning). LBM = logic-based measurement, tests developed using an established logical framework for reasoning items (Simpson et al., 2007); k = number of studies; N = combined sample size.
McHenry et al. (1990) reported similar results from the U.S. Army’s Project A. GMA predicted a supervisory/peer job performance ratings method factor with a validity of .15 corrected for range restriction (but not criterion unreliability). GMA predicted general soldiering proficiency with a validity of .65 corrected for range restriction (.47 uncorrected) and core technical proficiency with a validity of .63 corrected for range restriction (.43 uncorrected). These two criteria were combinations of job and training knowledge test scores, supervisory and peer ratings, and hands-on performance test (HOPT) scores.
Cucina et al. (2023a) meta-analyzed GMA’s prediction of HOPTs using data from four U.S. military branches. Multivariate range restriction corrections were applied in this meta-analytic database and in the Project A database, a correction endorsed by a National Academy of Sciences review (Wigdor & Green, 1991) and corroborated by Held et al. (2015) and Cucina et al. (2023a). Cucina et al. (2023a) found an operational validity of .44 for the Armed Forces Qualifying Test and .55 for aptitude indices (i.e., linear combinations of cognitive tests).
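For readers unfamiliar with multivariate range restriction corrections, the sketch below gives the general Lawley-type formulas in matrix form. It is a minimal illustration under standard linearity and homoscedasticity assumptions, not the specific implementation used in the cited military databases, and the input matrices are placeholders the analyst would supply.

```python
import numpy as np

def lawley_correction(V_xx, V_xy, V_yy, S_xx):
    """Multivariate (Lawley-type) correction for range restriction.

    V_xx: restricted covariance matrix of the explicitly selected variables x.
    V_xy: restricted covariances between x and the incidentally restricted variables y.
    V_yy: restricted covariance matrix of y.
    S_xx: unrestricted (population) covariance matrix of x.

    Returns estimates of the unrestricted covariances (S_xy, S_yy), assuming the
    y-on-x regressions are linear and homoscedastic across the selection.
    """
    W = np.linalg.solve(V_xx, V_xy)        # regression weights, invariant under selection on x
    S_xy = S_xx @ W                        # corrected x-y covariances
    S_yy = V_yy + W.T @ (S_xx - V_xx) @ W  # corrected y covariances
    return S_xy, S_yy

# The corrected correlation between predictor i (in x) and criterion j (in y) is then
# S_xy[i, j] / np.sqrt(S_xx[i, i] * S_yy[j, j]).
```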
Hunter (1983b) meta-analyzed the validity of GMA for supervisory ratings and HOPTs using non-GATB data. Using his data, we found weighted average validities (corrected for criterion unreliability but not range restriction) of .49 (n = 3,281; k = 11) using HOPTs and .27 (n = 3,605; k = 12) using supervisory ratings. Ree et al. (1994) reported a .42 meta-analytic validity, corrected for range restriction, for GMA with interview-based job performance measures.
Sackett et al. (2023) critiqued the use of HOPTs as a criterion, stating that they are measures of maximal, not typical, performance. It is unclear whether this is true or why it would be a problem. Validity study participants are often told that their data will be used only for research purposes, which may reduce their motivation levels from maximal to typical. McHenry et al. (1990) reported that temperament/personality had validities of .26 for core technical proficiency and .25 for general soldiering proficiency. These validities are typical for personality and are higher than the .18 validity obtained using the supervisory/peer ratings method factor criterion. This suggests that objective performance measures are not entirely measures of maximal performance. Further, work simulations/HOPTs are measures of core task proficiency, which is an important criterion.
Conceptual paradoxes for GMA research findings
Accepting a lowered GMA validity estimate leads to five empirical paradoxes.
Paradox 1: GMA predicts firearms proficiency better than it does overall job performance
There are numerous cognitive decisions and activities associated with job performance. Hunt and Madhyastha (2012) identified a large GMA-based factor in job analysis data from O*NET. A job analysis of 105 Federal government jobs identified 42 core tasks, and the competency with the strongest linkage to those tasks was reasoning, which Carroll (1993, p. 196) stated is “at or near the core of what is ordinarily meant by intelligence” (Pollack et al., 1999; Simpson et al., 2007). Yet, in Cucina et al.’s (2023b) largest dataset, reasoning had an operational validity of .268 (n = 14,892) in predicting how well individuals could aim and shoot a handgun at a stationary target. It is paradoxical that GMA had higher validity for a largely psychomotor task than for overall job performance, if supervisory ratings are all that matter as a measure of performance.
Paradox 2: Similarly corrected GMA validities for training performance are corroborated
Schmidt and Hunter’s (1998) .56 validity estimate for GMA tests with training performance used the GATB studies and the same range restriction correction that Sackett et al. (2022, 2023) critiqued. However, studies using other correction procedures yielded similar validities, including the results in Table 2. Brown et al. (2006) reported a .546 validity for GMA across 10 Navy training schools (n = 26,097). Welsh et al. (1990, p. 36) reported a range restriction corrected validity of .44 for the AFQT with final school grades (n = 224,048).
Paradox 3: There is no free lunch—GMA test proxies can be g-loaded
Sackett et al. (2023) report a .40 validity for job knowledge tests compared to the .23 validity for GMA tests. However, job knowledge tests are g-loaded. Using Hunter’s (1983b) GMA-job knowledge correlations, we computed an average correlation of .50 (n = 3,372; k = 11). A correlation of this size is typical of the intercorrelations among ASVAB and GATB subtests. Paradoxically, an employer eschewing GMA tests in favor of job knowledge tests is unwittingly testing applicants’ GMA.
Paradox 4: GMA’s validity is decreasing yet job complexity is increasing
The validity of GMA tests has purportedly decreased from .51 to .31 to .23. Many of the jobs in the GATB validity studies were manufacturing and medium-complexity jobs. The number of U.S. employees in manufacturing has decreased significantly since the 1970s (Gascon, 2022). Today’s U.S. economy is more focused on knowledge work, and there is increased use of technology in blue-collar jobs. This should lead to higher job complexity and higher GMA validities because job complexity moderates GMA’s validity (Schmidt & Hunter, 1998). It is paradoxical that GMA’s estimated validity is decreasing while job complexity in the U.S. is increasing.
Paradox 5: The folly of selecting for contextual performance but expecting proficiency
Sackett et al. (2023) state that contextual criteria are not as predictable by GMA as are task-based criteria. Although supervisory ratings are easily obtained and provide an aura of independent authoritative judgment, they are clouded by social/organizational factors (Murphy & Cleveland, 1995), impacted by supervisors’ opportunity to observe performance (MacLane et al., 2020), and biased by other factors (e.g., subjectivity, criterion deficiency, poorly developed rating scales; Courtney-Hays et al., 2011). Practices in how ratings are collected impact criterion-related validity (Grubb, 2011). Sackett et al. (2023) caution against comparing meta-analytic validities because “we do not have clear understanding of the specific components underlying performance ratings”; this is good advice when selecting for “non-cognitive skills” but paradoxical (or worse) when expecting people to be trainable and develop task proficiency.
Conclusion
Criticisms of the range restriction correction procedure for the GATB validity studies are not new. The correction procedure could be tenable for the original use of GATB scores, which was to predict how well individuals representative of the U.S. workforce would perform in different jobs. Making an inferential leap to local selection settings may result in lower validities when supervisory ratings serve as criteria; however, validities using objective job performance measures (i.e., work samples, work simulations, and HOPTs) are still near .51.