Foster et al.’s (2024) focal article provides a bracing reminder of a central tenet of applied psychology. Individual difference traits, such as general mental ability (GMA) or personality, and trait measurement methods, such as interviews or selection tests, are consistently related to on-the-job performance. We take issue with Foster et al. (2024) because some of the information they offer is limiting and, in our opinion, inaccurate in the service of expediency. This is particularly the case in their table, which attempts to assemble a range of predictor and job performance correlations for use as a handy reference. However, their table and other parts of their argumentation omit consideration of multimethod measurement of performance and the sources of multirater variance on which organizations should focus.
The importance of multimethod criteria: moving beyond sole reliance on supervisory ratings
Foster et al. (2024) write that supervisory ratings of job performance “are widely considered the primary criterion” for personnel selection measures (p. 1). This is not the case, nor should it be. Selection research has a rich history with other criterion measures, and multimethod approaches to performance measurement are preferable. Hunter’s (1983b) original validity generalization study included both training criteria (typically measured using knowledge tests) and job performance criteria (typically measured using supervisory ratings). Later meta-analytic summaries list validities separately for these two criteria (Schmidt, 2013; Schmidt & Hunter, 1998). Many entry-level jobs have a formal training period, and training performance is a critical criterion. Indeed, the U.S. military uses training as a criterion for military selection tests (Brown et al., 2006).
As Hunter (1983b) mentioned, training performance is a measure of job knowledge, which is another key criterion for validation studies, including the U.S. Army’s Project A (McHenry et al., 1990) and civilian federal government studies (Paullin et al., 2010; Schmidt et al., 1986; van Rijn & Payne, 1980). Work sample measures of job performance, including hands-on-performance tests (HOPTs) and low-fidelity job simulations, are also often used as criteria. Work sample measures were used as criteria for General Aptitude Test Battery validation studies (Salgado & Moscoso, 2019). Cucina et al. (2024) meta-analyzed Armed Services Vocational Aptitude Battery validities using HOPTs, which were lauded as the “gold standard” criterion for job performance (Abrahams et al., 2015, p. 45). Low-fidelity job simulations, such as walk-through performance tests in which incumbents walk trained raters through how they would perform job tasks (Ree et al., 1994) or paper-and-pencil work simulations (Hayes et al., 2002), are also excellent criteria. In short, criteria beyond supervisory ratings have validity evidence and should be considered when evaluating how well selection tests work. Causal path modeling involving GMA as a predictor indicates that supervisory ratings are a distal outcome, whereas job knowledge and objective measures of task performance (e.g., work samples, HOPTs) are more proximal criteria (Hunter, 1983a, 1986). In fact, the relationship between GMA and supervisory ratings is entirely mediated by the proximal criteria.
Other recent reviews have also, incorrectly, focused on supervisory ratings as the sole criterion measure (Sackett et al., 2022). Perhaps supervisory ratings provide an aura of an independent and external third-party outcome. After all, from a naïve perspective, who would better know an employee’s performance than their supervisor? Yet many issues and biases can affect supervisory ratings, including opportunity to observe (MacLane et al., 2020), the use or nonuse of behaviorally anchored rating scales (BARS), frame-of-reference training, leader–employee relationships (e.g., leader–member exchange; Martin et al., 2016), data collection procedures and proctoring (Grubb, 2011), rating adjustment policies (Al Ali et al., 2012), and social and goal-related issues (Murphy & Cleveland, 1995). Foster et al. (2024) offer suggestions for reducing some biases, such as statistically eliminating leniency and strictness. However, these methods can have measurement side effects. If BARS are used, statistically adjusting the ratings may remove the link between the numerical ratings and the behavioral anchors.
From a practical perspective, supervisory ratings for research purposes can be collected easily and cheaply (e.g., using an unproctored online survey platform), and archival administrative ratings may be available. However, sometimes one gets what one pays for. Administrative ratings often lack variance compared to research-based ratings. This was noted in a U.S. General Accounting Office study of the promotions process for special agents at the Drug Enforcement Administration, in which supervisors rated the job performance of candidates for promotion to supervisory positions. The U.S. General Accounting Office (2003, p. 27) reported that the average ratings were “uniformly exceptional—almost a perfect 5,” which calls into question the “critical importance in other HR decisions, such as promotions” of supervisory job performance ratings (Foster et al., 2024, p. 5). Research-based ratings can also suffer from quality issues (Grubb, 2011).
The importance of multirater criteria: measuring ratee variance shared across supervisors
Foster et al. (2024) hypothesize that criterion-related validity will improve by predicting a supervisor’s unique perspective on an employee’s performance (e.g., after removing variance in performance ratings shared with other supervisors). They suggest “re-establishing the importance (i.e., the weights) that individual supervisors give to the different performance dimensions” (p. 13) and matching applicants to supervisors based on this. At one point, they almost go as far as to equate “individual supervisors” and “sole proprietors” (p. 14).
Here we present an alternative view. From the organization’s perspective, it is primarily desirable to predict the ratee variance shared across supervisors, not the ratee × rater variance. Organizations typically hire I-O psychologists to work for the organization’s benefit rather than for an individual supervisor. Aspects of performance that all supervisors agree upon should be the criterion of interest rather than idiosyncratic viewpoints of performance from a particular supervisor. Matrixed teams are common in many organizations, with employees reporting to multiple supervisors, and it is rarely the case that employees are selected to work for only one supervisor throughout their tenure. For frontline and hourly positions involving shiftwork, employees may report to multiple supervisors in one shift or to different ones on different days. Supervisors and employees also transfer to other positions within an organization as their careers progress. Thus, organizations are typically interested in hiring the best employees for the organization (i.e., those who will be viewed by multiple supervisors as high performers) rather than staffing fiefdoms for individual supervisors filled with employees having the supervisors’ pet competencies or subject to idiosyncratic supervisor weighting of competencies. Asking individual supervisors to weight the importance of different competencies for one vacancy for a larger job in an organization negates the critical role of conducting a thorough job analysis. This is tantamount to conducting separate job analyses for each supervisor with a sample size of one, focusing explicitly on individual differences in supervisory job analysis ratings and using that disagreement (or error) to improve validity. This approach could introduce any number of biases into selection systems, such as the similar-to-me bias or self-serving bias (Cucina et al., 2012).
Collecting, maintaining, and using these data will prove challenging in practice. Rather than having one test for a larger job (e.g., cashier), an organization would need to have separate tests or cutoff scores for each vacant position for that job, and these would need to be tied to individual supervisors. Different selection standards would be applied for the same job title within the same organization, and applicants may view this as unfair. It also presents business necessity and job-relatedness issues if a particular supervisor’s selection system yields adverse impact whereas another’s does not.
We sympathize with the motivation to maximize criterion-related validity coefficients. However, fully embracing Foster et al.’s (2024) methodology could lead to an instance of Kerr’s (1975) folly, whereby we reward ourselves for increased validity coefficients while hoping for an increase in job performance and utility that does not actually materialize. Instead, we recommend matching employees to the job via selection systems that are validated to predict ratee variance in job performance ratings that is shared across multiple raters, not idiosyncratic ratee × rater variance, and that are based on a job analysis for the job itself rather than a single supervisor’s job analysis ratings. Doing so leads to predictors that are standardized and that have improved construct validity. Furthermore, path analytical research indicates that the ratee variance that is shared across supervisors largely depends on job knowledge and task performance, which should also be considered as criteria in validation efforts.
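The distinction between ratee variance and ratee × rater variance can be made concrete with a small simulation. The sketch below is illustrative only and is not drawn from Foster et al. or from any study cited here: it assumes a fully crossed design in which every rater rates every ratee once, uses hypothetical variance values, and estimates the components from two-way ANOVA mean squares. With one rating per cell, the ratee × rater interaction cannot be separated from random error, so the two are reported together.

```python
import random


def variance_components(ratings):
    """Estimate variance components for a fully crossed ratee x rater design.

    ratings: list of rows (one per ratee), each holding one rating per rater.
    Returns estimates for ratee, rater, and residual variance, where the
    residual confounds the ratee x rater interaction with random error.
    """
    n_ratees = len(ratings)
    n_raters = len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n_ratees * n_raters)
    ratee_means = [sum(row) / n_raters for row in ratings]
    rater_means = [
        sum(row[j] for row in ratings) / n_ratees for j in range(n_raters)
    ]

    # Mean squares from the two-way ANOVA without replication.
    ms_ratee = (
        n_raters * sum((m - grand) ** 2 for m in ratee_means) / (n_ratees - 1)
    )
    ms_rater = (
        n_ratees * sum((m - grand) ** 2 for m in rater_means) / (n_raters - 1)
    )
    ss_resid = sum(
        (ratings[i][j] - ratee_means[i] - rater_means[j] + grand) ** 2
        for i in range(n_ratees)
        for j in range(n_raters)
    )
    ms_resid = ss_resid / ((n_ratees - 1) * (n_raters - 1))

    # Expected-mean-square solutions; negative estimates are set to zero.
    return {
        "ratee": max((ms_ratee - ms_resid) / n_raters, 0.0),
        "rater": max((ms_rater - ms_resid) / n_ratees, 0.0),
        "ratee_x_rater_plus_error": ms_resid,
    }


def simulate_ratings(n_ratees=2000, n_raters=4, sd_ratee=1.0,
                     sd_rater=0.5, sd_resid=0.8, seed=0):
    """Generate hypothetical ratings: mean + ratee + rater + residual."""
    rng = random.Random(seed)
    ratee_effects = [rng.gauss(0, sd_ratee) for _ in range(n_ratees)]
    rater_effects = [rng.gauss(0, sd_rater) for _ in range(n_raters)]
    return [
        [3.0 + p + r + rng.gauss(0, sd_resid) for r in rater_effects]
        for p in ratee_effects
    ]


components = variance_components(simulate_ratings())
for name, value in components.items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions, the "ratee" component is the variance that all raters share about a given employee, which is the part we argue a selection system should be validated against. The rater component reflects leniency or strictness, and with only a handful of raters its estimate will be imprecise.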
A call for a comprehensive meta-analytic intercorrelation table
The second problem perpetuated by Foster et al. (2024) is the presentation of a meta-analytic intercorrelation table of predictor traits and methods. We understand the goals of providing this table. However, the complexities of putting such a table together are significant because the meta-analysis literature contains differing levels of correction for measurement error, differing levels of correction for range restriction, different predictor reliabilities and construct validities, and different research populations. Additionally, some selection measures, such as interviews, are methods combining different sources of true score variance rather than tests of traits. Ideally, such a table would include the full range of predictors covered by Schmidt and Hunter (1998) and Schmidt (2013). It would also include corrections for direct (Case II) and indirect (Case IV) range restriction in both the criterion-related validities and predictor correlations, in addition to the observed correlations that Foster et al. (2024) presented. Correlations with other types of criteria besides supervisory ratings are needed, as is incorporation of moderators (e.g., job complexity). Examples of meta-analytic tables with consistent corrections include O’Boyle et al. (2011) and Schmidt and Hunter (1998).
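To show why inconsistent corrections make table entries hard to compare, the following minimal sketch applies two standard psychometric corrections to a single observed validity: the correction for criterion unreliability and Thorndike's Case II correction for direct range restriction. The function names and all input values are hypothetical and chosen only for illustration, and the Case IV correction for indirect range restriction is not implemented here.

```python
import math


def correct_for_criterion_unreliability(r_observed, criterion_reliability):
    """Disattenuate a validity coefficient for unreliability in the criterion."""
    return r_observed / math.sqrt(criterion_reliability)


def correct_case_ii(r_restricted, sd_ratio):
    """Thorndike Case II correction for direct range restriction.

    sd_ratio is the unrestricted predictor SD divided by the restricted
    predictor SD, so a value of 1.0 means no restriction.
    """
    numerator = sd_ratio * r_restricted
    denominator = math.sqrt(1 + r_restricted ** 2 * (sd_ratio ** 2 - 1))
    return numerator / denominator


# Hypothetical inputs, not taken from any study cited in this commentary.
r_obs = 0.25
r_yy = 0.52       # assumed criterion reliability
sd_ratio = 1.5    # assumed unrestricted SD / restricted SD

reliability_only = correct_for_criterion_unreliability(r_obs, r_yy)
restriction_only = correct_case_ii(r_obs, sd_ratio)
both = correct_case_ii(reliability_only, sd_ratio)

print(f"observed:                 {r_obs:.3f}")
print(f"criterion unreliability:  {reliability_only:.3f}")
print(f"range restriction only:   {restriction_only:.3f}")
print(f"both corrections:         {both:.3f}")
```

The same observed correlation yields four different values depending on which corrections are applied, so two predictors can appear to differ in validity when the difference lies only in how their source meta-analyses handled artifacts.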
Foster et al. (2024) start an important conversation. What is needed now is a table with a broader range of predictors and criteria and consistent degrees of correction for biases and error. Presenting results as validity coefficients (r) instead of squared validities (r²) would avoid the issues described by Funder and Ozer (2019) and provide a metric linearly related to utility. Finally, a greater realization that supervisory ratings include multiple sources of variance could lead to better validation studies that describe validity for the work role rather than for a single rater. Unfortunately, a table such as the one that Foster et al. (2024) present results in unfair comparisons (Cooper & Richardson, 1986) of apples and potatoes.
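The point that r, rather than r², is linearly related to utility can be illustrated with a short calculation. The sketch below assumes the Brogden-Cronbach-Gleser utility model, which this commentary does not name explicitly, and every input value is hypothetical.

```python
def utility_gain(validity, n_hired, tenure_years, sd_performance_dollars,
                 mean_z_of_hired, total_cost=0.0):
    """Estimated dollar gain from selection under the assumed
    Brogden-Cronbach-Gleser model: N * T * r * SDy * mean z - cost."""
    return (n_hired * tenure_years * validity * sd_performance_dollars
            * mean_z_of_hired - total_cost)


# Hypothetical scenario: 100 hires, 2 years average tenure, a performance
# SD worth $10,000 per year, and hires averaging 0.8 SD on the predictor.
scenario = dict(n_hired=100, tenure_years=2.0,
                sd_performance_dollars=10_000.0, mean_z_of_hired=0.8)

for r in (0.2, 0.4):
    gain = utility_gain(r, **scenario)
    print(f"r = {r:.1f}, r^2 = {r ** 2:.2f}, estimated gain = {gain:,.0f}")
```

Under this model, doubling r from .20 to .40 doubles the estimated gain before costs, whereas r² quadruples from .04 to .16, so squared validities misstate the practical difference between predictors.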