
The InterModel Vigorish as a Lens for Understanding (and Quantifying) the Value of Item Response Models for Dichotomously Coded Items

Published online by Cambridge University Press:  01 January 2025

Benjamin W. Domingue*
Affiliation:
Stanford University
Klint Kanopka
Affiliation:
Stanford University
Radhika Kapoor
Affiliation:
Stanford University
Steffi Pohl
Affiliation:
Freie Universität Berlin
R. Philip Chalmers
Affiliation:
York University
Charles Rahal
Affiliation:
University of Oxford
Mijke Rhemtulla
Affiliation:
University of California, Davis
*Correspondence should be made to Benjamin W. Domingue, Graduate School of Education, Stanford University, Santa Clara, USA. Email: [email protected]

Abstract

The deployment of statistical models—such as those used in item response theory—necessitates the use of indices that are informative about the degree to which a given model is appropriate for a specific data context. We introduce the InterModel Vigorish (IMV) as an index that can be used to quantify accuracy for models of dichotomous item responses based on the improvement across two sets of predictions (i.e., predictions from two item response models or predictions from a single such model relative to prediction based on the mean). This index has a range of desirable features: It can be used for the comparison of non-nested models and its values are highly portable and generalizable. We use this fact to compare predictive performance across a variety of simulated data contexts and also demonstrate qualitative differences in behavior between the IMV and other common indices (e.g., the AIC and RMSEA). We also illustrate the utility of the IMV in empirical applications with data from 89 dichotomous item response datasets. These empirical applications help illustrate how the IMV can be used in practice and substantiate our claims regarding various aspects of model performance. These findings indicate that the IMV may be a useful indicator in psychometrics, especially as it allows for easy comparison of predictions across a variety of contexts.

Type
Theory & Methods
Copyright
Copyright © 2024 The Author(s), under exclusive licence to The Psychometric Society

1. Introduction

The utilization of statistical models for item responses gathered in psychological assessments necessitates tools that describe their relative performance in a given scenario. A wide variety of indices for quantifying the quality of models for item responses exists (for a recent review, see Chapters 17–20 in Van der Linden, 2017a). However, while there are many popular approaches, they tend to have limitations. In particular, many indices do not have values that readily generalize across samples. In some cases, this can be due to a dependency on sample size. In other cases, it can be due to a poorly understood sensitivity to item difficulty. Further, such indices may not be readily applicable in scenarios wherein interest is in out-of-sample prediction, an important shortcoming given the increased relevance of prediction in many settings (Rahal et al., 2022; Watts et al., 2018; Hofman et al., 2021).

When using predictive models for dichotomous item responses, a metric that is portable (i.e., its values can be consistently and meaningfully interpreted to evaluate the predictive value of different models for item responses) would be valuable as it would allow for the comparison of a variety of modeling choices in various data contexts. We also distinguish between portability and the metric’s sensitivity to changes in predictive accuracy. The metric needs to be sensitive to factors that change the accuracy of predictions. For simplicity, we articulate this distinction via a consideration of sample size in the linear regression context. Consider a simple linear regression model in which we are predicting some outcome $y$ via $\hat{y}=\mathbb{E}(y|x)$. In such a scenario, we anticipate better predictions as sample size increases (see discussion below Eqn 6.35 on p. 210 of Wooldridge, 2013). In this sense, we would anticipate the difference $|\mathbb{E}(y|x)-y|$ to get smaller as the sample size increases. However, the differences $|\mathbb{E}(y|x)-y|$ are not portable given that they depend on the scale of $y$. In the linear regression context, it would perhaps be sufficient to rescale $y$. Binary outcomes are more complex given that both the mean and variance depend upon the single parameter of a Bernoulli random variable. For binary outcomes such as dichotomous responses, the InterModel Vigorish (IMV; Domingue et al., 2021) is designed to resolve this problem of portability given its construction, and we illustrate its appropriate sensitivity to factors that improve the prediction of novel responses in various simulation studies below.

Here, we build on initial work (i.e., the results in Domingue et al., 2021) to showcase how the IMV can be used in psychometric settings. In this paper, we conduct a series of simulation studies showing how the IMV can be used to understand the differences in predictions derived from IRT models for dichotomous items under a number of conditions. In addition, we leverage the portability of the IMV to make direct comparisons that allow us to describe the degree to which specific modeling choices impact prediction in controlled settings. We study the implications of a wide range of choices—the differences of distributions of key model parameters, of sample size, Bayesian priors, and estimation algorithms—on prediction quality in a metric that is both portable across these settings and whose values can be readily extended to work with empirical data. To illustrate this last point, we conduct empirical work involving a large volume of data (89 datasets from the Item Response Warehouse (IRW); Domingue & Kanopka 2023). These empirical analyses demonstrate the utility of the IMV in practice and assess the degree to which the modeling innovations considered in the simulations lead to predictive gains anticipated in idealized settings.

We also study the behavior of the IMV vis-à-vis the behavior of other alternative indices (e.g., information criteria; Burnham & Anderson, 2004). To be clear about our expectations: we anticipate that all indices are likely to provide similar information if the objective is to simply determine whether one model is a “better” fit to data than another. Our interest here is in what these indices tell us about the differences between models in relative rather than absolute ways (i.e., how much better is one model than another?). Initial evidence (Domingue et al., 2021) suggests that the IMV is quite sensitive to estimation error in a way that other indices (e.g., AUC; Hanley & McNeil, 1982) are not. These differences suggest that when there is interest in more than simply ranking the performance of models, the IMV provides novel information (that is closely related to errors in prediction) which may be useful in understanding the quantitative differences in their predictions. In this paper, we complement those findings with additional results focusing on indices widely used with latent variable models. These new results focus closely on the issue of sample size and its association with prediction quality.

This paper is organized as follows. We first discuss other indices before introducing the IMV in the context of item response models for dichotomously scored item responses. We then evaluate its performance in a variety of simulations both in isolation and in comparison to other indices. The metric’s applicability to empirical data is then illustrated in a wide variety of datasets. We close with a discussion of the IMV’s potential use in psychometrics.

1.1. Conventional Fit Indices for IRT Models

There is a substantial literature on assessing the degree to which a given IRT model aptly characterizes a given set of item responses. Rather than a complete review, we focus on key points related to the index we develop here. Many approaches involve computation of quantities for the purpose of assessing the data-model match. We will generically call these quantities “indices.” There are several types of indices to consider as alternatives (Swaminathan et al., 2006), some of which are meant to interrogate specific assumptions of the relevant IRT model. These include, for example, the infit and outfit statistics associated with the one parameter logistic (1PL) model (Wu and Adams, 2013) and analyses of dimensionality (Stout, 1987); we do not further discuss such indices here. There is also research on item- (Köhler et al., 2020) and person-level (see Chapter 6 in Van der Linden, 2017b) fit indices; we do not focus on such indices but return to this issue in Sect. 5. Rather, we focus on a range of approaches meant to describe and compare the overall “fit” (by which we mean, roughly, “how close are predictions to observations?”) of a given model to a dataset. We include discussion of indices that have distinctive features but that are similar in the sense that they are potential tools for the job of selecting or differentiating among various models.

One approach includes classical inferential tests of differences between models based on the likelihood ratio test. This is a widely used approach but hinges on the availability of large samples (Mavridis et al., 2007). In finite samples, there is frequent interest in likelihood-based approaches that correct for overfitting by favoring parsimony (Kang and Cohen, 2007). Such indices, e.g., Akaike’s information criterion (AIC; Akaike, 1973), Schwarz’s Bayesian information criterion (BIC; Schwarz, 1978), and the deviance information criterion (DIC; Spiegelhalter, Best, Carlin, & Van Der Linde, 2002), compare the fit of nested and non-nested models (including null models). One challenge with utilization of these indices is that their values are dependent on sample size; while approaches exist that reduce this dependence (e.g., Wagenmakers & Farrell, 2004), common usage hinges on values that are non-portable due to this sample size dependence.
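The sample size dependence can be made concrete with a small sketch. It uses only the standard definition AIC = 2k − 2 log L; the two models, their average per-response log-likelihoods (−0.60 and −0.58), and their parameter counts (20 and 40) are hypothetical values chosen for illustration, not results from this paper.

```python
def aic(avg_loglik, n_obs, n_params):
    """AIC = 2k - 2*logL, where logL = n_obs * (average per-response log-likelihood)."""
    return 2 * n_params - 2 * n_obs * avg_loglik


def aic_difference(n_obs, avg_ll_simple=-0.60, avg_ll_complex=-0.58,
                   k_simple=20, k_complex=40):
    """AIC(simple) - AIC(complex); positive values favor the complex model.

    All default values are hypothetical and serve only to illustrate scaling.
    """
    return aic(avg_ll_simple, n_obs, k_simple) - aic(avg_ll_complex, n_obs, k_complex)


# The per-response advantage of the complex model is held fixed,
# yet the AIC difference grows linearly with the number of responses.
for n in (1_000, 10_000, 100_000):
    print(n, round(aic_difference(n), 1))
```

Under these assumptions the difference equals −40 + 0.04n, so the same per-response gain yields AIC differences of very different magnitude depending only on how much data were collected.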

A different approach involves examination of the contingency tables. For a test with $n$ dichotomous items, there are $2^n$ possible response patterns. A complete comparison of the observed versus expected responses of each pattern would be computationally burdensome for even moderate $n$, but approaches emphasizing lower-dimensional summaries are useful (Maydeu-Olivares and Joe, 2005). Resulting statistics such as the $M_2$ can also be converted to root-mean-square error of approximation (RMSEA)-type indices that also emphasize model parsimony (Maydeu-Olivares, 2013). The RMSEA is not sample size dependent and is useful as a measure for overall fit of a model to the data but is challenging to use for purposes of comparison, as differences in RMSEA are sensitive to the size of the initial model (in degrees of freedom) which may lead to an inability to detect misfit in differences between large models (Savalei et al., 2021). In this paper, we compare the IMV to the AIC and RMSEA. These quantities differ in key ways, but they are all regularly used for the purpose of evaluating the performance of the kinds of models we consider here and they are thus useful for the purpose of evaluating the IMV. However, the IMV is also informed by recent shifts in thinking on the importance of prediction. We discuss these shifts below.

1.2. From Explanation to Prediction

Traditionally, fit indices and model parameters have been computed based on the same data. Historical limitations on computation frequently mandated such approaches and, of course, indices were often designed with such limitations in mind (e.g., the AIC penalty for overfitting based on the number of estimated parameters; Stone, 1977). More recent thinking, however, emphasizes the advantages of evaluating fit based on out-of-sample data in psychology (Yarkoni and Westfall, 2017), across the social sciences more broadly (Verhagen, 2022; Wolfram et al., 2022), and into the computational sciences more generally (Savcisens et al., 2023). This rapidly occurring (Rahal et al., 2022) change in perspective is tied to criticisms that social science research has historically been too narrowly focused on finding causal mechanisms based on an in-sample analysis of association-based models applied to observational data (Shmueli, 2010). As one example from psychological measurement of the gains such approaches may offer as compared to conventional in-sample studies, out-of-sample approaches may allow for improved identification of dimensionality in factor analysis settings (Haslbeck and van Bork, 2024).

We agree with arguments that such a consideration of prediction is essential for improving our theoretical understanding even when there is no inherent interest in prediction itself (Watts et al., 2018; Hofman et al., 2021). The move to prediction, which is not to be conflated with the use of highly bespoke models for use in “forecasting” exercises (Watts, 2014), allows us to provide improved insight into model fit, to construct bench-marking tools across modeling domains, and to generate insight into the behavior of complicated models. Full enjoyment of these benefits may require the use of novel indices for understanding the performance of predictive models above and beyond conventional indices that are perhaps not well suited to this purpose.

Thus, prediction of out-of-sample data has begun to emphasize indices specifically designed for such purposes. An early example of this kind of analysis in psychometric settings emphasized a version of the out-of-sample log-likelihood as performing better than many alternatives (Kang and Cohen, 2007; see Footnote 1). A more recent paper (Stenhaug and Domingue, 2022) introduced one insight that is key for our purposes. Frequently, analysis of out-of-sample data for purposes of model selection has focused on which approaches allow one to identify the data generating model. In contrast, Stenhaug and Domingue (2022) argue that we should instead be asking which models are maximally predictive of out-of-sample data. The more predictive model should be favored, irrespective of whether this model is also the data generating model. Although we would clearly expect the data generating model to be the most predictive in some cases, there are other cases where the data generating model may fare poorly for prediction (e.g., a highly complex model may generate poor predictions relative to a simpler alternative if there is insufficient data for precise estimation of the many parameters of the complex model). In the sense that it is designed to gauge the quality of prediction in out-of-sample tests, the IMV is a “predictive index.” This nomenclature is introduced so as to distinguish the IMV and its computation from more conventional indices such as those discussed above; this kind of predictive index is meant to help in quantifying modeling progress (Watts et al., 2018) with the ultimate goal of providing a “solution-oriented” approach to social science (Watts, 2017).

2. The InterModel Vigorish for IRT Models of Dichotomous Outcomes

2.1. The InterModel Vigorish

The IMV was first introduced in the context of generic dichotomous outcomes; we briefly describe its computation (for additional details, see Domingue et al., 2021) before moving to a discussion of its use in IRT settings. Consider the likelihood assigned by some model to each predicted outcome $p_i \in (0,1)$ for some Bernoulli random variable $y_i \in \{0,1\}$ (with $i \in \{1,\dots,n\}$):

$$L_{i}=p_i^{y_i}(1-p_i)^{1-y_i}. \tag{1}$$

We can summarize these via the geometric mean of the likelihoods

$$A=\left( \prod_{i=1}^n L_i \right)^{\frac{1}{n}}. \tag{2}$$

The IMV is based on a sequence of bets involving coins; we now note how these coins are identified before describing their usage. We identify a coin of weight $w$ via a calculation involving $A$; specifically, for a predictive system that leads to $A$, we find $w \in [0.5,1]$ such that

$$w\log (w)+(1-w)\log (1-w)=\log A. \tag{3}$$

The coin with weight $w$ has uncertainty equivalent to that of the full set of predictions $p_i$ of some outcome; a weight close to $w=0.5$ indicates a predictive system with high levels of uncertainty, whereas a coin with weight close to $w=1$ indicates predictions that are much more deterministic.
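The mapping from predictions to a coin weight in Eqns 1 to 3 can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: the function names, the use of bisection to solve Eqn 3, and the choice to clamp $w$ to 0.5 when $A \le 0.5$ (where Eqn 3 has no solution in $[0.5, 1]$) are all assumptions made here.

```python
import math


def geometric_mean_likelihood(y, p):
    """Eqns 1-2: geometric mean A of the per-response likelihoods.

    Averaging on the log scale avoids underflow from multiplying many
    probabilities together.
    """
    log_liks = [math.log(pi if yi == 1 else 1.0 - pi) for yi, pi in zip(y, p)]
    return math.exp(sum(log_liks) / len(log_liks))


def neg_entropy(w):
    """Left-hand side of Eqn 3; the (1-w)*log(1-w) term tends to 0 as w -> 1."""
    tail = (1.0 - w) * math.log(1.0 - w) if w < 1.0 else 0.0
    return w * math.log(w) + tail


def coin_weight(a, tol=1e-12):
    """Eqn 3: bisection for w in [0.5, 1] such that neg_entropy(w) = log(a)."""
    target = math.log(a)
    lo, hi = 0.5, 1.0
    if target <= neg_entropy(lo):
        # Assumption (not from the paper): clamp to a fair coin when A <= 0.5.
        return lo
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        # neg_entropy is increasing on [0.5, 1], so shrink the bracket toward the root.
        if neg_entropy(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical responses and predictions, for illustration only.
y = [1, 0, 1, 1]
p = [0.8, 0.2, 0.8, 0.8]
A = geometric_mean_likelihood(y, p)
w = coin_weight(A)
print(A, w)
```

Bisection is used here because the left-hand side of Eqn 3 is monotone on $[0.5, 1]$; any bracketing root finder would serve equally well.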

Suppose we now have predictions of the outcomes $y$ from two models; we will denote these predictions as $p_0$ and $p_1$ (omitting the $i$ subscript). Using Eqn 3, we identify coins $w_0$ and $w_1$. A fair bet (in the sense that neither side expects profit) is established based on $w_0$; one player bets \$1 on the positive outcome, while a second bets $\frac{1}{O}$ on the negative outcome ($O=\frac{w_0}{1-w_0}$). Unbeknownst to the player betting on the negative outcome, the coin of weight $w_0$ is replaced with a coin of weight $w_1$. If $w_1>w_0$ (i.e., predictions $p_1$ are better than those of $p_0$), the player betting on the positive outcome stands to gain. The IMV is this gain; it is based on the expected profit for the player betting on the positive outcome if $w_0$ is replaced with $w_1$. In that case, the person betting on the positive outcome now has additional information and expects to win

$$\text{IMV}\equiv \frac{w_1-w_0}{w_0}. \tag{4}$$

This quantity is the expected profit associated with the side information contained in the $p_1$ prediction that is only available to one party in a bet, while the other party only has information contained in the $p_0$ prediction.
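The expected profit in Eqn 4 can be checked directly from the stakes of the bet. In the worked derivation below, $\pi$ denotes the profit of the player betting on the positive outcome; this symbol is introduced here for illustration and is not part of the original notation.

```latex
\begin{aligned}
\text{Stakes: }\quad & \text{positive bettor risks } 1, \qquad
  \text{negative bettor risks } \frac{1}{O}=\frac{1-w_0}{w_0}. \\[4pt]
\mathbb{E}[\pi \mid w_0] &= w_0\cdot\frac{1-w_0}{w_0} - (1-w_0)\cdot 1 = 0
  \qquad \text{(the bet is fair under } w_0\text{)}, \\[4pt]
\mathbb{E}[\pi \mid w_1] &= w_1\cdot\frac{1-w_0}{w_0} - (1-w_1)\cdot 1
  = \frac{w_1 - w_1 w_0 - w_0 + w_0 w_1}{w_0}
  = \frac{w_1-w_0}{w_0} = \text{IMV}.
\end{aligned}
```

As a hypothetical numerical example, $w_0=0.6$ and $w_1=0.66$ give $\text{IMV}=0.06/0.6=0.1$, i.e., an expected profit of 10 cents on the \$1 stake.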

We pause to emphasize one crucial fact. Given that the calculations in Eqn 3 are based on the unadjusted likelihood, the IMV will be biased in favor of more complex models when evaluated in-sample. We thus rely on computation of the IMV in data not used for model estimation throughout.

The IMV has several favorable properties. First, it is a generalizable metric that is comparable across different data and models and, thus, can be used to generalize results across applications. Generalizability is ensured given that the IMV is always conditioned on the fair bet involving $w_0$. Second, given that IMV always requires predictions from two approaches (see Footnote 2), it naturally indexes change between the approaches. However, there is a natural null model (i.e., the outcome’s prevalence) that makes the IMV appropriate for evaluating the performance of a single predictive approach. Third, the IMV requires only predictions from two models about data; there are few additional restrictions. It can thus be used to make a variety of comparisons: We make comparisons across different item response models applied across different datasets, but it can also be used to make comparisons between different structural conditions (e.g., sample size), estimation strategies, and even between truth and estimates. We attempt to illustrate this flexibility (e.g., with use of the “Oracle” analysis introduced below meant to capture the last point) throughout the remainder of the paper. Fourth, values of the IMV can be interpreted straightforwardly as real numbers given their derivation. For example, if an IMV value from one predictive exercise is 10 times the value of the IMV from another, we can say that the information in the first exercise provides an order of magnitude more predictive value than that in the second exercise. We now discuss the extension of the IMV approach to an IRT framework.
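A minimal sketch of the IMV for a single predictive approach, using the outcome's prevalence as the null model, might look as follows. The function names, the bisection solver, the clamping of $w$ at 0.5, and the split into training and test outcomes are assumptions made for illustration; the sketch also assumes the training prevalence lies strictly between 0 and 1.

```python
import math


def coin_weight(y, p, iters=100):
    """Weight w in [0.5, 1] matching the mean log-likelihood of predictions p (Eqns 1-3)."""
    target = sum(math.log(pi if yi == 1 else 1.0 - pi) for yi, pi in zip(y, p)) / len(y)

    def neg_entropy(w):
        return w * math.log(w) + ((1.0 - w) * math.log(1.0 - w) if w < 1.0 else 0.0)

    lo, hi = 0.5, 1.0
    if target <= neg_entropy(lo):
        return lo  # assumption: clamp when predictions are no better than a fair coin
    for _ in range(iters):  # bisection; neg_entropy is increasing on [0.5, 1]
        mid = 0.5 * (lo + hi)
        if neg_entropy(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def imv(y, p0, p1):
    """Eqn 4: (w1 - w0) / w0 for baseline predictions p0 and enhanced predictions p1."""
    w0, w1 = coin_weight(y, p0), coin_weight(y, p1)
    return (w1 - w0) / w0


def imv_vs_prevalence(y_train, y_test, p1_test):
    """IMV of p1 relative to the null model that predicts the training prevalence."""
    prevalence = sum(y_train) / len(y_train)  # must lie strictly between 0 and 1
    return imv(y_test, [prevalence] * len(y_test), p1_test)


# Hypothetical data, for illustration only.
y_train = [1, 1, 0, 1, 0, 1, 0, 1]
y_test = [1, 0, 1, 1, 0, 1, 0, 1]
p1_test = [0.9 if yi == 1 else 0.1 for yi in y_test]
print(imv_vs_prevalence(y_train, y_test, p1_test))
```

Because both sets of predictions are scored against the same held-out outcomes, swapping the two arguments changes the sign of the numerator, and identical predictions give an IMV of zero.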

2.2. The IMV for IRT Models with Dichotomous Outcomes

When considering dichotomous item responses, application of the IMV is a fairly straightforward extension of the approach described above. Suppose we want to evaluate the fit of an IRT model to the dichotomously coded item response $x_{ij} \in \{0,1\}$ of person $i \in \{1,\dots,N\}$ to item $j \in \{1,\dots,J\}$. We consider item response models that describe the probability of a correct response for person $i$ to item $j$,

$$\Pr (x_{ij}=1)\equiv p_{ij}. \tag{5}$$

If, for example, we are considering the 3PL (Lord and Novick, 1968), then the predicted probability of a correct response is modeled as a function of person ability $\theta_i$, item difficulty $b_j$, item discrimination $a_j$, and guessing parameter $c_j$:

$$p_{ij}=c_j+\frac{1-c_j}{1+\exp (-a_j(\theta_i-b_j))}. \tag{6}$$
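Eqn 6 translates directly into code. The sketch below is illustrative; the function name `p_3pl` and the parameter values in the usage lines are hypothetical.

```python
import math


def p_3pl(theta, a, b, c):
    """Eqn 6: probability of a correct response under the 3PL.

    theta: person ability; a: item discrimination;
    b: item difficulty; c: guessing parameter (lower asymptote).
    Note: math.exp can overflow for extremely large values of -a*(theta - b).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# When ability equals difficulty, the probability sits halfway between c and 1.
print(p_3pl(theta=0.0, a=1.2, b=0.0, c=0.2))
# For very low ability the probability approaches the guessing parameter c.
print(p_3pl(theta=-30.0, a=1.2, b=0.0, c=0.2))
```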

We now introduce subscripts to denote probabilities from different approaches (and suppress $i$ and $j$ subscripts for readability). The vector of response-level probabilities, $p_1$, for the model of interest is constructed relative to some baseline model whose probabilities we denote as $p_0$. The value $p_0$ could be the predicted probability of a correct response from an alternative item response model, e.g., the 1PL model ($\forall j, a_j=1, c_j=0$) or the 2PL ($\forall j, c_j=0$), if $p_1$ is based on the 3PL. As a baseline, we can even consider simpler alternatives such as the overall mean ($\bar{x}$) or item-specific predictions that ignore information about the respondent ($\bar{x_j}\equiv \frac{1}{N}\sum_{i=1}^N x_{ij}$, i.e., the item p value from classical test theory, Crocker & Algina 1986).Footnote 3

Alongside predictions $p_0$ and $p_1$, the computation of the IMV requires a specific set of outcomes but is flexible in that any set of outcomes/predictions is sufficient. If outcomes are denoted as $x$, then we denote the IMV metric as $\text{IMV}(p_0,p_1;x)$. Here, we present out-of-sample predictions averaged across all items but emphasize this flexibility upfront. 
In computation of the IMV, we will use data $x$ that are not included in the process of estimating $p_0$ or $p_1$ (i.e., $x$ is out-of-sample or test data).Footnote 4 We occasionally denote such data as $x^\star$ when we want to emphasize that it is out-of-sample but retain the simpler $x$ notation here to emphasize that the basic idea does not require out-of-sample data.
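The following is a minimal sketch of how $\text{IMV}(p_0,p_1;x)$ might be computed. It assumes the definition from the earlier section (not reproduced here): each prediction vector is summarized by its mean log-likelihood $A$, which is mapped to the weight $w \in [0.5, 1]$ of an equivalent weighted coin solving $w\log w + (1-w)\log(1-w) = A$, and the IMV is $(w_1 - w_0)/w_0$. The function names, the bisection solver, the clipping constant, and the floor at $w = 0.5$ for predictions worse than a fair coin are all illustrative assumptions rather than the authors' implementation.

```python
import math

def mean_log_lik(p, x, eps=1e-12):
    """Mean Bernoulli log-likelihood of outcomes x under predictions p."""
    p = [min(max(v, eps), 1 - eps) for v in p]  # clip to avoid log(0)
    return sum(xi * math.log(pi) + (1 - xi) * math.log(1 - pi)
               for pi, xi in zip(p, x)) / len(x)

def coin_weight(a):
    """Solve w*log(w) + (1-w)*log(1-w) = a for w in [0.5, 1] by bisection."""
    if a <= -math.log(2):  # assumed floor: no better than a fair coin
        return 0.5
    lo, hi = 0.5, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        f = 0.0 if mid >= 1.0 else mid * math.log(mid) + (1 - mid) * math.log(1 - mid)
        if f < a:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def imv(p0, p1, x):
    """Relative gain in coin weight when moving from predictions p0 to p1."""
    w0 = coin_weight(mean_log_lik(p0, x))
    w1 = coin_weight(mean_log_lik(p1, x))
    return (w1 - w0) / w0

# Toy example: three correct responses and one incorrect response.
x = [1, 1, 1, 0]
print(imv([0.5] * 4, [0.75] * 4, x))
```

In this toy case the baseline corresponds to a fair coin ($w_0 = 0.5$) and the enhanced prediction to a coin with $w_1 = 0.75$, so the printed value is approximately 0.5.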

The IMV metric has various advantages. First, it can be used as an index in relative isolation (comparing model-based predictions to, for example, overall difficulty or item-specific difficulty) or as a comparison between two more sophisticated models. Second, when used in this second, relative sense, it offers great flexibility in the choice of comparison model (which does not need to be nested, and may also use different estimators). Third, being standardized as the expected profit of a bet, it is comparable across different models and data sets. In the following section, we discuss the performance of the IMV in a range of simulation studies but also offer simple illustrations of how to compute the IMV using both simulated and empirical data (see SI-S1).

3. The IMV in Simulation Studies

In this section, we illustrate the performance of the IMV in the context of dichotomous item responses using simulation studies. We specifically (1) evaluate how the IMV behaves under model misspecification, (2) demonstrate how it can be used to study overfitting and predictive accuracy, and (3) assess sensitivity to sample size. We also (4) contrast the behavior of the IMV to that of alternative metrics. Finally, we (5) contrast the range of IMVs computed in the simulation studies.

Note the following details regarding the simulation studies:

  • Unless otherwise noted, we simulate data $x$ via Eqn 6 with $\log a_j\sim \text{Normal}(0,\sigma^2)$, $b_j \sim \text{Normal}(0,1)$, and $c_j\sim \text{Unif}(0,C)$ for $N=1000$ respondents and varying numbers of items $N_j \in \{10, 25, 50, 200\}$. We sample abilities $\theta_i \sim \text{Normal}(0,1)$. For each condition, we generate 100 data sets.

  • Item response models are estimated with mirt (Chalmers, 2012). Item parameters are estimated via the expectation–maximization (EM) algorithm; person abilities are estimated via expected a posteriori (EAP). So as to ensure convergence in the case of small samples, where applicable we estimate item parameters using a lognormal prior with parameters (0,1) for discriminations and a beta prior for guessing with parameters (2, 17).Footnote 5

  • We compute the IMV using out-of-sample responses. In simulations, we produce a test dataset by sampling a new set of item responses, $x^\star$, based on the true $p_{ij}$ values. That is, we generate new responses from the same probability distribution used to generate the training data from which model estimates are derived; note that we thus compute the IMV based on the same amount of data used for estimation. Predictions associated with these out-of-sample responses are based on the item- and person-level parameters estimated under a given model.

Note that we use a similar approach to estimation in our empirical work discussed in Sect. 4.
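The data-generating step described in the first and third bullets can be sketched as follows. This covers only the simulation of training and holdout responses, not model estimation (which the paper performs with mirt). The function name `simulate`, its arguments, and the reduced sample sizes in the usage line are illustrative assumptions.

```python
import math
import random

def simulate(n_persons, n_items, sigma, C, rng):
    """Draw 3PL parameters, true probabilities, and two response matrices."""
    a = [math.exp(rng.gauss(0.0, sigma)) for _ in range(n_items)]  # log a_j ~ Normal(0, sigma^2)
    b = [rng.gauss(0.0, 1.0) for _ in range(n_items)]              # b_j ~ Normal(0, 1)
    c = [rng.uniform(0.0, C) for _ in range(n_items)]              # c_j ~ Unif(0, C)
    theta = [rng.gauss(0.0, 1.0) for _ in range(n_persons)]        # theta_i ~ Normal(0, 1)

    # True response probabilities p_ij from Eqn 6 (rows: persons, columns: items).
    p = [[c[j] + (1 - c[j]) / (1 + math.exp(-a[j] * (theta[i] - b[j])))
          for j in range(n_items)] for i in range(n_persons)]

    def draw():
        # One Bernoulli draw per person-item pair, using the true p_ij.
        return [[1 if rng.random() < pij else 0 for pij in row] for row in p]

    x = draw()       # training responses used for estimation
    x_star = draw()  # equally sized holdout responses from the same p_ij
    return p, x, x_star

# Smaller than the paper's N = 1000 to keep the example quick.
p, x, x_star = simulate(n_persons=200, n_items=10, sigma=0.5, C=0.3,
                        rng=random.Random(0))
print(len(x), len(x[0]))
```

Setting `sigma=0` and `C=0` reduces the generating model to the 1PL, which matches the $\sigma = C = 0$ condition used in the simulations.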

3.1. Prediction and Model Misspecification

We begin by investigating the performance of the IMV in analyzing estimates from models that do not necessarily have the same form as the generative model. To that end, we simulated data $x$ using Eqn 6 (based on manipulation of $C\in \{0,.3\}$ and $\sigma \in \{0,.25,.5\}$). All conditions were fully crossed. Based on the true response-level probability of a correct response, we generate an equivalently sized holdout sample $x^\star$ of responses. We then fit the 1PL, 2PL, and 3PL models to the data in $x$, obtaining estimates of $\widehat{p_{ij}}$ from each, and then compute IMVs for different combinations of predictions of $x^\star$. 
In particular, we consider IMV(1PL,2PL; $x^\star$) and IMV(2PL,3PL; $x^\star$) as a function of the $\sigma$ and $C$ values. We hypothesize, for example, an increase in IMV(1PL,2PL; $x^\star$) as $\sigma$ increases. We are able to make explicit statements about this hypothesis by quantifying the value associated with fitting more complex models.

Results are presented in Fig. 1 for varying numbers of items. We begin by considering the blue points, which show IMV(1PL,2PL). As expected, the 2PL provides increasing value relative to the 1PL as $\sigma$ increases. For $\sigma=0.5$, the IMV is between 0.01 and 0.015. We provide IRT-specific context for these values in subsequent sections; but, as an initial benchmark, evidence from other settings suggests that the move from the 1PL to 2PL is as valuable in IMV terms as, for example, information about age in the prediction of chronic disease among older people (Domingue et al., 2021). We view these values as evidence that the IMV is clearly able to detect scenarios wherein an overly restrictive model is being used as compared to a model that generates more accurate predictions. Moreover, this detection is quantified in a way that is portable. Given this portability, the gains in going from the 1PL to the 2PL can be compared directly to subsequent gains we observe from other kinds of modeling innovation.

Figure 1 The cost of misfit: IMV values for the 3PL relative to the 2PL (red) and the 2PL relative to the 1PL (blue). Points are averages across 100 datasets for each configuration of parameters (1000 respondents in all cases); line segments represent span of 0.025 to 0.975 quantiles over the 100 datasets. The “(N,M)” in parentheses are shorthand for the N and M parameter logistic IRT models.

We now make such a comparison by looking at gains associated with going from the 2PL to the 3PL. Consider the red points in Fig. 1, which show IMV(2PL,3PL). In contrast to the results shown for IMV(1PL,2PL), the 3PL never provides much additional value relative to the 2PL irrespective of the value of $C$. Average IMV values are never greater than 0.001. While the evaluation of the IMV(1PL,2PL) values suggested that the IMV was sensitive to that modeling change, consideration of IMV(2PL,3PL) suggests that the IMV is able to detect when adding model complexity does not improve predictions of new data. In SI-S2.2, we show that the IMV(2PL,3PL) values remain small even when guessing is more pronounced. This is due to previously identified problems related to the identification of 3PL model parameters (Maris and Bechger, 2009; Haberman, 2005; von Davier, 2009) and is a topic we return to in Sect. 5.

In the above, we consider comparisons between different IRT models. One advantage of the IMV is that it allows comparison across very different types of models that do not need to be from the same family of models. We illustrate this by also considering non-IRT mechanisms for generating $p_0$ or $p_1$; for example, we examine IMV($\bar{x_j}$,1PL), where $\bar{x_j} = \sum_{i} x_{ij} / N$ represents the response probability as calculated by the in-sample proportion of correct responses for each item (i.e., the classical item p value). Note that this probability is constant across persons within each item; the resulting IMV thus describes the value of the 1PL versus predictions that do not account for between-person differences in ability. 
We can similarly consider IMV($\bar{x},\bar{x_j};x^\star$), where $\bar{x}= \sum_{i,j} x_{ij} / (NJ)$ represents the mean response across all items and people. This quantity describes the value of predictions that account for item-level differences in difficulty as compared to a universal prediction based on overall difficulty alone.

The average IMV($\bar{x_j}$,1PL; $x^\star$) across all iterations in Fig. 1 was 0.1. Similarly, the average IMV($\bar{x},\bar{x_j};x^\star$) across all iterations was 0.26. Note that the IMV values from this simulation are maximal in the sense that they would be smaller if $\mathbb{E}(\theta_i-b_j) \ne 0$; we illustrate this fact in SI-S2.1. These IMVs indicate that the gains associated with the inclusion of item-level variation in the discrimination parameter relative to difficulty alone are an order of magnitude less valuable than allowing variation in $p_{ij}$ after considering both item- and person-parameters.
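To make the non-IRT baselines concrete, the sketch below computes the overall mean $\bar{x}$ and the item p values $\bar{x_j}$ from a tiny hand-made training matrix and evaluates IMV($\bar{x},\bar{x_j};x^\star$) on an equally small holdout matrix. The IMV helper assumes the weighted-coin definition from the earlier section; all names and the toy data are illustrative, and the resulting number is not comparable to the simulation averages reported above.

```python
import math

def mean_log_lik(p, x, eps=1e-12):
    """Mean Bernoulli log-likelihood of outcomes x under predictions p."""
    p = [min(max(v, eps), 1 - eps) for v in p]
    return sum(xi * math.log(pi) + (1 - xi) * math.log(1 - pi)
               for pi, xi in zip(p, x)) / len(x)

def coin_weight(a):
    """Weight w in [0.5, 1] with w*log(w) + (1-w)*log(1-w) = a (bisection)."""
    if a <= -math.log(2):  # assumed floor at a fair coin
        return 0.5
    lo, hi = 0.5, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        f = 0.0 if mid >= 1.0 else mid * math.log(mid) + (1 - mid) * math.log(1 - mid)
        if f < a:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def imv(p0, p1, x):
    w0, w1 = coin_weight(mean_log_lik(p0, x)), coin_weight(mean_log_lik(p1, x))
    return (w1 - w0) / w0

# Rows are persons, columns are items: item 1 is easy, item 2 is hard.
x_train = [[1, 0], [1, 0], [1, 1], [0, 0]]
x_star = [[1, 0], [1, 0], [0, 0], [1, 1]]
n, J = len(x_train), len(x_train[0])

x_bar = sum(sum(row) for row in x_train) / (n * J)                 # overall mean
x_bar_j = [sum(row[j] for row in x_train) / n for j in range(J)]   # item p values

flat = lambda m: [v for row in m for v in row]
p0 = [x_bar] * (n * J)      # same prediction for every response
p1 = flat([x_bar_j] * n)    # item-specific prediction, constant across persons

value = imv(p0, p1, flat(x_star))
print(x_bar, x_bar_j, value)
```

Here the overall mean is 0.5 and the item p values are 0.75 and 0.25, so the item-specific predictions are worth roughly 0.5 in IMV terms relative to the overall mean on this toy holdout set.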

3.2. Evaluation of Model-Based Predictions Versus Truth and a Consideration of Overfitting

We now evaluate the balance between prediction accuracy and overfitting. To do this, we compare model-based predictions to the true probabilities. Predictions evaluated on in-sample data suffer from overfitting, which we show by computing IMV values tailored to index overfitting using in-sample data. Predictions on out-of-sample data do not have this problem and are used to quantify the value that absolute truth (i.e., predictions from an oracle) has versus model-based predictions.

Formally, we compute Oracle values as IMV($\widehat{p_{ij}},p_{ij};x^\star$); that is, we are asking about the value that would be associated with knowing the true $p_{ij}$ value relative to our estimate $\widehat{p_{ij}}$ for out-of-sample data. Similarly, Overfit is defined as IMV($\widehat{p_{ij}},p_{ij};x$). As with the Oracle, we are again asking about the value associated with knowing the true $p_{ij}$ quantities relative to the estimates based on an IRT model. 
The key distinction is that for the Overfit we are computing IMVs based on in-sample data ($x$ rather than $x^\star$). If the Overfit value is below zero, this implies that the estimates are better predictors than the truth (a clear indication of overfitting). We will use $\widehat{p_{ij}}$ derived from the 1PL, 2PL, and 3PL models. We emphasize that these Oracle and Overfit values, as compared to the quantities such as IMV(1PL,2PL) considered in Fig. 1, are only available given that we are working with simulated data and thus know truth (i.e., $p_{ij}$).
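The Oracle and Overfit quantities can be sketched with a deliberately simple stand-in estimator. Instead of fitting an IRT model, the example below assumes the true probability varies only by item and estimates it with the in-sample item mean; this is a simplification for illustration, not the 1PL/2PL/3PL estimation used in the paper. The IMV helper again assumes the weighted-coin definition from the earlier section, and all names are hypothetical.

```python
import math
import random

def mean_log_lik(p, x, eps=1e-12):
    """Mean Bernoulli log-likelihood of outcomes x under predictions p."""
    p = [min(max(v, eps), 1 - eps) for v in p]
    return sum(xi * math.log(pi) + (1 - xi) * math.log(1 - pi)
               for pi, xi in zip(p, x)) / len(x)

def coin_weight(a):
    """Weight w in [0.5, 1] with w*log(w) + (1-w)*log(1-w) = a (bisection)."""
    if a <= -math.log(2):  # assumed floor at a fair coin
        return 0.5
    lo, hi = 0.5, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        f = 0.0 if mid >= 1.0 else mid * math.log(mid) + (1 - mid) * math.log(1 - mid)
        if f < a:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def imv(p0, p1, x):
    w0, w1 = coin_weight(mean_log_lik(p0, x)), coin_weight(mean_log_lik(p1, x))
    return (w1 - w0) / w0

rng = random.Random(1)
n_persons, n_items = 30, 20
p_true_item = [rng.uniform(0.2, 0.8) for _ in range(n_items)]  # truth varies by item only

def draw():
    return [[1 if rng.random() < pj else 0 for pj in p_true_item]
            for _ in range(n_persons)]

x, x_star = draw(), draw()  # in-sample and out-of-sample responses

# Stand-in estimator: in-sample item means (the per-item maximum likelihood estimate).
p_hat_item = [sum(row[j] for row in x) / n_persons for j in range(n_items)]

flat = lambda m: [v for row in m for v in row]
p_true = flat([p_true_item] * n_persons)
p_hat = flat([p_hat_item] * n_persons)

oracle = imv(p_hat, p_true, flat(x_star))  # value of truth on new data
overfit = imv(p_hat, p_true, flat(x))      # value of truth on the fitting data
print(oracle, overfit)
```

Because the item means maximize the in-sample likelihood, the Overfit value in this sketch cannot be positive: the estimates always fit $x$ at least as well as the truth does. The Oracle value, computed on $x^\star$, carries no such guarantee and is where the truth is expected to show its advantage.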

The results are shown in Fig. 2. Consider the 1PL: For these estimates, there is increasing value in the Oracle as $\sigma$ and $C$ increase. In contrast, the Oracle is positive but relatively constant across $\sigma$ and $C$ for the 2PL and 3PL, thus suggesting that it is the ability to fit the changing discrimination parameters that is resulting in valuable gains in estimates of item-level response probabilities. This dovetails with the previous observation regarding the fact that IMV(2PL,3PL) is very near zero irrespective of $C$. The fact that, for a given value of $C$, we see no change in the Oracle for the 2PL across values of $\sigma$ indicates that it is the flexibility associated with fitting varying discrimination parameters, not the level of variation in those parameters, that leads to predictive value from the 2PL relative to the truth (of course, IMV(1PL,2PL) increases as a function of $\sigma$, see Fig. 1).Footnote 6 Note that the value of the Oracle IMV when $\sigma=C=0$ is similar across all three IRT models (i.e., the points in each panel overlap) but depends on the number of items (i.e., it is near 0.02 for 25 items but less than 0.01 for 200 items); we further discuss this dependency on sample size below.

Figure 2 Oracle and Overfit values computed for simulations in Fig. 1. Solid dots represent Oracle values where the number indicates the IRT model (e.g., 3 indicates the 3PL). Hollow dots indicate Overfit values.

Turning to the Overfit values, we would generally expect better prediction when we have access to the true $p_{ij}$ values as compared to the estimated values $\widehat{p_{ij}}$. However, note that the Overfit values in Fig. 2 are generally negative. This confirms that the $\widehat{p_{ij}}$ values are overly tailored to $x$. 
The penalty is relatively constant for the 2PL and 3PL as a function of $\sigma$ and $C$, but the magnitude of the penalty depends on the number of items and is nearly zero for the largest number of items (i.e., estimates of $\widehat{p_{ij}}$ are nearly as good as $p_{ij}$). There are interesting features of the 1PL Overfit estimates. Consider the 25 item case. When the true model is effectively the 1PL (i.e., $\sigma=C=0$), all three IRT approaches see an expected Overfit IMV of nearly $-0.02$. 
However, as $\sigma$ and $C$ increase, the penalty associated with the 1PL model declines to nearly 0. A similar story holds as we increase the number of items and, in fact, the Overfit associated with the 1PL is actually positive when $\sigma$ and $C$ are relatively large, suggesting that the true $p_{ij}$ values are more predictive than the estimated $\widehat{p_{ij}}$ values. This indicates that the 1PL is more robust to Overfitting than the 2PL and 3PL when the data generating model is relatively complex but, of course, this comes at the cost of being overly restrictive in predicting outcomes when the 1PL is not the correct model (especially when $\sigma>0$).

3.3. Prediction Accuracy as a Function of Sample Size

When appropriate models are applied, larger samples should allow for more accurate estimation of model parameters. We thus use the IMV to index changes in predictive value as a function of sample size. We extend the above simulations to allow for varying numbers of respondents (up to 10,000). Results can be found in the supplementary materials (SI-S2.5). We demonstrate that the (IMV-derived) costs of misfit are not strongly sensitive to sample size, with one exception: when the data generating model is the 2PL or 3PL, the IMV(1PL,2PL) doubles from around 0.005 for 100 respondents to over 0.01 when there are several thousand respondents (SI-S2.5). These results are consistent with the notion that more respondents allow for more accurate estimates of slope parameters but do not translate into more accurate guessing parameters. Note that we are not arguing that the IMV’s value is sensitive to sample size in a way that invalidates comparisons; rather, predictions are improved (in some cases) with larger samples and these improved predictions lead to larger IMVs.

In the SI, we further explore the sensitivity of the Oracle values to sample size (SI-S2.6); we note two key findings. First, the Oracle values behave as expected as a function of sample size (i.e., they move toward zero for larger numbers of respondents and a fixed number of items). Second, there is a lower bound on the value of the Oracle for a fixed number of items. That is, increasingly large samples do not further generate value for a fixed number of items. This is because the $\widehat{p_{ij}}$ estimates cease to become more accurate given that the precision of ability estimates is limited by the number of items.

3.4. A Comparison of the IMV to Alternative Metrics

If interest is in a simple up/down decision about one model or another, we anticipate that commonly used metrics will provide similar information under many common scenarios. This is desirable; the IMV will typically provide the same information as other metrics if interest is solely in that decision (see simulations in Domingue et al., 2021). However, the IMV provides qualitatively different information if the focus is on the question of “how much better?” rather than strictly “which is better?”. We offer an illustration of this point focusing on commonly used indices in IRT settings with an emphasis on both the portability of the IMV and the fact that it is indexing prediction quality. These results, when joined with those from earlier work (Domingue et al., 2021), suggest that the IMV offers novel information compared to conventional metrics. We consider comparisons to the AIC (Akaike, 1973), specifically the difference in the AIC (Burnham and Anderson, 2004) between sequential models (i.e., $\text{AIC}_\text{1PL}-\text{AIC}_\text{2PL}$ and $\text{AIC}_\text{2PL}-\text{AIC}_\text{3PL}$), and the RMSEA based on the M2 statistic (Maydeu-Olivares et al., 2011). The AIC is widely used for model selection and involves a penalty based on the number of estimated parameters. Note that AIC does not depend on the number of estimated person abilities. The RMSEA considers parsimony by capturing the amount of model misspecification per degree of freedom (Browne and Cudeck, 1992); for comparisons to the other indices, we consider differences in RMSEA values. We also take advantage of the simulated setting to consider RMSEs between true and estimated probabilities. While it would not be useful in empirical settings where truth is not known, the behavior of the RMSEs offers a valuable benchmark here in that it helps to calibrate our understanding of the other metrics. In particular, we use the RMSE as our benchmark for gauging changes in predictive accuracy (Footnote 7).

We base simulations on the 2PL (Eqn 6 with $c_j=0$). We sample $b_j \sim \text{Normal}(0,1)$ and $\log a_j\sim \text{Normal}(0,0.3^2)$. We focus on the implications of sample size. Results for this first simulation study are shown in Fig. 3; we separately vary the number of items (top; $N \sim \text{Unif}(10,50)$) and the number of respondents (bottom; $N \sim \text{Unif}(100,1000)$). Consider first the root mean squared errors (RMSE; in the left panel), which contrast estimates from a given model with the true probabilities used to generate responses; the RMSE would not be available in practice but is useful here given that it allows us to benchmark the behavior of the various metrics to the RMSE’s comparisons of estimates to truth. The RMSE is smallest in absolute terms for the 2PL (given that the 2PL was used to generate the data, the 3PL is overfit to the in-sample data; see also results from Fig. 1 when $\sigma=0$), with the 1PL estimates being worse in larger samples. Note also that there is some increase in the difference in the RMSE for the initial growth in sample (either items or persons) but this growth seems to level out (especially as a function of the number of people) for larger sample sizes. A metric that is sensitive to improvements in prediction accuracy should behave similarly.
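As a rough illustration of the RMSE benchmark described above, the following sketch generates true 2PL probabilities using the stated sampling distributions and computes the RMSE against a second set of probabilities. Several details are assumptions made here for illustration and are not taken from the paper: the parameterization $a_j(\theta_i - b_j)$ (Eqn 6 may use a different form), standard normal abilities, the fixed sample sizes, and the use of an "equal slopes" stand-in in place of fitted 1PL estimates (no model is actually estimated).

```python
import math
import random

def p_2pl(theta, a, b):
    """2PL probability of a correct response (assumed form a * (theta - b))."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def rmse(p_true, p_hat):
    """Root mean squared error between two equal-length lists of probabilities."""
    sq = [(t - h) ** 2 for t, h in zip(p_true, p_hat)]
    return math.sqrt(sum(sq) / len(sq))

rng = random.Random(0)
n_people, n_items = 500, 20  # illustrative sizes, not the paper's design

theta = [rng.gauss(0.0, 1.0) for _ in range(n_people)]       # assumed N(0, 1) abilities
b = [rng.gauss(0.0, 1.0) for _ in range(n_items)]            # b_j ~ Normal(0, 1)
a = [math.exp(rng.gauss(0.0, 0.3)) for _ in range(n_items)]  # log a_j ~ Normal(0, 0.3^2)

# True response probabilities for every person-item pair.
p_true = [p_2pl(t, aj, bj) for t in theta for aj, bj in zip(a, b)]

# Hypothetical stand-in for misspecified predictions: slopes forced to 1.
p_equal_slopes = [p_2pl(t, 1.0, bj) for t in theta for bj in b]

rmse_self = rmse(p_true, p_true)                  # zero by construction
rmse_equal_slopes = rmse(p_true, p_equal_slopes)  # cost of ignoring slope variation
print(rmse_self, rmse_equal_slopes)
```

In an actual replication, `p_equal_slopes` would be replaced by predicted probabilities from fitted 1PL, 2PL, and 3PL models.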

Figure 3 Simulations comparing a variety of metrics (in columns; along with RMSE as compared to the true/known probabilities used to generate item responses) for 1/2/3PL estimates (shown as different colors). Data are generated via the 2PL based on different numbers of items (top; 1000 respondents) or different numbers of people (bottom; 50 items). Solid lines indicate comparisons; in the first row, the dashed lines indicate raw RMSEs for the 3 models. Results are based on LOESS smoothing for 1000 choices of the component of sample size being varied (top, items; bottom, people).

Turning to the metrics that are computed based on estimated quantities (rather than true probabilities), the IMV(1PL,2PL) increases as sample size increases while the IMV(2PL,3PL) is negative but quite small and fairly insensitive to changes in sample size. Both of these results are anticipated given the RMSE results. The IMV is sensitive in that increases in sample size lead to improvements in predictive accuracy. Suppose we go from 200 to 500 respondents being used to estimate model parameters. This increase results in gains in accuracy for the 2PL relative to the 1PL that we can observe in the RMSE, and we similarly observe increases in the IMV. Further increases in sample size (i.e., going from 500 to 1000) do not yield tangible differences in predictive accuracy (again see the RMSE) and the IMV is similarly flat.

For the RMSEA and AIC values, we focus interpretation on the areas of difference. The RMSEA does little to capture the divergence in 1PL and 2PL/3PL predictions as the number of respondents increases (the RMSEA also misidentifies the generating model in SI-S2.7). The AIC’s sensitivity to the number of estimated item parameters is apparent in the upper right panel; rather than level out as do the RMSE and IMV values, the AIC differences are largely linear. These panels help to illustrate the issue of portability associated with the AIC. The RMSE values, for example, are always comparable for the 2PL and 3PL irrespective of sample size. However, the AIC differences comparing the 2PL and 3PL results vary from near 0 to nearly $-200$ depending on the relevant sample sizes. In contrast to what we observed with the IMV, the RMSEA is not sensitive to changes in predictive accuracy (note the flatness of the curves in the bottom panel) and the AIC is not portable (the AIC grows linearly as a function of the sample size irrespective of the predictive gains suggested by the RMSE). In SI-S2.7, we further illustrate these points of difference between the IMV and AIC/RMSEA by focusing on variation in $\mu$ when $b_j \sim \text{Normal}(\mu,1^2)$.
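One way to see why AIC differences scale with the size of the data is to write out the standard definition of the AIC. The notation below is introduced here for illustration and is not the paper's: $k$ is the number of estimated item parameters, $\hat{\ell}$ is the maximized log-likelihood, and $\hat{\ell}_{i}$ is the contribution of respondent $i$ among $N$ respondents.

$$\text{AIC} = 2k - 2\hat{\ell}, \qquad \text{AIC}_\text{2PL}-\text{AIC}_\text{3PL} = 2\,(k_\text{2PL}-k_\text{3PL}) - 2\sum_{i=1}^{N}\left(\hat{\ell}_{i,\text{2PL}}-\hat{\ell}_{i,\text{3PL}}\right)$$

Under this reading, the penalty term changes with the number of items (assuming one additional parameter per item when moving from the 2PL to the 3PL) and the likelihood term is a sum over respondents, so neither is expressed on a per-response scale. This is a sketch of the general mechanism rather than a derivation of the specific values shown in Fig. 3.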

3.5. Synthesizing the Simulation Results

We synthesize IMV values from the simulation studies in Table 1. These provide generic guidance about the kind of increase in predictive performance that comes from different modeling innovations. We offer them for the purposes of helping to develop intuition about the degree to which modeling choices affect predictive accuracy. When the data generating model is a 1PL, adoption of item-level variation in prediction is incredibly valuable, $\text{IMV}(\bar{x},\bar{x_j})=0.3$, relative to prediction based on the overall p value alone. Prediction based on the 1PL leads to $\text{IMV}(\bar{x_j},\text{1PL})=0.1$; incorporation of person-level variation is, not surprisingly, quite useful for improving predictions even after item-level variation has been included.

Turning now to data generated from more complex models, prediction based on the 2PL rather than the 1PL is an order of magnitude less valuable, $\text{IMV}(\text{1PL},\text{2PL})=0.01$, than $\text{IMV}(\bar{x_j},\text{1PL})$. This value depends on a choice of $\sigma$. Here, we focus on results for $\sigma=0.5$, which corresponds to $a_j$ parameters whose 10th and 90th percentiles are 0.53 and 1.88; different choices of $\sigma$ lead to different IMV values (see Fig. 1), and we return to this point in our discussion of empirical results. As an additional point of comparison, we can compute the IMV based on different approaches to generating ability estimates. After computing both MLE and EAP estimates, we can compute IMV(MLE,EAP); see SI-S2.3. Predictions based on the EAP versus the MLE are nearly an order of magnitude less valuable still, IMV(MLE,EAP)=0.002. Transitioning from the 2PL to the 3PL as the generative model, recovery via the 3PL has an IMV much smaller than the switch in ability estimation methods, IMV(2PL,3PL)=0.0005.

Table 1 Approximate expected IMVs for different modeling scenarios.

a Data generated via Eqn 6 with $a_j=0$, $c_j=0$, $\mathbb{E}(b)=0$, and $\text{Var}(b)=1$.

b Data generated via Eqn 6 with $\sigma=0.5$ and $C=0.3$.

Results based on simulated item response data for 50 items and 1000 respondents using the appropriate generating model.

To emphasize the portability of the IMV, we can also compare these values to those generated in non-IRT settings (Domingue et al., 2021). The value of allowing for item-level variation in predictions ($\text{IMV}(\bar{x},\bar{x_j})=0.3$) is similar to the value of demographics in predicting the political affiliation of US adults in 1991. Moving to the 1PL ($\text{IMV}(\bar{x_j},\text{1PL})=0.1$) is akin to the value that self-reported symptoms (e.g., loss of taste) provided in predicting COVID infections early in the outbreak. Relative to the 1PL, the 2PL provided predictive value on the order of what age and sex provide in predicting high blood pressure among respondents near 63y of age. Such comparisons are useful in helping us understand the general utility of modeling improvements made in IRT by allowing us to contrast them with the predictive gains observed in other contexts.

We believe the values in Table 1 have implications for application. For example, there is a relatively large literature on the sample size requirements of the 3PL (see discussion in Feuerstahler, 2020). In our view, the difference between the 2PL and 3PL is fairly negligible in terms of the value offered by the predictions even when the 3PL is the true data generating model. This is not to say that estimates from the 3PL can never help to identify items that have issues associated with guessing or that such information would not be valuable; rather, we are asserting that, if interest is in the quality of the resulting predictions, the difference between the 2PL and 3PL is likely negligible.

4. The IMV in Empirical Data

In order to evaluate how different models improve prediction of item responses in practice, we apply the IMV approach to 89 dichotomously scored item response datasets taken from the IRW (Domingue and Kanopka, 2023). These data span a range of cognitive and affective tasks. We consider the broad range of data for the purpose of both illustrating common trends related to variation in prediction quality and also studying interesting exceptions revealed by the IMV values. The median dataset has 1500 respondents (range 118–10,000) and 36 items (range 4–529). For a description of each dataset and its associated results, please see SI-S3.

We focus on analysis of IMVs computed for various modeling approaches and use the quantities in Table 1 as a means of understanding the relative magnitude of these values. We use the same analysis pipeline as with simulated data with a few exceptions. To minimize the computational costs, we consider a random sample of 10,000 respondents in datasets with large numbers of respondents. To compute the IMV, we implement the “missing response” paradigm (Stenhaug and Domingue, 2022) by splitting the response-level data at random into $k=4$ folds (Footnote 8). For a given fold, we use all responses not in that fold to produce estimates of ability and item parameters and then combine those to generate predictions for the responses in the fold. The same priors as in the simulations were used here for estimation. IMVs are computed based on those predictions; we take the average IMV across the folds.
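The fold-based procedure just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline. It assumes the IMV is computed as defined earlier in the paper and in Domingue et al. (2021): each set of predictions is mapped to the weight $w$ of a coin with the same expected log-likelihood, and the IMV is $(w_1 - w_0)/w_0$. To stay self-contained it compares grand-mean predictions with item-mean predictions (the $\bar{x}$ versus $\bar{x_j}$ contrast) rather than fitting IRT models. The function names, the toy data generator, and the convention of returning $w=0.5$ for predictions no better than a fair coin are all hypothetical choices made here.

```python
import math
import random

def mean_loglik(y, p, eps=1e-12):
    """Average Bernoulli log-likelihood of outcomes y under predictions p."""
    total = 0.0
    for yi, pi in zip(y, p):
        pi = min(max(pi, eps), 1.0 - eps)  # guard against log(0)
        total += math.log(pi) if yi == 1 else math.log(1.0 - pi)
    return total / len(y)

def equivalent_coin(avg_ll, tol=1e-12):
    """Weight w in [0.5, 1) with w*log(w) + (1-w)*log(1-w) equal to avg_ll."""
    if avg_ll <= math.log(0.5):
        return 0.5  # assumed convention: no better than a fair coin
    lo, hi = 0.5, 1.0 - 1e-15
    while hi - lo > tol:  # bisection; the target function is increasing in w
        mid = (lo + hi) / 2.0
        val = mid * math.log(mid) + (1.0 - mid) * math.log(1.0 - mid)
        if val < avg_ll:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def imv(y, p_baseline, p_enhanced):
    """Relative gain in equivalent-coin weight of enhanced over baseline."""
    w0 = equivalent_coin(mean_loglik(y, p_baseline))
    w1 = equivalent_coin(mean_loglik(y, p_enhanced))
    return (w1 - w0) / w0

def assign_folds(n_responses, k=4, seed=0):
    """Randomly assign each response to one of k folds of near-equal size."""
    folds = [i % k for i in range(n_responses)]
    random.Random(seed).shuffle(folds)
    return folds

# Toy long-format data (person, item, response) from a Rasch-like generator.
rng = random.Random(1)
n_people, n_items, k = 200, 10, 4
theta = [rng.gauss(0.0, 1.0) for _ in range(n_people)]
b = [rng.gauss(0.0, 1.0) for _ in range(n_items)]
data = []
for i in range(n_people):
    for j in range(n_items):
        p = 1.0 / (1.0 + math.exp(-(theta[i] - b[j])))
        data.append((i, j, 1 if rng.random() < p else 0))

folds = assign_folds(len(data), k=k, seed=2)
fold_imvs = []
for f in range(k):
    train = [r for r, g in zip(data, folds) if g != f]
    test = [r for r, g in zip(data, folds) if g == f]
    # "Estimate" on the training responses only, then predict held-out ones.
    grand_mean = sum(x for _, _, x in train) / len(train)
    item_means = []
    for j in range(n_items):
        xs = [x for _, jj, x in train if jj == j]
        item_means.append(sum(xs) / len(xs))
    y = [x for _, _, x in test]
    p0 = [grand_mean] * len(test)
    p1 = [item_means[j] for _, j, _ in test]
    fold_imvs.append(imv(y, p0, p1))

avg_imv = sum(fold_imvs) / len(fold_imvs)  # average IMV across folds
print(avg_imv)
```

Swapping `p0` and `p1` for held-out predictions from two fitted IRT models would give the model comparisons reported in this section; the fold assignment and averaging steps would be unchanged.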

Figure 4 IMVs computed using different models with 89 empirical datasets. Gray lines are similarly placed in each figure to emphasize comparability across results (and average IMVs for each approach are also shown). Left: IMV(CTT,1PL) as a function of the average response $\bar{x}$ in a given dataset. Right: IMVs contrasting a range of IRT models.

We begin by comparing the predictions from the 1PL model to those that use simply the item-level mean (i.e., $\bar{x_j}$). A visualization of the IMVs as a function of the average response $\bar{x}$ is shown in Fig. 4; correlations between dataset descriptive statistics and IMV results are shown in Table 2. The average value of IMV($\bar{x_j}$,1PL) was 0.075; note that this is less than, but close to, the 0.1 value from simulation studies (e.g., Table 1 when the 1PL was the true model). There are some extreme values; the largest value (IMV($\bar{x_j}$,1PL)=0.47) is for data from a four-item attitudinal survey regarding abortion (see Rizopoulos, 2006).

We emphasize a few additional points related to these IMV($\bar{x_j}$,1PL) values. First, as expected, IMVs are generally larger for datasets that have items with correct response rates near 50% ($r=-0.42$, see Table 2). This is due to the fact that the level of uncertainty for the Bernoulli random variable varies with prevalence; the IMV is related to uncertainty and there is simply less uncertainty when prevalences are far from 0.5 for models to explain. That said, for a given average response level, there is still variation in the IMV thus indicating that the IMV is sensitive to prediction quality on top of overall difficulty. Second, the IMVs are effectively independent of the number of people (correlation of $-0.07$) and only weakly associated with the number of items ($-0.22$); we offer a potential explanation for this negative correlation with the number of items below. Third, to indicate the flexibility of the IMV, we also compare estimates from the 1PL model to an alternative baseline: that of the Guttman model (Guttman, 1950) (Footnote 9). The expected payoffs in this case are quite large (average IMV(Guttman,1PL) $=0.53$), which is consistent with arguments regarding the utility of probabilistic item response models (Sijtsma, 2012).
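The first point above, that there is less uncertainty for a model to explain when prevalence is far from 0.5, can be illustrated with the entropy of a Bernoulli variable. The sketch below is only meant to build intuition; the function name and the chosen prevalence values are hypothetical and not drawn from the paper.

```python
import math

def bernoulli_entropy(p):
    """Entropy in nats of a Bernoulli(p) outcome; largest at p = 0.5."""
    if p in (0.0, 1.0):
        return 0.0  # a certain outcome carries no uncertainty
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

# Uncertainty shrinks as the prevalence moves away from 0.5.
for prevalence in (0.5, 0.7, 0.9, 0.99):
    print(f"prevalence={prevalence:.2f}  entropy={bernoulli_entropy(prevalence):.3f}")
```

With less entropy available at extreme prevalences, even a well-fitting model has little room to improve on the item-level mean, which is consistent with the negative correlation between the IMV and $|.5-\bar{x}|$ reported in Table 2.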

Table 2 Correlations between IMV values (for 1PL, 2PL, 3PL, and 2F-2PL models) and key dataset descriptives (numbers of people and items; $|.5-\bar{x}|$ where $\bar{x}$ is the average response for a dataset) for empirical analyses.

a One dataset had over 500 items; correlations with the number of items are computed with this dataset removed.

Turning to more complex models beginning with the 2PL, we observe an average IMV(1PL,2PL)=0.006, an order of magnitude smaller than the mean IMV($\bar{x_j}$,1PL). The IMV associated with moving from the 1PL to the 2PL was somewhat larger in simulation studies (IMV(1PL,2PL)=0.01) than the average here but recall that this quantity depended on a choice of $\sigma$; we view the proximity of these empirical results to the value observed in simulations as supportive of the IMV values associated with $\sigma=0.5$ that we focus on in Table 1. There is variation in these IMV(1PL,2PL) quantities (max of 0.068 for data from toddlers on balance-problem items; Van Maanen, Been, & Sijtsma, 1989); note that there was also significant variability in the simulation-based IMV(1PL,2PL) values of Fig. 1. The average IMV(2PL,3PL) is 0.00016 with a max of 0.0046 (data from a literacy intervention; Gilbert, Kim, & Miratrix, 2023); these results suggest weak improvements in predictive value for the 3PL relative to the 2PL as was observed in simulation studies.

As a contrast to the 3PL results, we also consider a fully exploratory two factor 2PL (2F-2PL). Data restrictions led to analysis in only 60 datasets (Footnote 10). The average IMV(2PL,2F-2PL) was 0.0031; this is somewhat larger than the IMV(2PL,3PL) results but there was also more variability in the right tail. The maximal value was 0.029 (data from a personality inventory; Eysenck & Eysenck, 1968); indeed, many of the cases wherein the 2F-2PL offered large predictive increases were based on personality inventories that are conventionally assumed to be multidimensional. We thus argue that the 2F-2PL is a modeling innovation that is able to produce tangible gains in predictive value in some tailored cases; the 3PL provides, at best, very weak predictive value. When combined with the simulation evidence (e.g., Figure SI-S2.2), we are pessimistic about the utility of fitting the 3PL.

We offer two additional notes about the multidimensional results. First, as context for the larger IMVs observed for the IMV(2PL,2F-2PL), we considered a simulation study (see SI-S2.8); IMVs greater than 0.02 can be obtained even when we simulate data with fairly strongly correlated latent factors ($\rho>0.5$). Second, note that, as one may expect, the correlation between the IMV(2PL,2F-2PL) and the number of items was positive ($r=0.31$), which may be one component of the observed negative correlation between IMV($\bar{x_j}$,1PL) and the number of items.

5. Discussion

A great volume of psychometric research is concerned with adjudicating between different modeling approaches. A variety of approaches are available for making such decisions. In our view, these approaches have many shared weaknesses. In particular, we are concerned about a lack of portability across settings that leads to an impoverished intuition about the predictive gains associated with modeling choices among applied researchers. The IMV is a different approach that quantifies the predictive value encoded in one model relative to another. It quantifies the gain in prediction in the form of expected winnings in bets due to the side information encoded in the focal model as compared to some baseline. In contrast to most existing fit indices, the IMV is a predictive index that assesses a model’s success at out-of-sample prediction. As such, it is well suited to address an increasing interest in prediction (even when explanation is the ultimate goal; Yarkoni & Westfall, 2017) and is highly portable in that its values can be meaningfully compared across a variety of contexts. In this paper, we describe how the IMV can be used with IRT models of dichotomous item responses.

We described a sequence of simulation studies that are collectively meant to demonstrate the utility of the IMV as a means of understanding the functioning of IRT models. We studied the IMV as a tool for understanding misfit, for example. Of particular interest is the observation that IMV(2PL,3PL) tends to be near zero as the 2PL can effectively approximate guessing via adjustment to difficulty and discrimination parameters in a way that makes the resulting out-of-sample 2PL estimates of $\widehat{p_{ij}}$ highly comparable to those produced by the 3PL. Results here are similar to others (e.g., Stenhaug & Domingue, 2022) in suggesting that the 3PL might have limited utility in many settings.

We also use the inherently comparative nature of the IMV to introduce the Oracle and the Overfit values. We use these values to illustrate a few pertinent facts. When the 1PL model is used to simulate data, there is no value associated with fitting more complex models. This is expected since the innovations of the 2PL and 3PL are not necessary. Note that, even in this simple case, the IMV of truth relative to estimates (i.e., the Oracle) does not decline to zero as a function of the number of respondents alone; the number of items also needs to get increasingly large. This is a useful reminder regarding the limited utility of having ever more respondents. Even when the 2PL is used to generate data, estimates from the 2PL are overfit to the data in a way that is not true of 1PL estimates (i.e., true response-level probabilities are better predictors of new data than 1PL-based estimates but worse than 2PL-based estimates). In this sense, the 1PL-based estimates have lower variance but higher bias, as described by the bias-variance trade-off (Doroudi, 2020).

We also compared the IMV to alternative metrics. The IMV was observed to be reflective of variation in the RMSE between true and estimated probabilities of responses as a function of various quantities manipulated in simulation studies in a way that made it distinctive as compared to the AIC and RMSEA. We also emphasize the ease of interpretation of the IMV. While there is guidance on generic interpretations of the other indices (e.g., Browne & Cudeck, 1992 for the RMSEA and Burnham & Anderson, 2004 for the AIC differences), the interpretation of the IMV is aided by the quantities shown in Table 1. We used simulations to offer context to the improvements in prediction associated with different modeling innovations that, we believe, could be highly useful in future work. Table 1 describes the level of predictive power that we should expect under known conditions. For example, we should anticipate IMV(1PL,2PL) approaching 0.01 when there is in fact substantial variation in the discrimination parameters. (Recall that we simulate $\log a_j \sim \text{Normal}(0,\sigma^2)$ with $\sigma = 0.5$.) Future work with the 2PL can still be informed by this rough benchmark (and, of course, more precise benchmarks can be obtained; e.g., McNeish & Wolf, 2021). Finally, note also the flexibility of the IMV. We use it here to study the effects of sample size, priors, and estimation algorithms on prediction quality; this flexibility is a vital component of the IMV’s appeal.
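The data-generating setup recalled above can be sketched in Python as follows. Only the distribution of the discriminations ($\log a_j \sim \text{Normal}(0,\sigma^2)$ with $\sigma = 0.5$) is taken from the text; the standard normal abilities and difficulties, the seed, and the function name `simulate_2pl` are assumptions made for illustration.

```python
import numpy as np

def simulate_2pl(n_persons, n_items, sigma=0.5, seed=0):
    """Simulate dichotomous responses from a 2PL model."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(0.0, 1.0, n_persons)        # abilities (assumed N(0, 1))
    b = rng.normal(0.0, 1.0, n_items)              # difficulties (assumed N(0, 1))
    a = np.exp(rng.normal(0.0, sigma, n_items))    # log a_j ~ Normal(0, sigma^2)
    # True response probabilities: persons in rows, items in columns.
    p = 1.0 / (1.0 + np.exp(-a[None, :] * (theta[:, None] - b[None, :])))
    x = (rng.random(p.shape) < p).astype(int)      # Bernoulli draws
    return x, p, a
```

Setting `sigma=0` fixes every discrimination at 1 and so reduces the generator to the 1PL, which is one way to reproduce the contrast between conditions with and without variation in discrimination.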

In our view, the evidence from simulation and empirical work suggests that the IMV is a useful new tool for understanding the performance of IRT models. The IMV approach provides a complementary perspective based on the level of predictive difference across models rather than attempts to ascertain the true model; in particular, values from application of different IRT models in empirical settings can be compared to the quantities in Table 1 so as to indicate whether the performance of a model in a given data context is unique or as expected. The IMV’s focus on how models predict new data is important. We agree with others (Yarkoni & Westfall, 2017) that prediction is relevant even when the goal is to identify mechanisms.

The simplicity of the IMV—it merely requires predictions of responses generated by any mechanism—suggests that it could be further used in other settings. For example, the IMV could be used to understand the behavior of specific items. There are also possibilities of further extending the IMV to deal with non-dichotomous responses. While there are complexities associated with such extensions, advances on this front would be of potential utility as they might allow for straightforward comparison of the performance of IRT models across both dichotomous and polytomous items. This simplicity comes at one potential cost, however. The IMV’s reliance on cross-validation may necessitate somewhat larger samples; future work can focus on the implications of using this method with smaller samples.

In the future, the IMV could be used to further quantify the degree to which modeling innovations provide value in predicting new data relative to conventional alternatives. Innovations in psychometric techniques are welcome, even ones that produce relatively marginal predictive improvements. However, it is our view that a firmer foundation for future development of psychometric models would include a generalizable tool for understanding the magnitudes of predictive improvement offered by a given innovation. The IMV is such a tool.

Acknowledgements

We would like to thank Stanford University and the Stanford Research Computing Center for providing computational resources and support that contributed to these research results. The authors would like to acknowledge helpful comments from Leah Feuerstahler, Scott Monroe, David Torres Irribarra, Roy Levy, and the members of the Measurement Lab at the Harvard Graduate School of Education.

Data Availability

This paper relies on a mixture of public and private data. Links to public data are shared in the SI. Code is also available, see SI and https://github.com/intermodelvigorish.

Declarations

Competing Interests

The authors have no competing interests to declare that are relevant to the content of this article.

Footnotes

Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/s11336-024-09977-2.

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

1 This idea dates back more broadly to the generalized cross-validation (GCV) method (Craven & Wahba, 1978) used to estimate the correct degree of smoothing noisy data with spline functions.

2 In the sense that the IMV can be used as a relative metric, it is similar to, for example, the Tucker–Lewis index which has been used in structural equation modeling (Maydeu-Olivares & Garcia-Forero, 2010; Han et al., 2022; Cai et al., 2021).

3 Note that the $\bar{x_j}$ value is also termed the 0PL in some settings. However, there is ambiguity in how this term is used, with some usage indicating predictions invariant across persons (as used here; see the “0PL-item” model in Reddy, Labutov, Banerjee, & Joachims, 2016), whereas in other cases it indicates predictions invariant across items (Haberman et al., 2011; Wainer, 2016).

4 In simulation settings, we will generate new data for testing purposes from the known data generating model. In empirical settings, we use K-fold cross-validation (James et al., 2013).
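A minimal Python sketch of K-fold cross-validation for a response matrix follows. It assumes that folds are formed over individual person-by-item responses and that held-out responses are masked as missing before model fitting; the exact fold scheme used in the paper is not specified in this note, and the names `assign_folds` and `hold_out_fold` are hypothetical.

```python
import numpy as np

def assign_folds(shape, k=4, seed=0):
    """Randomly assign each cell of a response matrix to one of k folds."""
    rng = np.random.default_rng(seed)
    n_cells = shape[0] * shape[1]
    # Shuffle cell indices, then take them mod k for near-equal fold sizes.
    folds = rng.permutation(n_cells) % k
    return folds.reshape(shape)

def hold_out_fold(x, folds, fold):
    """Return training data with the chosen fold masked as NaN, plus the test mask."""
    test = folds == fold
    train = np.asarray(x, dtype=float).copy()
    train[test] = np.nan  # a model would be fit to the remaining responses
    return train, test
```

For each fold, a model would be fit to `train`, its predicted probabilities for the cells in `test` collected, and the predictions pooled across folds before computing an out-of-sample index.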

5 We considered the sensitivity to choice of prior in simulation studies, see SI-S2.4.

6 Ancillary analyses suggest that we do observe systematic variation in the 2PL Oracle across values of $\sigma$ if the sample size is smaller.

7 Note that the RMSE would not, in general, be portable across simulation studies given the dependence on $\mathbb{E}(\theta - b_j)$. However, in the simulation study here we do not vary $\mathbb{E}(\theta - b_j)$ and thus are focusing on values of the RMSE that are directly comparable.

8 We conduct a small simulation study regarding the choice of k, see SI-S3.2. Resulting IMVs are relatively insensitive to the particular choice of k.

9 Specifically, we use the person-level ability estimates $\theta_i$ and the 1PL difficulty estimates $\delta_j$ and set the probability of response where $\theta_i > b_j$ as 0.99 and 0.01 otherwise. (Note that we cannot use 1 and 0, respectively, given that these would lead to malformed likelihoods.)
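The deterministic predictions described in this note can be sketched in Python as below. The thresholds 0.99 and 0.01 come from the text; the function names and the array layout (persons in rows, items in columns) are assumptions of the sketch.

```python
import numpy as np

def guttman_predictions(theta, b, hi=0.99, lo=0.01):
    """Predict hi where ability exceeds item difficulty and lo otherwise."""
    theta = np.asarray(theta, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.where(theta[:, None] > b[None, :], hi, lo)

def log_likelihood(x, p):
    """Bernoulli log-likelihood of responses x under predicted probabilities p."""
    x = np.asarray(x, dtype=float)
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
```

Because every prediction lies strictly between 0 and 1, the log-likelihood stays finite even when a response contradicts the prediction; with predictions of exactly 1 and 0, a single such response would contribute $\log 0$.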

10 We considered multidimensional analysis of a subset of the 89 empirical datasets. We required that the dataset contains observations from at least 500 respondents and used a random sample of 25,000 respondents rather than 10,000 respondents for larger datasets. We also omit results for datasets wherein either the estimation algorithm did not converge or the estimated abilities were effectively identical (a mean absolute deviation between abilities of less than 0.001). These restrictions left us with results for 60 datasets.

References

Akaike, H. (1973). Maximum likelihood identification of gaussian autoregressive moving average models. Biometrika, 60(2), 255–265.
Browne, M. W., & Cudeck, R. (1992). Alternative ways of assessing model fit. Sociological methods & research, 21(2), 230–258.
Burnham, K. P., & Anderson, D. R. (2004). Multimodel inference: understanding aic and bic in model selection. Sociological methods & research, 33(2), 261–304.
Cai, L., Chung, S.W., & Lee, T. (2021). Incremental model fit assessment in the case of categorical data: Tucker–lewis index for item response theory modeling. Prevention Science, 1–12.
Chalmers, R. P. (2012). mirt: A multidimensional item response theory package for the r environment. Journal of statistical Software, 48(1), 1–29.
Craven, P., & Wahba, G. (1978). Smoothing noisy data with spline functions. Numerische mathematik, 31(4), 377–403.
Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. ERIC.
Domingue, B., & Kanopka, K. (2023). The item response warehouse (irw). https://osf.io/preprints/psyarxiv/7bd54.
Domingue, B., Rahal, C., Faul, J., Freese, J., Kanopka, K., Rigos, A., & Tripathi, A. (2021). Intermodel vigorish (imv): A novel approach for quantifying predictive accuracy when outcomes are binary. Retrieved from https://osf.io/gu3ap/.
Doroudi, S. (2020). The bias-variance tradeoff: How data science can inform educational debates. AERA Open, 6(4), 2332858420977208.
Eysenck, H.J., & Eysenck, S.B. (1968). Eysenck personality inventory. Journal of Clinical Psychology.
Feuerstahler, L.M. (2020). Metric stability in item response models. Multivariate Behavioral Research, 1–18.
Gilbert, J.B., Kim, J. S., & Miratrix, L.W. (2023). Modeling item-level heterogeneous treatment effects with the explanatory item response model: Leveraging large-scale online assessments to pinpoint the impact of educational interventions. Journal of Educational and Behavioral Statistics, 10769986231171710.
Guttman, L. (1950). The basis for scalogram analysis. Measurement and prediction, 60–90.
Haberman, S. J. (2005). Identifiability of parameters in item response models with unconstrained ability distributions. ETS Research Report Series, 2005(2), i–22.
Haberman, S. J., Sinharay, S., & Lee, Y.-H. (2011). Statistical procedures to evaluate quality of scale anchoring. ETS Research Report Series, 2011(1), i–20.
Han, Y., Zhang, J., Jiang, Z., & Shi, D. (2022). Is the area under curve appropriate for evaluating the fit of psychometric models? Educational and Psychological Measurement, 00131644221098182.
Hanley, J. A., & McNeil, B. J. (1982). The meaning and use of the area under a receiver operating characteristic (roc) curve. Radiology, 143(1), 29–36.
Haslbeck, J., & van Bork, R. (2024). Estimating the number of factors in exploratory factor analysis via out-of-sample prediction errors. Psychological Methods, 29(1), 48–64.
Hofman, J. M., Watts, D. J., Athey, S., Garip, F., Griffiths, T. L., Kleinberg, J., et al. (2021). Integrating explanation and prediction in computational social science. Nature, 595(7866), 181–188.
James, G., Witten, D., Hastie, T., Tibshirani, R., et al. (2013). An introduction to statistical learning (Vol. 112). Springer.
Kang, T., & Cohen, A. S. (2007). Irt model selection methods for dichotomous items. Applied Psychological Measurement, 31(4), 331–358.
Köhler, C., Robitzsch, A., & Hartig, J. (2020). A bias-corrected rmsd item fit statistic: An evaluation and comparison to alternatives. Journal of Educational and Behavioral Statistics, 45(3), 251–273.
Lord, F.M., & Novick, M.R. (1968). Statistical theories of mental test scores. IAP.
Maris, G., & Bechger, T. (2009). On interpreting the model parameters for the three parameter logistic model. Measurement, 7(2), 75–88.
Mavridis, D., Moustaki, I., & Knott, M. (2007). Goodness-of-fit measures for latent variable models for binary data. In Handbook of latent variable and related models (pp. 135–161). Elsevier.
Maydeu-Olivares, A. (2013). Goodness-of-fit assessment of item response theory models. Measurement: Interdisciplinary Research and Perspectives, 11(3), 71–101.
Maydeu-Olivares, A., Cai, L., & Hernández, A. (2011). Comparing the fit of item response theory and factor analysis models. Structural Equation Modeling: A Multidisciplinary Journal, 18(3), 333–356.
Maydeu-Olivares, A., & Garcia-Forero, C. (2010). Goodness-of-fit testing. International encyclopedia of education, 7(1), 190–196.
Maydeu-Olivares, A., & Joe, H. (2005). Limited-and full-information estimation and goodness-of-fit testing in $2^n$ contingency tables: A unified framework. Journal of the American Statistical Association, 100(471), 1009–1020.
McNeish, D., & Wolf, M.G. (2021). Dynamic fit index cutoffs for confirmatory factor analysis models. Psychological Methods.
Rahal, C., Verhagen, M., & Kirk, D. (2022). The rise of machine learning in the academic social sciences. AI & Society.
Reddy, S., Labutov, I., Banerjee, S., & Joachims, T. (2016). Unbounded human learning: Optimal scheduling for spaced repetition. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining (pp. 1815–1824).
Rizopoulos, D. (2006). ltm: An r package for latent variable modelling and item response theory analyses. Journal of Statistical Software, 17(5), 1–25.
Savalei, V., Brace, J., & Fouladi, R.T. (2021, May). We need to change how we compute rmsea for nested model comparisons in structural equation modeling. PsyArXiv. Retrieved from psyarxiv.com/wprg8 https://doi.org/10.31234/osf.io/wprg8.
Savcisens, G., Eliassi-Rad, T., Hansen, L. K., Mortensen, L. H., Lilleholt, L., Rogers, A., & Lehmann, S. (2023). Using sequences of life-events to predict human lives. Nature Computational Science, 1–14.
Schwarz, G. (1978). Estimating the dimension of a model. The annals of statistics, 461–464.
Shmueli, G. (2010). To explain or to predict? Statistical science, 25(3), 289–310.
Sijtsma, K. (2012). Psychological measurement between physics and statistics. Theory & Psychology, 22(6), 786–809.
Spiegelhalter, D. J., Best, N. G., Carlin, B. P., & Van Der Linde, A. (2002). Bayesian measures of model complexity and fit. Journal of the royal statistical society: Series b (statistical methodology), 64(4), 583–639.
Stenhaug, B., & Domingue, B. (2022). Predictive fit metrics for item response models. Applied Psychological Measurement.
Stone, M. (1977). An asymptotic equivalence of choice of model by cross-validation and akaike’s criterion. Journal of the Royal Statistical Society: Series B (Methodological), 39(1), 44–47.
Stout, W. (1987). A nonparametric approach for assessing latent trait unidimensionality. Psychometrika, 52(4), 589–617.
Swaminathan, H., Hambleton, R. K., & Rogers, H. J. (2006). 21 assessing the fit of item response theory models. Handbook of statistics, 26, 683–718.
Van der Linden, W. J. (2017a). Handbook of item response theory: Volume 2: Statistical tools. CRC Press.
Van der Linden, W.J. (2017b). Handbook of item response theory: Volume 3: Applications. CRC press.
Van Maanen, L., Been, P., & Sijtsma, K. (1989). Problem solving strategies and the linear logistic test model. In Mathematical psychology in progress (pp. 267–287). Springer.
Verhagen, M. D. (2022). A pragmatist’s guide to using prediction in the social sciences. Socius, 8, 23780231221081702.
von Davier, M. (2009). Is there need for the 3pl model? guess what? Measurement: Interdisciplinary Research and Perspectives, 27.
Wagenmakers, E.-J., & Farrell, S. (2004). Aic model selection using akaike weights. Psychonomic bulletin & review, 11(1), 192–196.
Wainer, H. (2016). Discussion of david thissen’s bad questions: An essay involving item response theory. Journal of Educational and Behavioral Statistics, 41(1), 100–103.
Watts, D. J. (2014). Common sense and sociological explanations. American Journal of Sociology, 120(2), 313–351.
Watts, D. J. (2017). Should social science be more solution-oriented? Nature Human Behaviour, 1(1), 15.
Watts, D. J., Beck, E. D., Bienenstock, E. J., Bowers, J., Frank, A., Grubesic, A., & Salganik, M. (2018). Explanation, prediction, and causality: Three sides of the same coin?
Wolfram, T., Tropf, F. C., & Rahal, C. (2022, May). Short essays written during childhood predict cognition and educational attainment close to or better than expert assessment. SocArXiv. Retrieved from osf.io/preprints/socarxiv/a8ht9 https://doi.org/10.31235/osf.io/a8ht9.
Wooldridge, J. M. (2013). Introductory econometrics: A modern approach (5th ed.). Cengage Learning.
Wu, M., & Adams, R. J. (2013). Properties of rasch residual fit statistics. Journal of Applied Measurement.
Yarkoni, T., & Westfall, J. (2017). Choosing prediction over explanation in psychology: Lessons from machine learning. Perspectives on Psychological Science, 12(6), 1100–1122.

Figure 1 The cost of misfit: IMV values for the 3PL relative to the 2PL (red) and the 2PL relative to the 1PL (blue). Points are averages across 100 datasets for each configuration of parameters (1000 respondents in all cases); line segments represent span of 0.025 to 0.975 quantiles over the 100 datasets. The “(N,M)” in parentheses are shorthand for the N and M parameter logistic IRT models.

Figure 2 Oracle and Overfit values computed for simulations in Fig. 1. Solid dots represent Oracle values where the number indicates the IRT model (e.g., 3 indicates the 3PL). Hollow dots indicate Overfit values.

Figure 3 Simulations comparing a variety of metrics (in columns; along with RMSE as compared to the true/known probabilities used to generate item responses) for 1/2/3PL estimates (shown as different colors). Data are generated via the 2PL based on different numbers of items (top; 1000 respondents) or different numbers of people (bottom; 50 items). Solid lines indicate comparisons; in the first row, the dashed lines indicate raw RMSEs for the 3 models. Results are based on LOESS smoothing for 1000 choices of the component of sample size being varied (top, items; bottom, people).

Table 1 Approximate expected IMVs for different modeling scenarios.

Figure 4 IMVs computed using different models with 89 empirical datasets. Gray lines are similarly placed in each figure to emphasize comparability across results (and average IMVs for each approach are also shown). Left: IMV(CTT,1PL) as a function of the average response $\bar{x}$ in a given dataset. Right: IMVs contrasting a range of IRT models.

Table 2 Correlations between IMV values (for 1PL, 2PL, 3PL, and 2F-2PL models) and key dataset descriptives (numbers of people and items; $|.5-\bar{x}|$ where $\bar{x}$ is the average response for a dataset) for empirical analyses.
