
The InterModel Vigorish as a Lens for Understanding (and Quantifying) the Value of Item Response Models for Dichotomously Coded Items

Published online by Cambridge University Press:  01 January 2025

Benjamin W. Domingue* (Stanford University)
Klint Kanopka (Stanford University)
Radhika Kapoor (Stanford University)
Steffi Pohl (Freie Universität Berlin)
R. Philip Chalmers (York University)
Charles Rahal (University of Oxford)
Mijke Rhemtulla (University of California, Davis)

*Correspondence should be made to Benjamin W. Domingue, Graduate School of Education, Stanford University, Santa Clara, USA. Email: [email protected]

Abstract

The deployment of statistical models, such as those used in item response theory, necessitates indices that are informative about the degree to which a given model is appropriate for a specific data context. We introduce the InterModel Vigorish (IMV) as an index that quantifies predictive accuracy for models of dichotomous item responses based on the improvement across two sets of predictions (i.e., predictions from two item response models, or predictions from a single such model relative to prediction based on the mean). The index has a range of desirable features: it can be used to compare non-nested models, and its values are highly portable and generalizable. We use this portability to compare predictive performance across a variety of simulated data contexts and to demonstrate qualitative differences in behavior between the IMV and other common indices (e.g., the AIC and RMSEA). We also illustrate the utility of the IMV in empirical applications with data from 89 dichotomous item response datasets; these applications show how the IMV can be used in practice and substantiate our claims regarding various aspects of model performance. The findings indicate that the IMV may be a useful indicator in psychometrics, especially as it allows for easy comparison of predictions across a variety of contexts.
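To make the construction concrete, below is a minimal Python sketch of the IMV for dichotomous outcomes, following the betting formulation in Domingue et al. (2021): each model's predictions are converted into the weight w of an entropy-equivalent biased coin, and the IMV is the expected gain per unit bet from moving from the baseline weight to the enhanced weight. The function names and the toy data are illustrative assumptions, not the authors' published implementation.

```python
import numpy as np
from scipy.optimize import brentq

def coin_weight(p, y):
    """Translate predictions p for binary outcomes y into the weight w of an
    entropy-equivalent biased coin: the w >= 0.5 whose expected log-likelihood,
    w*log(w) + (1 - w)*log(1 - w), matches the mean log-likelihood of p."""
    ll = np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    # A root exists in (0.5, 1) only when the predictions beat a fair coin,
    # i.e., ll > log(0.5); otherwise brentq raises an error.
    return brentq(lambda w: w * np.log(w) + (1 - w) * np.log(1 - w) - ll,
                  0.5 + 1e-12, 1 - 1e-12)

def imv(p_baseline, p_enhanced, y):
    """InterModel Vigorish: expected gain per unit bet from replacing the
    baseline predictions with the enhanced ones, (w1 - w0) / w0."""
    w0 = coin_weight(p_baseline, y)
    w1 = coin_weight(p_enhanced, y)
    return (w1 - w0) / w0

# Toy usage: the baseline predicts the overall proportion correct for every
# response; the "enhanced" vector stands in for, say, fitted 2PL probabilities.
y = np.array([1, 0, 1, 1, 0, 1, 0, 1])
p0 = np.full(len(y), y.mean())
p1 = np.array([0.9, 0.2, 0.8, 0.7, 0.3, 0.85, 0.25, 0.6])
print(imv(p0, p1, y))  # positive: the enhanced predictions add value
```

Because w always lives on the probability scale of a weighted coin, IMV values computed on different datasets share a common interpretation, which is the portability property emphasized above.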

Type: Theory & Methods
Copyright: © 2024 The Author(s), under exclusive licence to The Psychometric Society


Footnotes

Supplementary Information: The online version contains supplementary material available at https://doi.org/10.1007/s11336-024-09977-2.

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

References

Akaike, H. (1973). Maximum likelihood identification of Gaussian autoregressive moving average models. Biometrika, 60(2), 255–265.
Browne, M. W., & Cudeck, R. (1992). Alternative ways of assessing model fit. Sociological Methods & Research, 21(2), 230–258.
Burnham, K. P., & Anderson, D. R. (2004). Multimodel inference: Understanding AIC and BIC in model selection. Sociological Methods & Research, 33(2), 261–304.
Cai, L., Chung, S. W., & Lee, T. (2021). Incremental model fit assessment in the case of categorical data: Tucker–Lewis index for item response theory modeling. Prevention Science, 1–12.
Chalmers, R. P. (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48(6), 1–29.
Craven, P., & Wahba, G. (1978). Smoothing noisy data with spline functions. Numerische Mathematik, 31(4), 377–403.
Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. ERIC.
Domingue, B., & Kanopka, K. (2023). The item response warehouse (IRW). https://osf.io/preprints/psyarxiv/7bd54
Domingue, B., Rahal, C., Faul, J., Freese, J., Kanopka, K., Rigos, A., & Tripathi, A. (2021). InterModel Vigorish (IMV): A novel approach for quantifying predictive accuracy when outcomes are binary. https://osf.io/gu3ap/
Doroudi, S. (2020). The bias-variance tradeoff: How data science can inform educational debates. AERA Open, 6(4), 2332858420977208.
Eysenck, H. J., & Eysenck, S. B. (1968). Eysenck Personality Inventory. Journal of Clinical Psychology.
Feuerstahler, L. M. (2020). Metric stability in item response models. Multivariate Behavioral Research, 1–18.
Gilbert, J. B., Kim, J. S., & Miratrix, L. W. (2023). Modeling item-level heterogeneous treatment effects with the explanatory item response model: Leveraging large-scale online assessments to pinpoint the impact of educational interventions. Journal of Educational and Behavioral Statistics, 10769986231171710.
Guttman, L. (1950). The basis for scalogram analysis. In Measurement and prediction (pp. 60–90).
Haberman, S. J. (2005). Identifiability of parameters in item response models with unconstrained ability distributions. ETS Research Report Series, 2005(2), i–22.
Haberman, S. J., Sinharay, S., & Lee, Y.-H. (2011). Statistical procedures to evaluate quality of scale anchoring. ETS Research Report Series, 2011(1), i–20.
Han, Y., Zhang, J., Jiang, Z., & Shi, D. (2022). Is the area under curve appropriate for evaluating the fit of psychometric models? Educational and Psychological Measurement, 00131644221098182.
Hanley, J. A., & McNeil, B. J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1), 29–36.
Haslbeck, J., & van Bork, R. (2024). Estimating the number of factors in exploratory factor analysis via out-of-sample prediction errors. Psychological Methods, 29(1), 48–64.
Hofman, J. M., Watts, D. J., Athey, S., Garip, F., Griffiths, T. L., Kleinberg, J., et al. (2021). Integrating explanation and prediction in computational social science. Nature, 595(7866), 181–188.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 112). Springer.
Kang, T., & Cohen, A. S. (2007). IRT model selection methods for dichotomous items. Applied Psychological Measurement, 31(4), 331–358.
Köhler, C., Robitzsch, A., & Hartig, J. (2020). A bias-corrected RMSD item fit statistic: An evaluation and comparison to alternatives. Journal of Educational and Behavioral Statistics, 45(3), 251–273.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. IAP.
Maris, G., & Bechger, T. (2009). On interpreting the model parameters for the three parameter logistic model. Measurement, 7(2), 75–88.
Mavridis, D., Moustaki, I., & Knott, M. (2007). Goodness-of-fit measures for latent variable models for binary data. In Handbook of latent variable and related models (pp. 135–161). Elsevier.
Maydeu-Olivares, A. (2013). Goodness-of-fit assessment of item response theory models. Measurement: Interdisciplinary Research and Perspectives, 11(3), 71–101.
Maydeu-Olivares, A., Cai, L., & Hernández, A. (2011). Comparing the fit of item response theory and factor analysis models. Structural Equation Modeling: A Multidisciplinary Journal, 18(3), 333–356.
Maydeu-Olivares, A., & Garcia-Forero, C. (2010). Goodness-of-fit testing. International Encyclopedia of Education, 7(1), 190–196.
Maydeu-Olivares, A., & Joe, H. (2005). Limited- and full-information estimation and goodness-of-fit testing in 2^n contingency tables: A unified framework. Journal of the American Statistical Association, 100(471), 1009–1020.
McNeish, D., & Wolf, M. G. (2021). Dynamic fit index cutoffs for confirmatory factor analysis models. Psychological Methods.
Rahal, C., Verhagen, M., & Kirk, D. (2022). The rise of machine learning in the academic social sciences. AI & Society.
Reddy, S., Labutov, I., Banerjee, S., & Joachims, T. (2016). Unbounded human learning: Optimal scheduling for spaced repetition. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1815–1824).
Rizopoulos, D. (2006). ltm: An R package for latent variable modelling and item response theory analyses. Journal of Statistical Software, 17(5), 1–25.
Savalei, V., Brace, J., & Fouladi, R. T. (2021, May). We need to change how we compute RMSEA for nested model comparisons in structural equation modeling. PsyArXiv. https://doi.org/10.31234/osf.io/wprg8
Savcisens, G., Eliassi-Rad, T., Hansen, L. K., Mortensen, L. H., Lilleholt, L., Rogers, A., & Lehmann, S. (2023). Using sequences of life-events to predict human lives. Nature Computational Science, 1–14.
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2), 461–464.
Shmueli, G. (2010). To explain or to predict? Statistical Science, 25(3), 289–310.
Sijtsma, K. (2012). Psychological measurement between physics and statistics. Theory & Psychology, 22(6), 786–809.
Spiegelhalter, D. J., Best, N. G., Carlin, B. P., & Van Der Linde, A. (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(4), 583–639.
Stenhaug, B., & Domingue, B. (2022). Predictive fit metrics for item response models. Applied Psychological Measurement.
Stone, M. (1977). An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion. Journal of the Royal Statistical Society: Series B (Methodological), 39(1), 44–47.
Stout, W. (1987). A nonparametric approach for assessing latent trait unidimensionality. Psychometrika, 52(4), 589–617.
Swaminathan, H., Hambleton, R. K., & Rogers, H. J. (2006). Assessing the fit of item response theory models. Handbook of Statistics, 26, 683–718.
Van der Linden, W. J. (2017a). Handbook of item response theory: Volume 2: Statistical tools. CRC Press.
Van der Linden, W. J. (2017b). Handbook of item response theory: Volume 3: Applications. CRC Press.
Van Maanen, L., Been, P., & Sijtsma, K. (1989). Problem solving strategies and the linear logistic test model. In Mathematical psychology in progress (pp. 267–287). Springer.
Verhagen, M. D. (2022). A pragmatist's guide to using prediction in the social sciences. Socius, 8, 23780231221081702.
von Davier, M. (2009). Is there need for the 3PL model? Guess what? Measurement: Interdisciplinary Research and Perspectives, 7(2).
Wagenmakers, E.-J., & Farrell, S. (2004). AIC model selection using Akaike weights. Psychonomic Bulletin & Review, 11(1), 192–196.
Wainer, H. (2016). Discussion of David Thissen's "Bad questions: An essay involving item response theory." Journal of Educational and Behavioral Statistics, 41(1), 100–103.
Watts, D. J. (2014). Common sense and sociological explanations. American Journal of Sociology, 120(2), 313–351.
Watts, D. J. (2017). Should social science be more solution-oriented? Nature Human Behaviour, 1(1), 15.
Watts, D. J., Beck, E. D., Bienenstock, E. J., Bowers, J., Frank, A., Grubesic, A., ... Salganik, M. (2018). Explanation, prediction, and causality: Three sides of the same coin?
Wolfram, T., Tropf, F. C., & Rahal, C. (2022, May). Short essays written during childhood predict cognition and educational attainment close to or better than expert assessment. SocArXiv. https://doi.org/10.31235/osf.io/a8ht9
Wooldridge, J. M. (2013). Introductory econometrics: A modern approach (5th ed.). Cengage Learning.
Wu, M., & Adams, R. J. (2013). Properties of Rasch residual fit statistics. Journal of Applied Measurement.
Yarkoni, T., & Westfall, J. (2017). Choosing prediction over explanation in psychology: Lessons from machine learning. Perspectives on Psychological Science, 12(6), 1100–1122.
Supplementary material: Domingue et al. supplementary material (File, 426.2 KB)