
Differential Item Functioning via Robust Scaling

Published online by Cambridge University Press:  01 January 2025

Peter F. Halpin*
Affiliation: University of North Carolina at Chapel Hill
*Correspondence should be made to Peter F. Halpin, University of North Carolina at Chapel Hill, 100 E Cameron Ave, Office 1070G, Chapel Hill, NC 27514, USA. Email: [email protected]

Abstract

This paper proposes a method for assessing differential item functioning (DIF) in item response theory (IRT) models. The method does not require pre-specification of anchor items, which is its main virtue. It is developed in two main steps: first, by showing how DIF can be re-formulated as a problem of outlier detection in IRT-based scaling, and then by tackling the latter using methods from robust statistics. The proposal is a redescending M-estimator of IRT scaling parameters that is tuned to flag items with DIF at the desired asymptotic type I error rate. Theoretical results describe the efficiency of the estimator in the absence of DIF and its robustness in the presence of DIF. Simulation studies show that the proposed method compares favorably to currently available approaches for DIF detection, and a real data example illustrates its application in a research context where pre-specification of anchor items is infeasible. The focus of the paper is the two-parameter logistic model in two independent groups, with extensions to other settings considered in the conclusion.
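The abstract summarizes the approach only at a high level. As a rough illustration of the general idea, and not the paper's estimator, the following Python sketch treats DIF detection as outlier detection when aligning item parameters from two groups: a redescending (Tukey bisquare) M-estimator recovers the group difference from item-wise parameter differences, and items with large standardized residuals are flagged. The function names (bisquare_rho, robust_shift, flag_dif), the reduction to a single location shift of item difficulties, the MAD scale, and the Bonferroni-adjusted normal cutoff are simplifying assumptions made for illustration; the paper's estimator is instead tuned to a target asymptotic type I error rate and is developed for the two-parameter logistic model.

```python
# Minimal sketch: DIF as outlier detection in robust scaling of item
# difficulties.  All tuning choices below are illustrative assumptions,
# not the tuning described in the paper.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

def bisquare_rho(r, c=4.685):
    """Tukey bisquare (redescending) loss."""
    r = np.abs(r)
    inside = (c**2 / 6.0) * (1.0 - (1.0 - (r / c) ** 2) ** 3)
    return np.where(r <= c, inside, c**2 / 6.0)

def robust_shift(b_ref, b_foc, c=4.685):
    """Estimate the latent mean difference (impact) by minimizing the
    bisquare loss of the scaled item-wise difficulty differences."""
    d = b_foc - b_ref
    scale = 1.4826 * np.median(np.abs(d - np.median(d)))  # MAD scale
    obj = lambda mu: np.sum(bisquare_rho((d - mu) / scale, c))
    res = minimize_scalar(obj, bounds=(d.min(), d.max()), method="bounded")
    return res.x, scale

def flag_dif(b_ref, b_foc, alpha=0.05):
    """Flag items whose residuals from the robust scaling exceed a
    Bonferroni-adjusted normal-quantile cutoff."""
    mu_hat, scale = robust_shift(b_ref, b_foc)
    z = (b_foc - b_ref - mu_hat) / scale
    cutoff = norm.ppf(1.0 - alpha / (2 * len(z)))
    return np.abs(z) > cutoff, z

# Toy example: 10 items, a true impact of 0.3, DIF added to the last item.
rng = np.random.default_rng(1)
b_ref = rng.normal(0.0, 1.0, 10)
b_foc = b_ref + 0.3 + rng.normal(0.0, 0.05, 10)
b_foc[-1] += 1.0
flags, z = flag_dif(b_ref, b_foc)
print(flags)  # ideally, only the last item is flagged
```

The redescending loss is what allows items with large DIF to contribute essentially nothing to the scaling step, rather than merely being downweighted, so the impact estimate is not dragged toward the contaminated items.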

Type: Theory & Methods
Copyright: © 2024 The Author(s), under exclusive licence to The Psychometric Society


Footnotes

The author would like to thank Dr. Matthias von Davier for helpful comments that improved the proof of Theorem 1.

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
