Observed-Score Equating: An Overview

Published online by Cambridge University Press: 01 January 2025

Alina A. von Davier*
Affiliation: Educational Testing Service
* Requests for reprints should be sent to Alina A. von Davier, Educational Testing Service, Princeton, NJ, USA. E-mail: [email protected]

Abstract

This paper provides an overview of the observed-score equating (OSE) process from the perspective of a unifying equating framework (von Davier, 2011b). The framework includes all OSE approaches. The paper discusses issues related to the test, common items, and sampling designs, and their relationship to measurement and equating. It also presents challenges to the equating process, model assumptions, and approaches to equating evaluation. The equating process is illustrated step by step with a real data example from a licensure test.

Type
Original Paper
Copyright
Copyright © 2013 The Psychometric Society

References

Berger, M.P.F. (1997). Optimal designs for latent variable models: a review. In Rost, J., & Langeheine, R. (Eds.), Application of latent trait and latent class models in the social sciences (pp. 71–79). Muenster: Waxmann.
Bishop, Y.M.M., Fienberg, S.E., & Holland, P.W. (1975). Discrete multivariate analysis: theory and practice. Cambridge: MIT Press.
Bozdogan, H. (1987). Model selection and Akaike’s information criterion (AIC): the general theory and its analytical extensions. Psychometrika, 52(3), 345–370.
Braun, H.I., & Holland, P.W. (1982). Observed-score test equating: a mathematical analysis of some ETS equating procedures. In Holland, P.W., & Rubin, D.B. (Eds.), Test equating (pp. 9–49). New York: Academic Press.
Chamberlain, T.C. (1899). On Lord Kelvin’s address on the age of the earth as an abode fitted for life. Annual Report of the Smithsonian Institution, 1899, 223–246.
Chen, H., Livingston, S., & Holland, P.W. (2011). Generalized equating functions for NEAT designs. In von Davier, A.A. (Ed.), Statistical models for test equating, scaling, and linking (pp. 185–200). New York: Springer.
Chen, H., Yan, D., Han, N., & von Davier, A.A. (2006). LOGLIN/KE user guide: version 2.1. Princeton: ETS.
Cook, L.L., Eignor, D.R., & Taft, H.L. (1988). A comparative study of the effects of instruction on the stability of IRT and conventional item parameter statistics. Journal of Educational Measurement, 25(1), 31–45.
Dorans, N.J. (2002). Recentering and realigning the SAT score distributions: how and why. Journal of Educational Measurement, 39(1), 59–84.
Dorans, N.J., & Feigenbaum, M.D. (1994). Equating issues engendered by changes to the SAT and PSAT/NMSQT (ETS Research Memorandum No. RM-94-10). Princeton: ETS.
Dorans, N.J., & Holland, P.W. (2000). Population invariance and equitability of tests: basic theory and the linear case. Journal of Educational Measurement, 37(4), 281–306.
Dorans, N.J., Moses, T., & Eignor, D. (2011). Equating test scores: towards best practices. In von Davier, A.A. (Ed.), Statistical models for test equating, scaling, and linking (pp. 21–42). New York: Springer.
Dorans, N.J., Pommerich, M., & Holland, P.W. (Eds.) (2007). Linking and aligning scores and scales. New York: Springer.
Duong, M.Q., & von Davier, A.A. (2012). Observed-score equating with a heterogeneous target population. International Journal of Testing, 12(3), 224–251.
ETS (2011). LOGLIN/KE software version 2 [Computer software]. Princeton: ETS.
Feuer, M.J., Holland, P.W., Green, B.F., Bertenthal, M.W., & Hemphill, F.C. (Eds.) (1999). Uncommon measures: equivalence and linkage among educational tests (Report of the Committee on Equivalency and Linkage of Educational Tests, National Research Council). Washington: National Academy Press.
Haberman, S.J. (2011). Using exponential families for equating. In von Davier, A.A. (Ed.), Statistical models for test equating, scaling, and linking (pp. 125–140). New York: Springer.
Holland, P.W., & Dorans, N.J. (2006). Linking and equating. In Brennan, R.L. (Ed.), Educational measurement (4th ed., pp. 189–220). Westport: Praeger.
Holland, P.W., & Hoskens, M. (2003). Classical test theory as a first-order item response theory: application to true-score prediction from a possibly nonparallel test. Psychometrika, 68, 123–149.
Holland, P.W., & Thayer, D.T. (1987). Notes on the use of log-linear models for fitting discrete probability distributions (ETS Research Rep. No. RR-87-31). Princeton: ETS.
Holland, P.W., & Thayer, D.T. (1989). The kernel method of equating score distributions (ETS Research Rep. No. 89-07). Princeton: ETS.
Holland, P.W., & Thayer, D.T. (2000). Univariate and bivariate loglinear models for discrete test score distributions. Journal of Educational and Behavioral Statistics, 25, 133–183.
Holland, P.W., & Wainer, H. (1993). Differential item functioning. Hillsdale: Erlbaum.
Jiang, Y., von Davier, A.A., & Chen, H. (2012). Evaluating equating results: percent relative error for chained kernel equating. Journal of Educational Measurement, 49, 39–58.
Karabatsos, G., & Walker, S.G. (2011). A Bayesian nonparametric model for test equating. In von Davier, A.A. (Ed.), Statistical models for test equating, scaling, and linking (pp. 175–185). New York: Springer.
Kendall, M.G., & Stuart, A. (1977). The advanced theory of statistics (4th ed.). New York: Macmillan.
Kolen, M.J., & Brennan, R.L. (2004). Test equating: methods and practices (2nd ed.). New York: Springer.
Lee, Y., & von Davier, A.A. (2013, in press). Monitoring scale scores over time via quality control charts, model-based approaches, and time series techniques. Psychometrika.
Lee, Y.-H., & von Davier, A.A. (2011). Equating through alternative kernels. In von Davier, A.A. (Ed.), Statistical models for test equating, scaling, and linking (pp. 159–173). New York: Springer.
Li, D., Jiang, Y., & von Davier, A.A. (2012). The accuracy and consistency of a series of IRT true score equating. Journal of Educational Measurement, 49, 167–189.
Li, D., Li, S., & von Davier, A.A. (2011). Applying time-series analysis to detect scale drift. In von Davier, A.A. (Ed.), Statistical models for test equating, scaling, and linking (pp. 381–398). New York: Springer.
Liang, L., Dorans, N.J., & Sinharay, S. (2009). First language of examinees and its relationship to equating (ETS Research Rep. No. RR-09-05). Princeton: ETS.
Livingston, S.A. (2004). Equating test scores (without IRT). Princeton: Educational Testing Service.
Lord, F.M., & Novick, M.R. (1968). Statistical theories of mental test scores. Reading: Addison Wesley.
Morris, C.N. (1982). On the foundations of test equating. In Holland, P.W., & Rubin, D.B. (Eds.), Test equating (pp. 9–49). New York: Academic Press.
Moses, T., & Holland, P.W. (2008). The influence of strategies for selecting loglinear smoothing models on equating functions (ETS Research Rep. No. RR-08-25). Princeton: ETS.
Moses, T., & von Davier, A.A. (2011). A SAS IML macro for loglinear smoothing. Applied Psychological Measurement, 35(3), 250–251.
Qian, J., von Davier, A.A., & Jiang, Y. (2013, submitted). Achieving a stable scale for an assessment with multiple forms: weighting test samples in IRT linking. In New developments in quantitative psychology: proceedings of the 77th international meeting of the Psychometric Society. New York: Springer.
Rao, C.R. (1973). Linear statistical inference and its applications (2nd ed.). New York: Wiley.
Rubin, D. (1982). Discussion of “Observed-score test equating: a mathematical analysis of some ETS equating procedures”. In Holland, P.W., & Rubin, D.B. (Eds.), Test equating (pp. 51–54). New York: Academic Press.
Sinharay, S., Haberman, S., Holland, P., & Lewis, C. (2012). A note on the choice of an anchor test in equating (ETS Research Rep. No. RR-12-14). Princeton: ETS.
Sinharay, S., & Holland, P.W. (2010). The missing data assumption of the NEAT design and their implications for test equating. Psychometrika, 75, 309–327.
Sinharay, S., Holland, P.W., & von Davier, A.A. (2011). Evaluating the missing data assumptions of the chain and poststratification equating methods. In von Davier, A.A. (Ed.), Statistical models for test equating, scaling, and linking (pp. 381–398). New York: Springer.
van der Linden, W.J. (2000). A test-theoretic approach to observed-score equating. Psychometrika, 65, 437–456.
van der Linden, W.J. (2011). Local observed-score equating. In von Davier, A.A. (Ed.), Statistical models for test equating, scaling, and linking (pp. 201–223). New York: Springer.
von Davier, A.A. (2007). Potential solution to practical equating issues. In Dorans, N.J., Pommerich, M., & Holland, P.W. (Eds.), Linking and aligning scores and scales. New York: Springer.
von Davier, A.A. (2010). Test equating for observed-scores: the percentile rank, Gaussian kernel, and IRT observed-score equating methods. Workshop presented at international meeting of the Psychometric Society, Athens, GA.
von Davier, A.A. (2011a). Statistical models for test equating, scaling, and linking. New York: Springer.
von Davier, A.A. (2011b). A statistical perspective on equating test scores. In von Davier, A.A. (Ed.), Statistical models for test equating, scaling, and linking (pp. 1–17). New York: Springer.
von Davier, A.A. (2011c). Quality control and data mining techniques applied to monitoring scaled scores. In Pechenizkiy, M., Calders, T., Conati, C., Ventura, S., Romero, C., & Stamper, J. (Eds.), Proceedings of the 4th international conference on educational data mining, Eindhoven, July 6–8, 2011. Eindhoven: Eindhoven University of Technology Library.
von Davier, A.A. (2012). Validity issues in international standardized assessments and implications for the test and sampling designs. Paper presented at educational assessment, accountability and equity: conversations on validity around the world, New York, NY, March 2012.
von Davier, A.A., Fournier-Zajac, S., & Holland, P.W. (2007). An equipercentile version of the Levine linear observed-score equating function using the methods of kernel equating (ETS Research Rep. No. RR-07-14). Princeton: ETS.
von Davier, A.A., Holland, P.W., & Thayer, D.T. (2004). The chain and poststratification methods for observed-score equating: their relationship to population invariance. Journal of Educational Measurement, 41(1), 15–32.
von Davier, A.A., Holland, P.W., & Thayer, D.T. (2004). The kernel method of test equating. New York: Springer.
von Davier, A.A., & Kong, N. (2005). A unified approach to linear equating for the non-equivalent groups design. Journal of Educational and Behavioral Statistics, 30(3), 313–334.
von Davier, A.A., & Wilson, C. (2007). IRT true-score test equating: a guide through assumptions and applications. Journal of Educational and Psychological Measurement, 67(6), 940–957.
Wang, T. (2011). An alternative continuization method: the continuized log-linear method. In von Davier, A.A. (Ed.), Statistical models for test equating, scaling, and linking (pp. 141–158). New York: Springer.
Wiberg, M., van der Linden, W.J., & von Davier, A.A. (2012). Local observed-score kernel equating. Paper presented at National Council of Measurement in Education, Vancouver.
Zumbo, B.D. (2007). Validity: foundational issues and statistical methodology. In Rao, C.R., & Sinharay, S. (Eds.), Handbook of statistics: Vol. 26 Psychometrics (pp. 45–79). The Netherlands: Elsevier.