Identifying and Minimizing Measurement Invariance among Intersectional Groups: The Alignment Method Applied to Multi-category Items

Rachel A. Gordon; Tianxiu Wang; Hai Nguyen; Ariel M. Aloe

doi:10.1017/9781009357784

Series: Elements in Research Methods for Developmental Science

Published online by Cambridge University Press: 23 June 2023

Affiliations

Rachel A. Gordon: Northern Illinois University
Tianxiu Wang: University of Pittsburgh
Hai Nguyen: University of Illinois, Chicago
Ariel M. Aloe: University of Iowa

Summary

This Element demonstrates how and why the alignment method can advance measurement fairness in developmental science. It explains the method's application to multi-category items in an accessible way, offering sample code and demonstrating an R package that facilitates interpretation of such items' multiple thresholds. Using an example of intersectional groups defined by assigned sex and race/ethnicity, it shows what happens to group mean differences when items are treated as continuous and differences in the thresholds between categories are therefore ignored. It also demonstrates how to interpret item-level partial non-invariance results, traces their implications for group-level differences, and encourages substantive theorizing about measurement fairness.
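
The point about ignored thresholds can be illustrated with a minimal sketch. This is not the Element's sample code or its R package; the function name and threshold values are hypothetical, and the sketch assumes a normally distributed latent response cut at ordered thresholds. Two groups share the same latent distribution, yet their expected integer-coded item scores differ because their thresholds differ.

```python
from statistics import NormalDist


def expected_item_score(thresholds, latent_mean=0.0, latent_sd=1.0):
    """Expected value of an ordinal item coded 0..K.

    Assumes (for illustration only) that the observed category is produced
    by cutting a normal latent response at the given ordered thresholds.
    """
    dist = NormalDist(latent_mean, latent_sd)
    # With 0-based integer coding, E[Y] is the sum over thresholds of
    # P(latent response > threshold).
    return sum(1.0 - dist.cdf(tau) for tau in thresholds)


# Hypothetical thresholds for a three-category item (coded 0, 1, 2).
thresholds_group_a = [-0.5, 1.0]
thresholds_group_b = [0.0, 1.5]  # higher thresholds, same latent distribution

score_a = expected_item_score(thresholds_group_a)
score_b = expected_item_score(thresholds_group_b)
gap = score_a - score_b

print(f"Group A expected score: {score_a:.3f}")
print(f"Group B expected score: {score_b:.3f}")
print(f"Gap with identical latent means: {gap:.3f}")
```

Under these assumptions, the gap in observed means arises entirely from the threshold differences, which is the kind of artifact that treating the item as continuous would attribute to a true group difference.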

Keywords

measurement fairness; measurement invariance; alignment method; intersectionality; children’s behavior; socio-emotional development

Type: Element
Online ISBN: 9781009357784

Publisher: Cambridge University Press

Print publication: 06 July 2023
