
Are Sum Scores a Great Accomplishment of Psychometrics or Intuitive Test Theory?

Published online by Cambridge University Press:  01 January 2025

Robert J. Mislevy*
Affiliation:
University of Maryland at College Park
*
Correspondence should be addressed to Robert J. Mislevy, University of Maryland at College Park, MD, USA. Email: [email protected]

Abstract

Sijtsma, Ellis, and Borsboom (Psychometrika, 89:84–117, 2024, https://doi.org/10.1007/s11336-024-09964-7) provide a thoughtful treatment in Psychometrika of the value and properties of sum scores and classical test theory, at a depth with which few practicing psychometricians are familiar. In this note, I offer comments on their article from the perspective of evidentiary reasoning.

Type
Theory & Methods
Copyright
© 2024 The Author(s), under exclusive licence to The Psychometric Society

The title of this article is a good title but a bad question. It is a false dichotomy. A worse title but more productive question is, “In what sense can sum scores be considered an accomplishment in psychometrics and in what sense can they be considered intuitive test theory?” The following is commentary on evidentiary-reasoning aspects of Sijtsma, Ellis, and Borsboom’s article “Recognize the value of the sum score, psychometrics’ greatest accomplishment” (SEB; 2024). I appreciate their case for the value of sum scores and their discussion of sum scores’ underappreciated properties, yet I remain at peace with Henry Braun and me placing sum scores on a list of test-theory phenomenological primitives (p-prims) in our Phi Delta Kappan article “Intuitive test theory” (ITT; Braun & Mislevy, 2005).

I will explain my rephrased question and its answers by reviewing Andrea diSessa’s (1983, 2002) account of intuitive physics, then discussing how ITT applies his ideas to the ways that most non-psychometricians think about assessment. I then turn to sum scores as a p-prim, their elevation to scientific test theory, and their bifurcated role in current practice.

1. Intuitive Physics

In 1983, psychologist Andrea diSessa introduced p-prims to explain non-experts’ reasoning about physics (diSessa, 1983). Studying beginning college students’ explanations of everyday kinematics situations, such as the path of a child jumping off a merry-go-round, he found most were based on relatively minimal abstractions of simple common phenomena. It was in these terms that the students viewed the world, and to these abstractions that they appealed as self-contained explanations. Familiar examples are "Heavy objects fall faster than light objects," "Things bounce because they are 'springy'," and "Continuing force is needed for continuing motion." The distinguishing feature of p-prims is that they are the bottom line: they are activated by the surface features of situations, and there is no underlying theory connecting them.

Physical p-prims are based on everyday experience. Cannon balls really do fall faster than feathers. Physicists know this, of course, but, when necessary, they can appeal to a deeper level of explanation, to the more sophisticated paradigms of scientific physics. The point, though, is how well p-prims work for guiding everyday action: “You can think you are imparting a substance called impetus to the tennis ball when you throw it for your dog, and the ball flies until the impetus wears off. You estimate how much of this substance you want to impart to the ball, and gauge your throw accordingly—and, by golly, the ball goes where you want it to. Your impetus theory is wrong, but neither you, nor the dog, nor the ball knows this, and the job gets done just fine.” (ITT, p. 491)

Here are some of the properties that diSessa (2002, p. 39) proposed for physics p-prims:

  • Many:... The collection of p-prims exhibits some mild degrees of systematicity, but p-prims are loosely coupled. They do not exhibit relations or any other systematicity typically expected of, for example, theories.

  • Work by recognition: A good candidate model for p-prims’ activation and use is recognition. One simply sees them in some situations and not in others.

  • Feelings of naturalness; judgments of plausibility: The prototypical function accomplished by p-prims is to provide a sense of obviousness and necessity to events.

  • And of particular note for sum scores, Development by reorganization: …[M]any p-prims find useful places in the complex system that is an effective scientific concept. A p-prim might come to be known as an effective special case of a scientific principle, and it will be used in place of the principle in apt circumstances. However, p-prims will no longer function as explanatorily primitive. Physics explanations need articulate accountability that p-prims cannot provide.

2. Intuitive Test Theory

The characteristics of physics p-prims resonated with experiences that Henry and I had had in discussions with policymakers, governing boards, and administrators about unconventional projects involving adaptive testing, multiple-matrix sampling, multi-level inference, and simulation-based assessments. Adding others we had seen over the years, we identified nine p-prims, including “A test measures what it says at the top of the page,” “Any two tests that ‘measure the same thing’ can be made interchangeable with a little ‘equating’ magic,” and, finally, “You score a test by adding up scores for items”—in a word, sum scores.

Sum scores tap into a deep intuition of quantity, and of more good things being better, which to varying extents people share with apes, frogs, and guppies. In his treatise on evidentiary reasoning in jurisprudence, legal scholar John Henry Wigmore (1937) noted that in civil law in medieval Europe, the process of proof rested fundamentally on a numerical system of witnesses swearing oaths to one side or another in a dispute:

It follows, too, since the performance of this act [an oath] is in itself efficacious, that the multiple performances of it [i.e., multiple oath-swearing witnesses], if persons can be obtained who can achieve this, must multiply its probative value proportionately. This numerical conception is inherent in the general formalism of it. …that is, a degree of greater certainty is thought to be attained, not by analyzing the significance of each oath in itself and relatively to the person, but by increasing the number of the oaths (p. 88)

Decades of classroom quizzes, certification examinations, and the drivers’ license test at Motor Vehicles render it a familiar experience to take tests consisting of multiple items and receive a report based on a total score. A higher score is generally better, albeit with grumbling when items do not seem to represent the targeted capabilities. That is what most people know of test theory, and what they know suffices for most of their purposes.

Hence, sum scores can be considered a phenomenological primitive in intuitive test theory, good enough for drivers taking tests at Motor Vehicles. Not so, perhaps, for creating those tests, or, even more so, for developing assessments with unfamiliar forms, purposes, contexts, and technologies. Nor for applications in other behavioral sciences, whether in research or for consequential decisions about individuals. Yet scientists and practitioners, sophisticated as they may be in their own areas of expertise (in some fields a majority of them), gather and interpret assessment information using sum scores at the level of ITT, to the detriment of their applications (McNeish, 2024).

3. Scientific Test Theory

Wigmore went on to recount the gradual emergence of attention to the credibility of witnesses and the contents of their testimonies. This does not eliminate the salience or potential usefulness of raw numbers of witnesses, but it does place this factor within a larger, more coherent, and more scientific (if still developing) body of theory and practice of evidentiary reasoning in jurisprudence. Among the topics with points of contact with test theory are the reliability and credibility of items and of masses of evidence; the necessity of establishing the relevance of items of evidence, individually and in relation to underlying narratives; evidentiary relationships such as chains of reasoning, conditional dependence, and conjunctions and disjunctions; and Bayesian inference networks to make sense of disparate evidence that may arise from virtually any human activity (Kadane & Schum, 1996; Schum, 2001).

The parallels with test theory are striking, particularly with respect to the roles of sum scores. An early move beyond sum scores per se is classical test theory (CTT), presaged in Edgeworth (1888, 1890) and Spearman (1904a, 1904b). These are signal events from an evidentiary-reasoning perspective, extending attention from evidence to evidence about evidence. I consider this the conceptual leap into scientific test theory. Sum scores and readers’ ratings may indeed provide pertinent information about a phenomenon of interest, gathered sometimes by thoughtful means and sometimes not, grounded sometimes in deep understanding of the phenomenon and sometimes not. CTT provides a way to conceive of, then investigate, certain properties of that evidence, notably reliability, and as such both informs practical applications and guards them against overinterpretation.
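For readers who think in code, the distinction between the two levels can be sketched minimally in Python, using made-up dichotomous responses rather than any data from SEB: the sum score itself is the evidence, and coefficient alpha is one familiar piece of CTT-style evidence about that evidence, a lower bound on the reliability of the sum score.

```python
import numpy as np

def sum_scores(X):
    """Sum scores: one total per respondent (row sums of the item matrix)."""
    return np.asarray(X).sum(axis=1)

def coefficient_alpha(X):
    """Cronbach's alpha: a CTT lower bound on the reliability of the sum score."""
    X = np.asarray(X, dtype=float)
    k = X.shape[1]                          # number of items
    item_vars = X.var(axis=0, ddof=1)       # variance of each item
    total_var = X.sum(axis=1).var(ddof=1)   # variance of the sum score
    return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

# Hypothetical item responses: 5 respondents x 4 dichotomous items
X = np.array([[1, 1, 1, 0],
              [1, 0, 1, 1],
              [0, 0, 1, 0],
              [1, 1, 1, 1],
              [0, 0, 0, 0]])

print(sum_scores(X))          # [3 3 1 4 0] -- the familiar total scores
print(coefficient_alpha(X))   # an estimate of how reliable those totals are
```

Nothing in the computation of alpha requires an IRT or factor model; it speaks to the noise in the totals, not to what the totals measure.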

As SEB note, CTT is not so much a measurement model, in the sense of telling us the substantive meaning of measuring an attribute, as it is a noise model. Latent variable (LV) models with semantic content drawn from substantive educational and psychological concepts, such as factor analysis (FA) and item response theory (IRT), developed later, arguably building on CTT concepts. CTT can be derived in those frames, but it neither depends on them nor is limited by their assumptions. Parts of SEB bring out such relations. Further, the initial work reported in SEB relating sum scores to network psychometrics demonstrates how sum scores can be useful indicators of salient features within substantive semantic models quite other than LV models. Other parts of SEB push forward the understanding of sum-score and CTT properties in mathematical and statistical terms, showing how much one can gain with how few assumptions.
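One familiar point of contact between sum scores and LV models, stated here only as a standard textbook result rather than as part of SEB’s own development, is that under the Rasch IRT model the likelihood depends on the latent trait only through the sum score:

\[
P(\mathbf{x} \mid \theta)
= \prod_{i=1}^{k} \frac{\exp\{x_i(\theta - b_i)\}}{1 + \exp(\theta - b_i)}
= \frac{\exp\!\Bigl(\theta \sum_{i=1}^{k} x_i\Bigr)\,\exp\!\Bigl(-\sum_{i=1}^{k} x_i b_i\Bigr)}{\prod_{i=1}^{k}\bigl[1 + \exp(\theta - b_i)\bigr]},
\]

where the \(x_i \in \{0,1\}\) are item responses and the \(b_i\) are item difficulties. The only factor involving both \(\theta\) and the data is \(\exp(\theta \sum_i x_i)\), so the sum score is a sufficient statistic for \(\theta\) and examinees with the same total receive the same latent-variable estimate.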

Taken together, a significant practical benefit is to help us discern between situations in which sum scores offer robust indicators of substantive variables in a more complex model (e.g., LV estimates in some counseling surveys) and situations in which a more complex model is instead needed because sum scores do not express the targeted substantive patterns (e.g., meaningfully multidimensional categorical FA models). In light of the spare assumptions required for sum-score properties, even when more complex models are needed in analysis one need not “prefer a less well interpretable latent-variable estimate to a true-score estimate—the sum score—that researchers and laypeople experience as less alienating and better suited for communication of test results” (SEB, p. 89). (See Bock, Thissen, & Zimowski, 1997, and Mislevy, 2003, for arguments for doing so in large-scale educational assessments.)
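To make that distinction concrete, here is a toy simulation, a sketch under assumed loadings and dimensions rather than an analysis from SEB or the surveys cited above: when ten dichotomous items reflect a single underlying variable, their sum score tracks it closely; when the same ten items split across two weakly correlated dimensions, a single sum score cannot recover either dimension nearly as well.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000  # simulated respondents

# Scenario 1: essentially unidimensional -- ten items, one underlying variable.
theta = rng.normal(size=n)
X1 = (0.8 * theta[:, None] + rng.normal(scale=0.6, size=(n, 10)) > 0).astype(int)

# Scenario 2: two distinct dimensions (five items each), correlated 0.2.
f = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.2], [0.2, 1.0]], size=n)
loadings = np.zeros((10, 2))
loadings[:5, 0] = 0.8   # items 1-5 load on dimension 1
loadings[5:, 1] = 0.8   # items 6-10 load on dimension 2
X2 = (f @ loadings.T + rng.normal(scale=0.6, size=(n, 10)) > 0).astype(int)

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

print("Unidimensional: corr(sum, theta) =", round(corr(X1.sum(axis=1), theta), 2))
print("Two dimensions: corr(sum, dim 1) =", round(corr(X2.sum(axis=1), f[:, 0]), 2))
print("Two dimensions: corr(sum, dim 2) =", round(corr(X2.sum(axis=1), f[:, 1]), 2))
```

In the first scenario the total behaves as a robust, easily communicated indicator; in the second, a model that represents both dimensions (or separate subscale sums) is needed to express the targeted patterns.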

4. Conclusion

I will conclude with comments on two quotes from SEB that summarize my views. Firstly,

We conclude that ‘You Score a Test by Adding up Scores for Items’ may have been intuitive at the start of psychometrics more than a century ago but also note that, if based on intuition, it proved a highly fortunate hunch that has been substantiated through a century of psychometric theory formation (SEB, p. 93).

I agree. Sum scores arose as intuition, but contributions beginning more than a century ago and continuing today have extended the notion into a rich body of theory, with connections to other families of psychometric models and elucidations of mathematical and statistical properties. These extensions into a scientific paradigm can guide practitioners on when and how to use sum scores gainfully, to recognize their relations with models that are more complex or have connections to substantive theory, and to understand productive and appropriate uses of sum scores in applications.

Criticisms of sum scores based on the idea that they are intuitive rather than scientific are premature and arguably incorrect. (SEB, p. 106; emphasis added)

Yes, in that “the idea that [sum scores] are intuitive rather than scientific” is a false dichotomy. An apt riposte to a criticism based on this flawed idea might be: “It is not sum scores per se that are scientific or intuitive. It is a person’s use and conception of sum scores that might be either scientific, as grounded in the kinds of connections explored in SEB, or grounded in little more than intuition alone. Let us examine the circumstances.”

Conflict of interest

The author has no conflict of interest to disclose.

Footnotes

Publisher's Note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

References

Bock, R. D., Thissen, D., & Zimowski, M. F. (1997). IRT estimation of domain scores. Journal of Educational Measurement, 34(3), 197–211. https://www.jstor.org/stable/1435442
Braun, H. I., & Mislevy, R. J. (2005). Intuitive test theory. Phi Delta Kappan, 86(7), 488–497.
diSessa, A. A. (1983). Phenomenology and the evolution of intuition. In D. Gentner & A. L. Stevens (Eds.), Mental models (pp. 15–33). New York: Psychology Press.
diSessa, A. A. (2002). Why “conceptual ecology” is a good idea. In M. Limón & L. Mason (Eds.), Reconsidering conceptual change: Issues in theory and practice (pp. 28–60). Dordrecht: Springer Netherlands. https://doi.org/10.1007/0-306-47637-1_2
Edgeworth, F. Y. (1888). The statistics of examinations. Journal of the Royal Statistical Society, 51, 599–635.
Edgeworth, F. Y. (1890). The element of chance in competitive examinations. Journal of the Royal Statistical Society, 53, 460–475, 644–663.
Kadane, J. B., & Schum, D. A. (1996). A probabilistic analysis of the Sacco and Vanzetti evidence. New York: Wiley.
McNeish, D. (2024). Practical implications of sum scores being psychometrics’ greatest accomplishment. Psychometrika.
Mislevy, R. J. (2003). Evidentiary relationships among data-gathering methods and reporting scales in surveys of educational achievement (CSE Technical Report #595). Los Angeles: Center for the Study of Evaluation, University of California, Los Angeles. https://files.eric.ed.gov/fulltext/ED480556.pdf
Schum, D. A. (2001). The evidential foundations of probabilistic reasoning. Northwestern University Press.
Sijtsma, K., Ellis, J. L., & Borsboom, D. (2024). Recognize the value of the sum score, psychometrics’ greatest accomplishment. Psychometrika, 89, 84–117. https://doi.org/10.1007/s11336-024-09964-7
Spearman, C. (1904a). “General intelligence,” objectively determined and measured. American Journal of Psychology, 15, 201–292.
Spearman, C. (1904b). The proof and measurement of association between two things. American Journal of Psychology, 15, 72–101.
Wigmore, J. H. (1937). The science of judicial proof (3rd ed.). Boston: Little, Brown & Co.