Introduction
The need for efficient and scalable approaches for identifying individuals at risk for preclinical and prodromal Alzheimer’s disease (PAD) is paramount to ongoing clinical trial efforts, emerging decentralized trials, and for identifying individuals who will most benefit from currently available pharmacologic or behavioral treatments, or those on the horizon (Cummings et al., 2021; Dorsey et al., 2020). Self-administered cognitive measures that can be completed “remotely” (i.e., outside of a typical clinical setting, including at home) are a critical component of an early PAD detection strategy since they require fewer resources to administer and provide easier access to cognitive screening compared to person-administered measures (Ashford et al., 2021; Papp et al., 2021; Sabbagh et al., 2020); for a broader review of digital cognitive assessment for preclinical AD, see Ohman et al. (2021). Frequently, tests originally designed for and validated within clinic settings are converted to remote use to increase access for those unable to readily visit research centers, or out of necessity during the COVID-19 pandemic (Bauer & Bilder, 2023, in press; Mackin et al., 2021; Marra et al., 2020). The limitation of this “conversion” approach is that tests are not developed specifically with remote self-administration as a priority for test design decisions, which can contribute to mixed findings when performance is compared across settings (Cromer et al., 2015; Mielke et al., 2015; Stricker, Lundt, Alden, et al., 2020). There is an urgent need for valid self-administered cognitive assessment tools designed specifically for remote use. Verbal memory measures are among the most sensitive to early changes in the Alzheimer’s disease (AD) process (Caselli et al., 2020) but are also challenging to adapt to remote, self-administered methods (Bauer & Bilder, in press).
The Stricker Learning Span (SLS) is a digital computer-adaptive word list memory test specifically designed for remote assessment (Stricker et al., 2022). The SLS is administered via Mayo Test Drive (MTD): Mayo Test Development through Rapid Iteration, Validation and Expansion (DRIVE), a web-based multi-device platform designed for unsupervised self-administration of digital cognitive tests (Stricker et al., 2022). Recent work has highlighted learning as a key deficit in PAD, conceptualized as a failure to benefit from repeated exposure (Lim et al., 2020) or a lack of benefit from practice (Duff et al., 2017; Machulda et al., 2017). In line with this, the SLS was designed to emphasize learning. The SLS paradigm was influenced by cognitive science principles and neural network process simulations (Stricker et al., 2022). The SLS stresses the contextual system during learning through use of high-frequency word stimuli and variations in word item-level imagery to increase difficulty.
Preliminary support for the feasibility, reliability, and validity of the SLS was previously reported in an all-female older adult sample (Stricker et al., 2022). Whereas that prior study used traditional approaches to test validation, the current study aimed to establish test validity using a novel approach to avoid the inherent issues with existing validation approaches. For example, one common approach is to correlate a new test with existing cognitive tests. However, existing tests, while well established, are themselves imperfect measures of hypothesized underlying constructs (Bilder & Reise, 2019). Another frequent approach is to establish validity by examining the ability of a new test to differentiate clinically defined groups. In the AD field, for example, it is common to establish the clinical validity of a new test by comparing individuals who are cognitively “normal” or unimpaired to individuals with mild cognitive impairment (MCI) or dementia; however, this introduces circularity because the use of cognitive tests is central to establishing those syndromal classifications. In vivo biomarkers offer an alternative ground truth for test validation studies that is completely independent of cognitive test performance. This is akin to validation with neuropathological diagnosis at autopsy given the correspondence between antemortem PET imaging and autopsy findings but has the notable benefit of being feasible during life (Chiotis et al., 2017; Wolters et al., 2021). A research framework is now available to use AD biomarkers to characterize participants using the amyloid (A), tau (T) and neurodegeneration (N), or the AT(N), system (Jack et al., 2018). Imaging biomarkers of N are considered nonspecific to AD and will not be included in the current manuscript to limit the number of subgroups. Individuals with evidence of elevated amyloid (A+) are considered to show Alzheimer’s pathologic change. An in vivo biological diagnosis of AD is defined by the presence of both A+ and elevated tau (T+).
The objective of this study was to determine the criterion validity of the SLS. Critically, this validation study was limited to unsupervised completion of the SLS in a remote environment outside of a typical clinical research setting. Our primary study hypothesis (Aim 1) was that remotely administered SLS and in-person-administered Rey’s Auditory Verbal Learning Test (AVLT) would differentiate AD biomarker-defined groups similarly. This hypothesis is tested in groups defined by biomarker status alone (A+ vs A− and A+T+ vs A−T−) to avoid circularity. That is, because the AVLT is considered for diagnostic decision-making as part of the consensus diagnosis process for study participants, it is important that the primary AVLT vs. SLS comparison is independent of diagnosis. Secondary hypotheses included that the SLS would be sensitive to preclinical AD in analyses limited to CU participants (Aim 2), SLS and AVLT would show significant correlations to support convergent validity (Aim 3), and that word list learning vs. delay indices would show similar sensitivity to biologically defined AD (A−T− vs A+T+; Aim 4).
Methods
Most participants were recruited from the Mayo Clinic Study of Aging (MCSA), a longitudinal population-based study of aging among Olmsted County, Minnesota, residents. Participants are randomly sampled by age- and sex-stratified groups using the resources of the Rochester Epidemiology Project medical records-linkage system, which links the medical records from all county providers (St Sauver et al., 2012). Participants with dementia are not eligible for MCSA enrollment. Participants complete study visits every 15 months that include a physician exam, study coordinator interview, and neuropsychological testing (Roberts et al., 2008). The physician exam includes a medical history review, complete neurological exam, and the Short Test of Mental Status (STMS) (Kokmen et al., 1991). The study coordinator interview with an informant includes the Clinical Dementia Rating® scale (Morris, 1993). Participants complete a multi-domain battery of nine neuropsychological tests administered by a psychometrist (Roberts et al., 2008). The interviewing study coordinator, examining physician, and neuropsychologist initially each make an independent diagnostic determination. A final diagnosis of cognitively unimpaired, MCI (Petersen, 2004) or dementia (American Psychiatric Association, 1994) is then established through consensus agreement (Petersen, 2004; Roberts et al., 2008). The diagnostic evaluation does not consider prior clinical information, prior diagnoses, SLS performance, or knowledge of biomarker status. Further details about the MCSA study protocol are available (Roberts et al., 2008).
To enrich the sample for participants with cognitive impairment, additional participants were recruited from the Mayo Alzheimer’s Disease Research Center (ADRC).
This study was completed in accordance with the Helsinki Declaration. The study protocols were approved by the Mayo Clinic and Olmsted Medical Center Institutional Review Boards. All participants provided written informed consent for the primary study protocols (MCSA, ADRC); oral consent (which includes consent provided after reading informed consent elements sent in an email or described verbally) was obtained for the ancillary Mayo Test Drive study protocol approved by Mayo Clinic that covered collection of remote cognitive assessment data. No compensation was provided for participation in the ancillary study.
In vivo neuroimaging markers of amyloid and tau
The most recent imaging available within ±3 years of the baseline SLS was used. Amyloid and tau positivity is determined using Pittsburgh Compound B PET (PiB-PET) and tau PET (flortaucipir) (Jack et al., 2008; Jack et al., 2017; Vemuri et al., 2017). PET images are acquired using a GE Discovery RX or DXT PET/CT scanner. A global cortical PiB PET standard uptake value ratio (SUVR) is computed by calculating the median uptake over voxels in the prefrontal, orbitofrontal, parietal, temporal, anterior cingulate, and posterior cingulate/precuneus regions of interest (ROIs) for each participant and dividing this by the median uptake over voxels in the cerebellar crus gray matter. For tau PET, we utilize median uptake over the voxels in the meta-ROI consisting of entorhinal, amygdala, parahippocampal, fusiform, inferior temporal, and middle temporal ROIs normalized to the cerebellar crus gray matter (Jack et al., 2017). Cutoffs to determine amyloid and tau positivity are SUVR ≥1.48 (centiloid 22) (Klunk et al., 2015) and ≥1.25 (Jack et al., 2017), respectively, to maintain consistency with our past Cogstate-focused work (Alden et al., 2021; Pudumjee et al., 2021; Stricker, Lundt, Albertson, et al., 2020).
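To make the score construction concrete, the sketch below illustrates how a global SUVR and the positivity calls could be computed once voxel-level uptake values have been extracted from the PET images; the values and variable names are illustrative placeholders, not the study’s actual processing pipeline.

```r
# Minimal sketch (R): global PiB-PET SUVR and amyloid/tau positivity.
# `voxels` holds made-up voxel-level uptake values per ROI; real values come from the PET pipeline.
set.seed(1)
voxels <- list(
  prefrontal = rnorm(100, 1.6, 0.1), orbitofrontal = rnorm(100, 1.6, 0.1),
  parietal = rnorm(100, 1.7, 0.1), temporal = rnorm(100, 1.65, 0.1),
  anterior_cingulate = rnorm(100, 1.6, 0.1),
  posterior_cingulate_precuneus = rnorm(100, 1.7, 0.1),
  cerebellar_crus_gray = rnorm(100, 1.1, 0.05)
)
cortical <- c("prefrontal", "orbitofrontal", "parietal", "temporal",
              "anterior_cingulate", "posterior_cingulate_precuneus")

# Median uptake over all voxels in the cortical ROIs, normalized to cerebellar crus gray matter
pib_suvr <- median(unlist(voxels[cortical])) / median(voxels$cerebellar_crus_gray)

amyloid_positive <- pib_suvr >= 1.48   # A+ cutoff (SUVR 1.48, centiloid 22)
# Tau-PET SUVR is computed analogously over the temporal meta-ROI (entorhinal, amygdala,
# parahippocampal, fusiform, inferior and middle temporal) and compared with the 1.25 cutoff for T+.
```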
Person-administered AVLT completed in clinic
The psychometrist reads a list of 15 words (List A) aloud and asks the participant to repeat back as many words as they can recall. This is repeated five times (learning trials 1–5). A distractor list (List B) is then presented, followed by short-delay recall of List A words (Trial 6). Recall of List A is again tested after 30 minutes (30-minute delay), followed by written recognition (Ferman et al., 2005; Stricker, Christianson, et al., 2020). The primary variable for this study is AVLT sum of trials (trials 1–5 total + trial 6 + 30-minute recall; range 0–105), which is sensitive to early changes in memory (Jack et al., 2015). Additional variables include correct words on trials 1–5 total (thought to reflect learning), as well as 30-minute delay. Long-term percentage retention (AVLT 30-minute delay / trial 5), thought to reflect storage/savings, is also reported.
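As a concrete illustration of how these derived AVLT scores relate to one another, a minimal sketch follows; the trial-level counts are invented placeholders, not study data.

```r
# Minimal sketch (R): AVLT derived scores from per-trial correct counts (placeholder values).
avlt_trials_1_to_5 <- c(6, 8, 10, 12, 13)  # correct words on learning trials 1-5 (0-15 each)
avlt_trial6        <- 11                   # short-delay recall of List A after the distractor list
avlt_delay_30min   <- 10                   # List A recall after 30 minutes

avlt_1_5_total  <- sum(avlt_trials_1_to_5)                          # range 0-75
avlt_sum_trials <- avlt_1_5_total + avlt_trial6 + avlt_delay_30min  # primary variable, range 0-105
avlt_retention  <- avlt_delay_30min / avlt_trials_1_to_5[5]         # 30-minute delay / trial 5
```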
Self-administered SLS completed remotely (not in clinic)
All participants completed the SLS remotely and without supervision or assistance. Participants followed a link provided in an email to complete the test session. The SLS is administered via the Mayo Test Drive (MTD) platform (Stricker et al., 2022).
The SLS is a 5-trial adaptive list learning task (Figure 1). Single words are visually presented sequentially during learning trials. After each list presentation, memory for the word list is tested with 4-choice recognition. Following a computer adaptive testing approach, the SLS starts with eight items; the number of words then stays the same, increases by five, or decreases by two according to pre-specified rules based on the percentage of correct responses, extending the floor and ceiling relative to traditional word list memory tests (range 2–23 words; Figure 2). Short delay follows the Symbols Test (Nicosia et al., 2023; Stricker et al., 2022); all items presented on any learning trial are tested during delay (range 8–23). Select screenshots have been previously published (Stricker et al., 2022).
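To make the adaptive mechanics concrete, a minimal sketch of a span-update rule of this kind is shown below. The exact pre-specified percentage cutoffs are not given here, so the 75% and 50% thresholds in the sketch are illustrative assumptions, not the SLS’s actual rules.

```r
# Minimal sketch (R) of a computer-adaptive list-length update in the spirit of the SLS.
# NOTE: the percentage cutoffs are illustrative placeholders, not the SLS's pre-specified rules.
update_span <- function(current_span, pct_correct) {
  if (pct_correct >= 0.75) {          # strong recognition performance: lengthen the list by 5
    new_span <- current_span + 5
  } else if (pct_correct >= 0.50) {   # middling performance: keep the list length the same
    new_span <- current_span
  } else {                            # weak performance: shorten the list by 2
    new_span <- current_span - 2
  }
  min(max(new_span, 2), 23)           # list length is bounded between 2 and 23 words
}

update_span(8, 0.90)   # a participant starting at 8 items with 90% correct moves to 13 items
```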
The SLS uses common, high-frequency words that are easier to recall but harder to recognize (Lohnas & Kahana, 2013), as previously described (Stricker et al., 2022). There are 23 item bins with 4 words each, and words within a bin have similar imagery ratings (Clark & Paivio, 2004). Each successive item bin has lower imagery ratings, thus increasing the difficulty of subsequent items. Most (90%) of the 92 total words used are on the Dolch sight words reading list (preschool through Grade 3), with half at the preschool level (Dolch, 1936). Each test session randomly selects one word from each item bin as the “target”; the three others serve as foils, and a 23-item target word list is generated, even if not all items are presented due to the adaptive procedure. To reduce recency effects, the order of item presentation is randomized for each trial and the last item presented is never the first tested. The primary variable is SLS sum of trials (total correct across learning trials 1–5 plus delay, range 0–108). Secondary variables include maximum (max) learning span across any learning trial (range 0–23), total correct across learning trials (1–5 total, range 0–85), and total correct short delay (range 0–23). Percent retention (delay / max span) is also reported to allow within-test comparisons, but this measure is not meant to be compared to AVLT percent retention as differences are expected based on differences in test design.
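For reference, the SLS-derived scores described above can be assembled from trial-level counts as in the sketch below (placeholder values, not the platform’s scoring code):

```r
# Minimal sketch (R): SLS derived scores from trial-level correct counts (placeholder values).
sls_trials_1_to_5 <- c(7, 10, 14, 17, 19)   # correct recognitions on learning trials 1-5
sls_delay         <- 18                     # correct recognitions at short delay (0-23)

sls_1_5_total  <- sum(sls_trials_1_to_5)    # range 0-85
sls_max_span   <- max(sls_trials_1_to_5)    # maximum learning span, range 0-23
sls_sum_trials <- sls_1_5_total + sls_delay # primary variable, range 0-108
sls_retention  <- sls_delay / sls_max_span  # delay / max span
```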
Inclusion criteria
To be included in this study, participants had to have both SLS sum of trials and AVLT sum of trials available and an amyloid PET scan within 3 years. All but two participants also had a tau PET scan available within 3 years. Participants who completed the SLS as of 7/7/22 were included in this study. Parent study data available as of 8/22/22 were included.
Statistical methods
Demographics and clinical characteristics were descriptively summarized using counts and percentages for categorical data and means and standard deviations for continuous data. Data distributions across groups (A− vs A+ and A−T− vs A+T+) were compared using chi-square tests for categorical variables and linear model ANOVA tests for continuous variables. Pearson correlation coefficients were used to characterize the linear relationship between AVLT and SLS variables. Unadjusted and adjusted Hedges’ g with weighted and pooled standard deviation was used to assess effect size for group comparisons. Unadjusted and adjusted logistic regression models were used to determine the predictive accuracy of AVLT and SLS sum of trials in predicting abnormal amyloid PET (A+ vs A−) and abnormal amyloid and tau PET (A+T+ vs A−T−). To formally compare the ability of both tests to differentiate biomarker-defined groups, AUROCs from models with the AVLT were directly compared to AUROCs from models with the SLS (Therneau, 2021). For models that adjusted for demographic variables, age, sex, and education were the adjustment terms for both Hedges’ g and logistic regression models. A two-sided p-value <0.05 was considered statistically significant. All analyses were performed using R version 4.1.2.
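The sketch below illustrates these analysis steps in R on simulated data. The data frame, variable names, and the use of pROC’s DeLong test for the AUROC comparison are assumptions for illustration only (the study cites Therneau, 2021, for its AUROC comparison method).

```r
# Minimal sketch (R) of the effect-size and predictive-accuracy analyses on simulated data.
# Variable names and the simulated data frame are placeholders, not the study's analysis code.
library(pROC)

set.seed(2022)
n  <- 300
df <- data.frame(group = rbinom(n, 1, 0.2),           # 1 = biomarker-positive (e.g., A+T+)
                 age = rnorm(n, 72, 10),
                 sex = rbinom(n, 1, 0.5),
                 educ = rnorm(n, 16, 2))
df$sls_sum  <- 80 - 10 * df$group + rnorm(n, 0, 12)   # lower scores in the positive group
df$avlt_sum <- 60 - 8 * df$group + rnorm(n, 0, 10)

# Hedges' g with pooled standard deviation and small-sample correction
hedges_g <- function(x_neg, y_pos) {
  nx <- length(x_neg); ny <- length(y_pos)
  sp <- sqrt(((nx - 1) * var(x_neg) + (ny - 1) * var(y_pos)) / (nx + ny - 2))
  g  <- (mean(y_pos) - mean(x_neg)) / sp               # negative when the positive group scores lower
  g * (1 - 3 / (4 * (nx + ny) - 9))
}
g_sls <- hedges_g(df$sls_sum[df$group == 0], df$sls_sum[df$group == 1])

# Unadjusted and demographically adjusted logistic regression models
fit_sls     <- glm(group ~ sls_sum, family = binomial, data = df)
fit_avlt    <- glm(group ~ avlt_sum, family = binomial, data = df)
fit_sls_adj <- glm(group ~ sls_sum + age + sex + educ, family = binomial, data = df)

# AUROC per model and a direct comparison of the two tests' discrimination
roc_sls  <- roc(df$group, predict(fit_sls,  type = "response"), quiet = TRUE)
roc_avlt <- roc(df$group, predict(fit_avlt, type = "response"), quiet = TRUE)
roc.test(roc_sls, roc_avlt, paired = TRUE)             # DeLong test as one comparison approach
```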
Results
Participant characteristics
The mean age of the 353 participants was 71.8 (SD = 10.8) years, mean education was 15.7 (SD = 2.4) years, 53.5% were male, 98.0% were White, and 92.6% were cognitively unimpaired (Table 1). On average, MTD remote testing was completed within half a month of the in-person visit. Participant characteristics by biomarker subgroups are reported in Table 2. As expected, based on known increases in A+ and T+ rates with increasing age (Jack et al., 2008; Jack et al., 2017; Vemuri et al., 2017), biomarker positive groups were older than biomarker negative groups (p’s < .05). Biomarker positive and negative groups showed similar years of education and sex distribution (p’s > .05).
Note. ADRC = Mayo Alzheimer’s Disease Research Center; AVLT = Auditory Verbal Learning Test; AVLT Sum of Trials = AVLT 1–5 total + Trial 6 + 30-minute delay; AVLT Recognition Percent Correct = {[recognition hits + (15 – recognition false positive errors)]/30} × 100; AVLT Retention = AVLT 30-minute delay / Trial 5; CDR = Clinical Dementia Rating Scale; MCSA = Mayo Clinic Study of Aging; MTD = Mayo Test Drive; SLS = Stricker Learning Span; SLS Max Span = maximum number of words recognized across any learning trial; SLS 1–5 Total = sum of words correctly recognized across trials 1–5; SLS Retention = SLS Delay / SLS Max Span; SLS Sum of Trials = SLS 1–5 total + delay. Table used with permission of Mayo Foundation for Medical Education and Research; all rights reserved.
1 n = 8 Mayo Alzheimer’s Disease Research Center (ADRC); n = 345 Mayo Clinic Study of Aging (MCSA)
2 n = 2 Asian, n = 3 Black, n = 2 More than one
3 n = 1 missing
4 n = 1 missing
5 consensus diagnosis not yet available for this participant; CDR = 0; amyloid and tau PET negative status.
6 n = 8 missing; subjective memory concern is “yes” when any of Blessed Memory Test questions 1–4 is marked as worse, or when question 5 indicates any other problems with thinking or memory.
7 n = 2 missing
8 n = 1 missing due to AVLT Trial 5 score of 0.
Note. A = amyloid; ADRC = Mayo Alzheimer’s Disease Research Center; AVLT = Auditory Verbal Learning Test; AVLT 1–5 Total = sum of words correctly recalled across trials 1–5; AVLT Sum of Trials = AVLT 1–5 total + Trial 6 + 30-minute delay; CDR = Clinical Dementia Rating Scale; CU = Cognitively Unimpaired; MCI = Mild Cognitive Impairment; MCSA = Mayo Clinic Study of Aging; MTD = Mayo Test Drive; SLS = Stricker Learning Span; SLS 1–5 Total = sum of words correctly recognized across trials 1–5; SLS Sum of Trials = SLS 1–5 total + delay; STMS = Kokmen Short Test of Mental Status; T = tau. See Supplemental Table 2 for group difference comparison results for additional SLS and AVLT variables. Table used with permission of Mayo Foundation for Medical Education and Research; all rights reserved.
Aim 1: SLS shows similar ability to differentiate PET-defined biomarker groups compared to the AVLT (all participants)
AUROC comparisons. Total AUROC values for SLS sum of trials vs. AVLT sum of trials were similar for differentiating biomarker groups (p’s > .05; Table 3, Figure 3). This similarity was seen for all pairwise AUROC comparisons (adjusted and unadjusted models; A− vs A+ and A−T− vs A+T+). These four direct AUROC comparisons support our hypothesis and represent the primary test of Aim 1 given that inclusion of all available participants limits concerns about circularity present when the sample is restricted to CU individuals only.
Note. The biomarker negative group is the reference group (e.g., A− vs A+ for part A; A−T− vs A+T+ for part B). AUROC = area under the receiver operating characteristic curve; AVLT = Rey’s Auditory Verbal Learning Test; AVLT sum of trials = trials 1–5 total correct + trial 6 short-delay correct + 30-minute delay correct, in raw score units. SLS = Stricker Learning Span; SLS sum of trials = 1–5 correct + delay correct, in raw score units. Table used with permission of Mayo Foundation for Medical Education and Research; all rights reserved.
1 Note that SLS and AVLT odds ratios cannot be compared directly since these two scores are on different scales (raw scores are used). AUROC data are the focus of test comparisons. Lower test performance is associated with higher odds of being in the biomarker positive group. For example, for the A−T− vs A+T+ comparison (all participants, unadjusted model OR = 0.638), each 10-point decrease in SLS sum of trials is associated with 57% increased odds of being A+T+ (95% CI 32%–87%), calculated as e^(−ln(OR)) = 1/OR ≈ 1.57.
2 Note, for yes/no responses the AUROC is equivalent to concordance, the probability that a randomly selected participant with a yes outcome (positive biomarker group) will have a larger predicted probability than a randomly selected participant with a no outcome (negative biomarker group).
Amyloid groups. Unadjusted models using only the primary cognitive variable as the predictor show that both the SLS and AVLT significantly differentiate A− vs A+ (AUROCs of 0.63 and 0.64, respectively). Adjusted models that include demographic variables increase the overall AUROC values of the full model (both AUROCs = 0.76), and both the SLS and AVLT significantly improve biomarker group prediction over and above the demographic variables.
Amyloid and tau groups. Unadjusted models using only the primary cognitive variable as the predictor show that both the SLS and AVLT significantly differentiate individuals without AD biomarkers (A−T−) from those with biological AD (A+T+; AUROCs of 0.72–0.73). Adjusted models that include demographic variables increase the overall AUROC values of the full model (AUROCs of 0.83–0.84), and both the SLS and AVLT significantly improve biomarker group prediction over and above the demographic variables.
Descriptive effect sizes from group difference analyses. We report effect sizes from mean group comparisons to additionally characterize the magnitude of these effects for both unadjusted and adjusted models (see Figure 4 and Supplemental Table 2). The pattern of results for these parametric analyses is similar to that of the non-parametric AUROC analyses.
Aim 2: SLS shows sensitivity to preclinical AD (CU participants only)
Findings show that the SLS is sensitive to preclinical AD, consistent with our Aim 2 hypothesis.
AUROC comparisons. Total AUROC values for SLS sum of trials vs. AVLT sum of trials were similar for differentiating biomarker groups in CU participants (p’s > .05 for each pairwise comparison; Table 3). Note that direct comparisons of SLS and AVLT should be viewed cautiously when results are limited to CU participants given that the AVLT has some circularity with diagnosis (it is considered by the neuropsychologist in conjunction with eight other in-person neuropsychological tests and then discussed in the consensus meeting), whereas the SLS is independent of diagnosis (results are not available to consensus team members). Thus, SLS results for analyses limited to CU participants can be readily interpreted. AVLT results are presented for reference, but circularity may impact findings.
Amyloid groups. Unadjusted models using only the primary cognitive variable as the predictor show that both the SLS and AVLT significantly differentiate CU A− vs CU A+ (both AUROCs = 0.63). Adjusted models that include demographic variables increase the overall AUROC values of the full model (AUROCs = 0.76–0.77), and both the SLS and AVLT significantly improve biomarker group prediction over and above the demographic variables.
Amyloid and tau groups. Unadjusted models using only the primary cognitive variable as the predictor show that both the SLS and AVLT significantly differentiate CU individuals without AD biomarkers (A−T−) from CU participants with biological AD (A+T+; AUROCs = 0.67–0.69). Adjusted models that include demographic variables increase the overall AUROC values of the full model (AUROCs = 0.81–0.83). The SLS significantly improved biomarker group prediction over and above the demographic variables; the AVLT approached significance (p = 0.06).
Descriptive effect sizes. Effect sizes from mean group comparisons are also reported (see Figure 4 and Supplemental Table 2).
Aim 3: Convergent validity
SLS sum of trials and AVLT sum of trials were strongly correlated (r = 0.62, p < .001). Additional correlations are reported in Table 4 and Supplemental Figure 1.
Note. All correlations are significant (p’s < 0.001). Correlations in bold show the relationship between the most similar AVLT and SLS measures. AVLT = Auditory Verbal Learning Test; AVLT Sum of Trials = AVLT 1–5 total + Trial 6 + 30-minute delay; SLS = Stricker Learning Span; SLS Max Span = maximum number of words recognized across any of the five learning trials; SLS 1–5 Total = sum of words correctly recognized across trials 1–5; SLS Sum of Trials = SLS 1–5 total + delay. See Supplemental Figure 1 for a full correlation matrix with additional measures. Table used with permission of Mayo Foundation for Medical Education and Research; all rights reserved.
Aim 4: Learning measures show similar ability as delay memory measures to differentiate A−T− and A+T+ groups for both the SLS and the AVLT (all participants)
All secondary SLS and AVLT variables show results in the expected direction, with lower performance in the A+T+ compared to the A−T− group (p’s < .05 for both unadjusted and adjusted analyses), and with generally similar effect sizes across comparable SLS and AVLT variable pairs (Figure 5). Within-test descriptive comparisons show that learning variables (1–5 total) have effect sizes similar in magnitude to delay variables. For example, SLS 1–5 (g = −.86) and delay (g = −.88) both show large unadjusted effect sizes, and AVLT 1–5 (g = −.82) and delay (g = −.86) also show large unadjusted effect sizes (Figure 4). Trials 1–5 total (g = −.86 SLS and −.82 AVLT) may be a slightly more advantageous learning measure for group discrimination relative to SLS max span (−.81) or AVLT Trial 5 (−.74) (Supplemental Table 2). Similarly, comparison of two different types of delayed memory measures suggests a slight advantage for delay total correct relative to retention (−.88 vs −.55 for SLS and −.86 vs −.79 for AVLT, respectively). See Figure 6 for a visualization of trial-by-trial data for both the SLS and AVLT. See Supplemental Tables 1 and 2 for results of adjusted analyses, other subgroup comparisons, and results of other memory tests.
Discussion
This study followed a novel approach to test validation and established the criterion validity of an unsupervised computer adaptive word list memory test (SLS) completed outside of a clinic setting. The SLS differentiates AD biomarker-defined groups as well as a traditional word list recall test administered by trained psychometrists in a clinic setting (AVLT). Specifically, our Aim 1 hypothesis was supported by AUROC comparisons showing that remotely administered SLS sum of trials and in-person-administered AVLT sum of trials have comparable ability to differentiate individuals on the Alzheimer’s continuum (A+) or not (A−) and individuals meeting research framework criteria for a biological diagnosis of AD (A+T+) or not (A−T−) in a predominantly cognitively unimpaired sample.
In line with our prior results showing that the AVLT has the potential to be useful for detecting subtle objective cognitive decline in preclinical AD (Stricker, Lundt, Albertson, et al., 2020), our current results extend this prior AVLT finding and suggest that the SLS also has promise in this regard (Aim 2). Specifically, when limiting the sample to CU participants, our AUROC results show that the SLS by itself could help predict, better than chance, which individuals had elevated brain amyloid vs. did not and which had elevated brain amyloid and tau vs. did not. In contrast, our prior work examining the Learning/Working Memory index, comprised of visual recognition and working memory tasks from the Cogstate Brief Battery administered in clinic, showed that this index did not differentiate biomarker groups better than chance (CU A−T− vs CU A+T+ or CU A−T− vs CU A+T−) and that the AVLT was significantly better than the Learning/Working Memory index for differentiating CU A−T− vs CU A+T− when comparing total AUROCs. While the predictive ability of the SLS by itself is relatively modest, predictive ability improves when demographic variables are added to the model, and the SLS continues to show an independent effect over and above demographic variables. For example, a model with age, sex, education, and SLS sum of trials together had an AUROC of 0.83 for predicting A+T+ status in CU participants. Thus, our current results suggest that the SLS could be a scalable, easily accessible addition to a multivariable model approach that improves overall prediction of AD risk. For example, the addition of a word list recall measure has previously shown added utility for predicting elevated brain amyloid in individuals without dementia, over and above age alone or age combined with APOE ϵ4 carrier status (Maserejian et al., 2019). Given its capacity for remote self-administration, the SLS would be a good candidate screening measure in nonspecialty care settings to use in combination with such predictive models to inform the need for further work-up. Plasma biomarkers will also likely be a critical component of future predictive models, particularly for preclinical disease stages (Brand et al., 2022). Establishing evidence of some independent utility for cognitive measures of interest is an important first step prior to inclusion in research to develop such models in the future. It will also be critical for any such future investigations to include adequate representation of individuals from under-represented groups to ensure broader applicability of results (Ashford et al., 2021).
The highly overlapping results for the ability of the SLS and AVLT to discriminate AD biomarker groups are particularly interesting given that, although the SLS was designed to mimic the sensitivity of the AVLT, it was not designed to be a one-to-one adaptation of the AVLT. Results of correlation analyses align with this intent and support our hypothesis that the SLS and AVLT would show a significant correlation (r = 0.62), further supporting convergent validity (Aim 3). Our initial pilot study similarly supported the convergent validity of the SLS, but with slightly lower correlation coefficients, likely due to the homogeneity of that sample (e.g., all female, restricted age range, exclusion of individuals with dementia) and potentially due to a longer duration between in-person and remote testing in that study (average 10 months) (Stricker et al., 2022). Our results are particularly notable given that a previous study using an exact computerized replication of the AVLT, facilitated by audio recording and speech recognition to allow self-administration on an iPad, showed only slightly higher correlations (r = 0.63–0.70) with the AVLT as typically administered in a well-controlled cross-over design completed in a clinic setting (Morrison et al., 2018).
We also qualitatively compared commonly derived supraspan word list indices within each test given that learning indices may have utility equivalent to delayed memory indices for the early detection of AD (Belleville et al., 2017; Weissberger et al., 2017). Results supported our hypothesis that word list learning measures would show similar sensitivity as word list delay memory measures to biologically defined AD (A−T− vs A+T+; Aim 4). The effect sizes of learning trials and delayed memory are similar (Figure 5). We chose to focus on delay items correct instead of percent retention (i.e., savings) for the comparison to learning indices. It is important to note that retention is largely dependent on specific test design characteristics, including the influence of serial position effects (Atkinson & Shiffrin, 1968; Gavett & Horwitz, 2012; Greene et al., 1996). The SLS randomizes word order to minimize the recency effects often observed in individuals with Alzheimer’s dementia that lead to an over-estimate of true forgetting (Cunha et al., 2012). For example, learning to criterion studies demonstrate that individuals in the early stages of AD take a longer time to reach criterion, which is reflective of lower learning ability; however, once learning is equated by reaching criterion, rates of forgetting are similar to those of healthy control participants (Greene et al., 1996; Grober & Kawas, 1997; Stamate et al., 2020). Accordingly, SLS learning indices (1–5, max span) demonstrate a larger effect size than SLS retention in biologically defined AD versus those without AD biomarkers, whereas the AVLT retention effect size is more similar to AVLT learning indices (see Supplemental Table 2).
This study has several strengths. Most studies examining the ability of cognitive measures to differentiate biomarker groups have focused on differentiation of A+ versus A− individuals (Baker et al., 2017; Duke Han et al., 2017). Our inclusion of tau status defined by PET imaging, in addition to amyloid, is a strength. Our population-based sample helps to increase generalizability to clinical settings where comorbidities are common. Our approach of reporting both unadjusted and adjusted effect sizes illustrates the robust biomarker-group difference effect sizes observed in unadjusted analyses. For example, the SLS showed a large group difference effect size across A−T− and A+T+ groups in all participants (−0.88), which decreased to a medium effect size when controlling for demographics (−0.53) and was further attenuated when analyses were limited to CU participants only (−0.74 unadjusted, −0.37 adjusted). There is growing evidence that cognitive decline is not a normal part of the aging process, but rather is reflective of previously undetected neuropathologies that increase in prevalence with increasing age (Bos et al., 2018; Boyle et al., 2021; Harrington et al., 2018). To maximize the utility of cognitive measures for informing risk of AD biomarker positivity, we recommend the use of raw scores and argue that the effect of age should not be routinely “adjusted” away, as doing so decreases the predictive power of cognition.
Several limitations should also be noted. First, because this is a population-based study, the racial and ethnic characteristics reflect those of Olmsted County, from which participants are randomly sampled, resulting in a predominantly White, Non-Hispanic sample. Second, the predominantly CU composition of this sample may have decreased AUROC values and the magnitude of effect sizes for differentiating biomarker positive and negative groups, as suggested by the highly similar results seen when limiting analyses only to CU participants. A more balanced sample design, with greater inclusion of individuals with mild to moderate dementia, could support greater utility of these memory measures for identifying individuals at risk for biomarker positivity; the current results are more relevant for preclinical detection given that 93% of our sample is CU. In addition, the population-based nature of the MCSA sample may produce lower AUROC values for predicting elevated brain biomarkers than studies that use a convenience sample, which frequently includes a higher number of individuals at risk of Alzheimer’s disease based on family history or subjective concerns, and studies that have strict inclusion criteria to limit potential comorbidities (Maserejian et al., 2019). MCSA participants have higher rates of comorbid conditions given that exclusionary criteria are limited to terminal illness or hospice care (Roberts et al., 2008). Third, a majority of the sample (83%) had prior exposure to the AVLT given the longitudinal nature of the MCSA and ADRC studies; thus, practice effects could have impacted the ability of the AVLT to discriminate biomarker groups. Because biomarker negative participants benefit more from practice effects than biomarker positive participants, it is possible this could have amplified group difference effects for the AVLT (Alden et al., 2022; Machulda et al., 2017). Future work is needed to replicate these results in a setting where both the SLS and AVLT are baseline administrations. Similarly, the entirely unsupervised and remote approach for the SLS could dampen the sensitivity of the SLS, as the results presented in this study include all available remote data.
We capture participant-reported information about test interference, noise in the test environment, and participant comments that can provide additional information about test interruptions or environmental considerations. However, because our goal was to establish the robust criterion validity of the SLS “in the wild,” we did not apply any exclusionary criteria based on this information in the present study. Another reason we did not apply such exclusionary criteria is that individuals who are less able to follow the instructions provided for the recommended test environment may be more likely to have cognitive impairment. Thus, increased likelihood of lower test performance in an uncontrolled environment, worsened by environmental distractions, could also be related to risk of cognitive decline. If cognitive screening/risk for cognitive decline is the goal, worse performance in remote settings may help identify risk in a way not captured by controlled clinical settings, adding an element of ecological validity or ability to adapt to a new task without assistance. Future work will examine whether and to what degree these factors may influence test performance, as increased distractions in the home environment can negatively impact performance (Madero et al., 2021). Similarly, individuals with low technological literacy may perform more poorly on the SLS because of a lack of familiarity or comfort with mobile devices or computers. Our approach of allowing individuals to choose the device they are most comfortable with helps address this to some degree. Even though most adults in the U.S. have access to some device, individuals from disadvantaged backgrounds may not have access to cellular service, Wi-Fi, or broadband internet at home, although only 7% of Americans report they do not use the internet across any of these access methods, and this proportion has dramatically declined since 2000 (Perrin & Atske, 2021). We also cannot rule out the possibility that some individuals may have written down words to benefit their performance; however, given that this is a research study, there would be no apparent incentive to artificially increase performance. In addition, there are elements of test design that help to deter this to some extent. First, word order is randomized for each trial; because the words are not presented in the same order, it is more difficult to write them down each time. Second, by the 4th learning trial, there are 23 words for high performers, so it would be quite burdensome to write all the words down; doing so would also greatly increase the time to complete the measure. We review the data for outliers with regard to time to completion overall and for each test, and we did not have specific concerns that this occurred in the current sample. Finally, while use of a strictly biomarker-defined ground truth is a novel aspect of this study, in vivo PET biomarkers also have some limitations, such as high but imperfect reliability (manifested by “noise” in the trajectories of imaging results over time in some individuals) and the fact that PET measures of amyloid and tau pathology have a sensitivity floor; medically significant pathology can exist beneath this detection threshold (Lee et al., 2022).
Also, we adopted a liberal window for inclusion of available biomarker data to allow for some missed scanning opportunities during the COVID-19 pandemic, to maximize the sample size, and because of the generally good stability of amyloid and tau classifications (Jack et al., 2019), but this also decreases study precision relative to a narrower time window.
In summary, SLS test design prioritized remote assessment needs and a computer-adaptive approach (Stricker et al., 2022). Even though the SLS is not a direct adaptation of the AVLT, our results show highly similar ability of the remotely administered SLS and in-person-administered AVLT to differentiate AD biomarker-defined groups. These results challenge preconceived notions about memory assessment by showing that creative use of a recognition memory paradigm that emphasizes learning in an all-remote unsupervised sample differentiates AD biomarker-defined groups as effectively as a traditional word list memory measure based on free recall responses.
Acknowledgements
Research reported in this publication was supported by the Kevin Merszei Career Development Award in Neurodegenerative Diseases Research IHO Janet Vittone, MD, the Rochester Epidemiology Project (R01 AG034676), the National Institute on Aging of the National Institutes of Health (grant numbers R21 AG073967, P30 AG062677, P50 AG016574, U01 AG006786, RF1 AG55151, R01 AG041851, R37 AG011378), the Robert Wood Johnson Foundation, The Elsie and Marvin Dekelboum Family Foundation, GHR Foundation, Alzheimer’s Association, and the Mayo Foundation for Education and Research. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. A Mayo Clinic invention disclosure has been submitted for the Stricker Learning Span and the Mayo Test Drive platform (NHS, JLS). We have no other conflicts of interest to disclose related to this work. The authors wish to thank the participants and staff at the Mayo Clinic Study of Aging and Mayo Alzheimer’s Disease Research Center.
Supplementary material
The supplementary material for this article can be found at https://doi.org/10.1017/S1355617723000322.