Unsupervised high-frequency smartphone-based cognitive assessments are reliable, valid, and feasible in older adults at risk for Alzheimer’s disease

Jessica Nicosia; Andrew J. Aschenbrenner; David A. Balota; Martin J. Sliwinski; Marisol Tahan; Sarah Adams; Sarah S. Stout; Hannah Wilks; Brian A. Gordon; Tammie L. S. Benzinger; Anne M. Fagan; Chengjie Xiong; Randall J. Bateman; John C. Morris; Jason Hassenstab

doi:10.1017/S135561772200042X

Published online by Cambridge University Press: 05 September 2022

Jessica Nicosia: Affiliation:
Charles F. and Joanne Knight Alzheimer Disease Research Center, Department of Neurology, Washington University, School of Medicine, St. Louis, MO, USA
Andrew J. Aschenbrenner: Affiliation:
Charles F. and Joanne Knight Alzheimer Disease Research Center, Department of Neurology, Washington University, School of Medicine, St. Louis, MO, USA
David A. Balota: Affiliation:
Department of Psychological & Brain Sciences, Washington University in St. Louis, St. Louis, MO, USA
Martin J. Sliwinski: Affiliation:
Department of Human Development and Family Studies, The Pennsylvania State University, University Park, PA, USA
Marisol Tahan: Affiliation:
Charles F. and Joanne Knight Alzheimer Disease Research Center, Department of Neurology, Washington University, School of Medicine, St. Louis, MO, USA
Sarah Adams: Affiliation:
Charles F. and Joanne Knight Alzheimer Disease Research Center, Department of Neurology, Washington University, School of Medicine, St. Louis, MO, USA
Sarah S. Stout: Affiliation:
Charles F. and Joanne Knight Alzheimer Disease Research Center, Department of Neurology, Washington University, School of Medicine, St. Louis, MO, USA
Hannah Wilks: Affiliation:
Charles F. and Joanne Knight Alzheimer Disease Research Center, Department of Neurology, Washington University, School of Medicine, St. Louis, MO, USA
Brian A. Gordon: Affiliation:
Department of Psychological & Brain Sciences, Washington University in St. Louis, St. Louis, MO, USA Department of Radiology, Washington University, School of Medicine, St. Louis, MO, USA
Tammie L. S. Benzinger: Affiliation:
Department of Radiology, Washington University, School of Medicine, St. Louis, MO, USA
Anne M. Fagan: Affiliation:
Charles F. and Joanne Knight Alzheimer Disease Research Center, Department of Neurology, Washington University, School of Medicine, St. Louis, MO, USA
Chengjie Xiong: Affiliation:
Charles F. and Joanne Knight Alzheimer Disease Research Center, Department of Neurology, Washington University, School of Medicine, St. Louis, MO, USA Division of Biostatistics, Washington University, School of Medicine, St. Louis, MO, USA
Randall J. Bateman: Affiliation:
Charles F. and Joanne Knight Alzheimer Disease Research Center, Department of Neurology, Washington University, School of Medicine, St. Louis, MO, USA
John C. Morris: Affiliation:
Charles F. and Joanne Knight Alzheimer Disease Research Center, Department of Neurology, Washington University, School of Medicine, St. Louis, MO, USA
Jason Hassenstab*: Affiliation:
Charles F. and Joanne Knight Alzheimer Disease Research Center, Department of Neurology, Washington University, School of Medicine, St. Louis, MO, USA Department of Psychological & Brain Sciences, Washington University in St. Louis, St. Louis, MO, USA
*: Corresponding author: Jason Hassenstab, email: [email protected]

Abstract

Objective:

Smartphones have the potential for capturing subtle changes in cognition that characterize preclinical Alzheimer’s disease (AD) in older adults. The Ambulatory Research in Cognition (ARC) smartphone application is based on principles from ecological momentary assessment (EMA) and administers brief tests of associative memory, processing speed, and working memory up to 4 times per day over 7 consecutive days. ARC was designed to be administered unsupervised using participants’ personal devices in their everyday environments.

Methods:

We evaluated the reliability and validity of ARC in a sample of 268 cognitively normal older adults (ages 65–97 years) and 22 individuals with very mild dementia (ages 61–88 years). Participants completed at least one 7-day cycle of ARC testing and conventional cognitive assessments; most also completed cerebrospinal fluid, amyloid and tau positron emission tomography, and structural magnetic resonance imaging studies.

Results:

First, ARC tasks were reliable, as between-person reliability across the 7-day cycle and test-retest reliabilities at 6-month and 1-year follow-ups all exceeded 0.85. Second, ARC demonstrated construct validity as evidenced by correlations with conventional cognitive measures (r = 0.53 between composite scores). Third, ARC measures correlated with AD biomarker burden at baseline to a similar degree as conventional cognitive measures. Finally, the intensive 7-day cycle indicated that ARC was feasible (86.50% of those approached chose to enroll), well tolerated (80.42% adherence, 4.83% dropout), and rated favorably by older adult participants.

Conclusions:

Overall, the results suggest that ARC is reliable and valid and represents a feasible tool for assessing cognitive changes associated with the earliest stages of AD.

Keywords

digital biomarkers; mobile testing; preclinical Alzheimer’s disease; ecological momentary assessment

Type: Research Article
Information: Journal of the International Neuropsychological Society, Volume 29, Issue 5, June 2023, pp. 459-471

DOI: https://doi.org/10.1017/S135561772200042X
Copyright: Copyright © INS. Published by Cambridge University Press, 2022

Introduction

There have been remarkable developments in fluid and neuroimaging biomarkers that track the progression of Alzheimer’s disease (AD). AD biomarkers can identify pathological changes in amyloid and tau that occur well before symptom onset (Barthélemy et al., 2020; Bateman et al., 2017; Price et al., 2009; Sperling et al., 2011). Despite these developments, advances in the measurement of cognitive decline – the essence of the disease phenotype – have lagged behind. Secondary prevention trials targeting abnormal biomarker levels in preclinical (presymptomatic) AD are determined to be successful if they stop or slow cognitive decline (Edgar et al., 2019; Food and Drug Administration, 2018). Because the declines in cognition that occur in preclinical AD are subtle, capturing declines, slowing of declines, or improvements requires reliable cognitive tests that are sensitive to AD pathological processes. However, standard cognitive assessment tools used in AD studies include classic neuropsychological tests that were originally designed to detect overt cognitive impairments or measure facets of intelligence (Sheehan, 2012; Weintraub et al., 2009; Woodford & George, 2007) and often place a heavy burden on participants. This poses a critical hurdle for randomized controlled trials (RCTs) examining therapeutics in preclinical and early-stage symptomatic AD populations. Measures with sub-optimal reliability require larger sample sizes to detect cognitive benefits, particularly when the expected effects are subtle (Dodge et al., 2015).

Advances in smartphone technology have allowed researchers to embed brief cognitive measures into ecological momentary assessments (EMA). EMA methods investigate psychological states and behaviors as they occur in natural environments (Shiffman et al., 2008; Sliwinski et al., 2018; Smyth & Stone, 2003). EMA is defined by several features: (1) data are collected as participants go about their daily lives; (2) assessments are randomly sampled across various occasions to characterize an individual’s average performance on a given variable of interest; and (3) participants perform multiple short assessments to capture behavioral changes over time and across different situations (Sliwinski et al., 2018).

Although traditional laboratory/clinical settings afford precise control over the testing environment, this is not representative of everyday cognitive functioning (Sliwinski et al., 2018). The use of smartphone EMAs in cognitive research can assuage ecological validity concerns as participants perform assessments as they go about their daily lives. Additionally, repeated assessments can improve upon the reliability of conventional measures because they are not collected in just one testing session that may be influenced by variability in participants’ day-to-day stress and mood, amongst other factors (Sliwinski et al., 2018). In individuals with neurodegenerative disorders, cognitive performance can vary with time of day (Wilks et al., 2021), and day-to-day variability can be exaggerated (Matar et al., 2020), further exacerbating the impact of conventional measures’ low reliability. With EMA, aggregation across repeated measurements ameliorates effects of within-person variability and improves reliability by estimating average functioning (Shiffman et al., 2008; Sliwinski, 2008; Sliwinski et al., 2018). Although ambulatory cognitive testing is not necessarily a replacement of gold standard in-person cognitive testing, smartphone EMAs provide snapshots of cognition that may reveal unique patterns that cannot be captured with conventional testing.

Smartphone-based assessments may offer a more practical and logistically plausible solution for large-scale studies and clinical trials of AD. Allowing individuals to participate in research studies unsupervised, in familiar environments, and using their own devices can increase engagement, reduce experimenter effects (e.g. demand characteristics, “white coat” testing effects), bolster sample size and diversity, and make participation more accessible and inclusive for individuals who may otherwise be unable to come into the laboratory or clinic. Indeed, interest in smartphone studies is growing, and several studies have demonstrated the feasibility and validity of smartphone-based assessments for use in older adults and individuals with preclinical AD (Güsten et al., 2021; Hassenstab et al., 2020; Lancaster et al., 2020; Mackin et al., 2018; Nicosia et al., 2021; Öhman et al., 2021; Papp et al., 2021; Wilks et al., 2021), as well as the potential for high-frequency in-home monitoring to substantially increase the statistical power of therapeutic trials (Dodge et al., 2015).

The purpose of the present study was to evaluate the reliability, validity, and feasibility of unsupervised, high-frequency cognitive testing using participants’ personal smartphones. Tasks assessed associative memory, processing speed, and working memory in older adults and individuals with preclinical and early symptomatic AD. If the Ambulatory Research in Cognition smartphone application (ARC) is a reliable, valid, and feasible measure, ARC should: (1) demonstrate high between-subjects and retest reliability; (2) have construct validity (indexed by correlations with conventional cognitive measures); (3) demonstrate sensitivity to age and AD-related biomarkers; and (4) be well tolerated by older adults regardless of technology familiarity.

Methods

Participants

We recruited participants enrolled in ongoing studies of aging and dementia at the Charles F. and Joanne Knight Alzheimer Disease Research Center (Knight ADRC) at Washington University School of Medicine in St. Louis. ARC was designed to be sensitive to subtle changes in cognition in participants at risk for, or in the earliest stages of, AD; thus, enrollment in the ARC study was limited to those with a Clinical Dementia Rating® (CDR®; Morris, 1993) of 0 (cognitively normal) or 0.5 (very mild dementia). In-person enrollment began in February of 2020 and was halted in March 2020 due to the SARS-CoV-2 (COVID-19) pandemic. Therefore, beginning April 2020, the majority of participants were enrolled remotely. All participants provided informed consent, all procedures were approved by the Human Research Protections Office at Washington University in St. Louis, and the research was conducted in accordance with the Helsinki Declaration.

Clinical assessment

Clinical status was determined with the CDR, which uses a 5-point scale to characterize six domains of cognitive and functional performance (memory, orientation, judgment and problem solving, community affairs, home and hobbies, and personal care) that are applicable to AD and other dementias (Morris, 1993). CDR scores are determined through semi-structured interviews with the participant and an informant (i.e., family member or friend). A CDR score of 0 indicates cognitive normality, 0.5 = very mild dementia, 1 = mild dementia, 2 = moderate dementia, and 3 = severe dementia.

Conventional cognitive assessments

Conventional cognitive measures included measures of verbal fluency (Animals, Vegetables, and Verbal Fluency), episodic memory (Wechsler Memory Scale Paired Associates Recall, Free and Cued Selective Reminding Test (FCSRT) Free Recall, Craft Story 21 immediate and delayed recall), language (the Multilingual Naming Test; MINT), processing speed (Number Span Forward, Number Symbol Test¹), and working memory (Number Span Backwards; see Hassenstab et al., 2016 and Weintraub et al., 2018 for additional information). A global composite similar to the Preclinical Alzheimer’s Cognitive Composite (PACC; Donohue et al., 2014; Papp et al., 2017) was created by averaging the standardized scores from FCSRT free recall, Animal naming total score, Craft Story 21 delayed recall, and the total correct score from the Number Symbol test such that higher scores indicated better performance (Weintraub et al., 2009).

Ambulatory research in cognition (ARC) application

The ARC smartphone application is based on principles from EMA and administers brief tests of associative memory, processing speed, and working memory up to 4 times per day over 7 consecutive days. Sampling frequency and duration were chosen based on reliability, validity, and effect size estimates reported in Sliwinski et al. (2018). ARC is programmed to run on major operating system (OS) versions (currently iOS 12.0+ and Android OS 8.0+) on iOS and Android devices. Participants were encouraged to use their personal smartphones as long as minimum technical requirements were met. Individuals interested in participating who did not own a smartphone or whose smartphone did not meet our criteria were supplied a device (either iOS or Android) for the duration of the study. Device exclusion criteria included software issues, limited phone storage, physical damage, battery problems, or poor responsivity. A trained study coordinator (M.T.) provided participants with detailed instructions regarding the ARC application, and additional guidance on smartphone basics (including device setup and operation) was given to participants who were less familiar with smartphones. Throughout the study, the study coordinator provided extensive support for participants via phone, videoconferencing, email, and text messaging. Participants were reimbursed at a rate of $0.50 per completed assessment session. To incentivize consistent participation, participants received bonus payments for completing all 4 sessions on any given day ($1.00 per occurrence, max of $7.00), completing at least 2 assessments per day for 7 days ($6.00), and completing at least 21 assessments over 7 days ($5.00). The maximum compensation possible for one 7-day assessment visit was $32.00.
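The payment rules above can be checked with a short calculation. The following is a minimal sketch in Python; the function name `arc_cycle_payment` and its input format (a list of seven daily session counts) are illustrative assumptions and are not part of the ARC software.

```python
def arc_cycle_payment(sessions_per_day):
    """Payment in dollars for one 7-day cycle (hypothetical helper).

    sessions_per_day: list of 7 integers, each the number of completed
    sessions (0-4) on that day.
    """
    if len(sessions_per_day) != 7 or any(not 0 <= s <= 4 for s in sessions_per_day):
        raise ValueError("expected 7 daily counts between 0 and 4")

    total = sum(sessions_per_day)
    pay = 0.50 * total                                   # base rate per completed session
    pay += 1.00 * sum(1 for s in sessions_per_day if s == 4)  # bonus for each full day
    if all(s >= 2 for s in sessions_per_day):            # at least 2 sessions every day
        pay += 6.00
    if total >= 21:                                      # at least 21 sessions overall
        pay += 5.00
    return pay
```

Under these rules, a participant who completes all 28 sessions earns $14.00 in base payments plus $7.00, $6.00, and $5.00 in bonuses, which matches the stated maximum of $32.00.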

ARC assessment notifications were administered pseudorandomly throughout the participant’s self-reported awake hours, with at least 2 hr between each testing session. For example, if a participant reported waking up at 7 am and going to bed at 10 pm, they would receive four test session notifications between 7 am and 10 pm, separated by at least 2 hr (see Figure 1, top). The ARC cognitive tasks, Grids, Prices, and Symbols (see Figure 1, bottom), were administered in a random order during each session.
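One way to draw notification times that satisfy the spacing constraint is sketched below. This is an illustration only: the paper does not describe the sampling algorithm used by ARC, and the function `schedule_sessions`, its parameters, and the "sample in the slack, then add back the gaps" approach are assumptions made for the example. Times are expressed in minutes after midnight.

```python
import random

def schedule_sessions(wake_min, bed_min, n_sessions=4, min_gap=120, rng=None):
    """Draw n_sessions times in [wake_min, bed_min] at least min_gap apart.

    Hypothetical sketch, not the ARC implementation. Sorted uniform draws
    are taken from the window shortened by the mandatory gaps, and the
    gaps are then added back so that consecutive times differ by at
    least min_gap minutes.
    """
    rng = rng or random.Random()
    slack = (bed_min - wake_min) - (n_sessions - 1) * min_gap
    if slack < 0:
        raise ValueError("waking window too short for the requested spacing")
    offsets = sorted(rng.uniform(0, slack) for _ in range(n_sessions))
    return [wake_min + off + i * min_gap for i, off in enumerate(offsets)]
```

For the example in the text, `schedule_sessions(7 * 60, 22 * 60)` returns four times between 7 am and 10 pm with at least 2 hr between neighbors.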

Figure 1. ARC design and cognitive tasks. Note. Top demonstrates if a participant reported waking up at 7 am and going to bed at 10 pm, they would receive four test session notifications between 7 am and 10 pm, separated by at least 2 hr. The ARC cognitive tasks, Grids, Prices, and Symbols are displayed on the bottom.

Grids is a spatial working memory task in which high resolution images of three common objects (key, smartphone, and pen) are displayed on a 5 × 5 grid, and participants are asked to remember the locations of the items. After encoding the locations of each item, participants perform a distractor task (identify Fs in grid of Es) before moving to the retrieval phase. At retrieval, participants are asked to tap the locations where the items were shown². Participants perform two trials during each test session (lasting approximately 30–40 s) and, across sessions, stimuli are placed at random locations to protect against retest effects. Scores reflect a Euclidean distance estimate, agnostic to item, such that a higher score indicates retrieval placement farther away from the encoded locations (i.e., higher score indicates worse performance; Sliwinski et al., 2018).
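To make the idea of an item-agnostic Euclidean distance score concrete, the sketch below computes one possible version of it. The exact matching rule used by ARC is not given here, so the choice to pair taps with encoded locations in the way that minimizes the summed distance is an assumption, as is the function name `grids_error`. Locations are (row, column) cells on the grid.

```python
from itertools import permutations
from math import dist

def grids_error(encoded, tapped):
    """Smallest summed Euclidean distance over all tap-to-target pairings.

    Hypothetical scoring sketch: because the score ignores which item
    was at which location, every pairing of taps with encoded cells is
    tried and the best (smallest) total distance is returned.
    """
    return min(
        sum(dist(e, t) for e, t in zip(encoded, perm))
        for perm in permutations(tapped)
    )
```

With three items there are only six pairings to check, so the exhaustive search is inexpensive. A score of 0 means every tap landed on an encoded cell, and larger values indicate worse performance, in line with the scoring direction described above.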

Prices is an associative memory task with a learning and recognition phase. In the learning phase, participants are shown 10 item–price pairs for 3 s per pair and asked to remember the items and their corresponding prices. Items are common shopping items (food and household supplies), and the prices are randomly assigned 3-digit prices containing no repeated digits and no more than two sequential digits. In the recognition phase, participants are presented with two prices and asked to choose which was shown with the item during the learning phase. The price choices are separated by at least $3.00 to avoid ceiling and floor effects (Hassenstab et al., 2020). To protect against retest and interference effects, 40 items, chosen without replacement, are never repeated within the same day, and item–price pairs are never re-presented over the 28 sessions. Trials last approximately 60 s and scores reflect the proportion of recognition trial errors such that higher scores indicate worse performance.

Symbols is a processing speed measure based on a task used by Sliwinski et al. (2018). Participants are shown three randomly assigned pairs of abstract shapes and asked to determine as quickly as possible which of two pairs match one of the three target pairs. To protect against retest effects, item pairs are randomly assigned for each session. Participants complete 12 trials during each session, lasting approximately 20–60 s (duration varied based on participants’ response times (RTs)). Scores reflect RTs on correct trials such that higher scores indicate worse performance. An “ARC composite score” was created in two steps. Z-scores for each task were calculated by subtracting the cohort’s mean score from raw scores and dividing by the cohort’s standard deviation. The Z-scores were then averaged together to form the ARC composite score. Similar to the individual measures, a higher ARC composite score indicated worse performance.
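The two-step composite can be sketched as follows. The function names and the use of the sample standard deviation (n − 1 denominator) are assumptions for illustration; the text does not state which standard deviation estimator was used. Each input list holds one aggregated score per participant, in the same participant order.

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardize scores against the cohort mean and (sample) SD."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def arc_composite(grids, prices, symbols):
    """Average the three task z-scores per participant (hypothetical helper).

    All three tasks are scored so that higher means worse, so no sign
    flip is needed before averaging.
    """
    per_task = [z_scores(task) for task in (grids, prices, symbols)]
    return [mean(person) for person in zip(*per_task)]
```

Because each task is standardized before averaging, tasks measured on very different scales (distances, error proportions, response times) contribute equally to the composite.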

Feasibility and tolerability measures

Technology familiarity was assessed with a novel measure described in Nicosia et al. (2021). Briefly, the assessment combined objective measurements of technology knowledge (technology-related icon recognition) and self-reported ratings of (1) the frequency with which they perform certain smartphone tasks and (2) how difficult it would be for them to perform various technology-related tasks. For the purposes of this study, we report participants’ technology icon recognition, average frequency of smartphone task performance, and average difficulty performing technology-related tasks (for more details see Nicosia et al., 2021).

ARC user experience was assessed with a 10-question survey using a 5-point Likert scale to rate aspects of user experience regarding installation, test instructions, frequency of testing, and overall tolerability. Objective measures of feasibility and tolerability included ARC adherence and drop-out rates. Adherence was defined as the number of completed test sessions divided by the total number of assessment sessions (e.g., a participant who completed 21 of 28 sessions would have a 75% adherence rate).

Cerebrospinal fluid collection and processing

Most participants underwent lumbar puncture (LP) to collect cerebrospinal fluid (CSF) following overnight fasting. Participants at the Knight ADRC undergo LP approximately every 3 years; however, CSF collection was postponed in March 2020 due to the pandemic, eliminating the possibility of acquiring more recent samples. Therefore, we limited the use of CSF data to those collected within 5 years of ARC testing (see Table 1; collected on average 2.64 +/- 1.11 years from the first ARC assessment). Twenty to thirty mL of CSF was collected in a 50 mL polypropylene tube via gravity drip using an atraumatic Sprotte 22-gauge spinal needle. CSF was kept on ice and centrifuged at low speed within 2 hr of collection. CSF was then transferred to another 50 mL tube. CSF was aliquoted at 500 µL into polypropylene tubes and stored at −80°C as previously described (Fagan et al., 2006). Prior to analysis, samples were brought to room temperature per manufacturer instructions. Samples were vortexed and transferred to polystyrene cuvettes for analysis. Concentrations of Aβ40, Aβ42, total tau (tTau), and tau phosphorylated at threonine 181 (pTau) were measured by chemiluminescent enzyme immunoassay using a fully automated platform (LUMIPULSE G1200, Fujirebio, Malvern, PA) according to manufacturer’s specifications. A single lot of reagents was used for all samples.

Table 1. Demographic data

^a Mean (SD); n (%).

^b Welch two sample t-test; Pearson’s Chi-squared test.

^c Gender and race were self-reported.

Neuroimaging

Neuroimaging data were required to be collected within 5 years of ARC (see Table 1; Amyloid positron emission tomography (PET) mean 2.59 +/- 1.04 years, Tau PET mean 2.50 +/- 0.96, and magnetic resonance imaging (MRI) mean 2.55 +/- 1.05 years from the first ARC assessment). Briefly, MRI data were acquired on 3T Siemens scanners and processed using Freesurfer (Fischl et al., 2004) to derive regional volumes and thicknesses. Volumes were adjusted for total intracranial volume (ICV) (see Raz et al., 2008) and a summary thickness composite was calculated (Singh et al., 2006).

Amyloid PET imaging was performed with either florbetapir (18F-AV-45) or Pittsburgh Compound B (PiB). Data were processed with an in-house pipeline using regions of interest derived from FreeSurfer (https://github.com/ysu001/PUP; Su et al., 2013). A summary standardized uptake value ratio (SUVR) measure was converted to the Centiloid scale (Su et al., 2018, 2019) in order to combine PiB and florbetapir data. Tau PET imaging with flortaucipir (18F-AV-1451) was summarized using the average SUVRs of the bilateral entorhinal cortex, amygdala, inferior temporal lobe, and lateral occipital cortex (Mishra et al., 2017). SUVRs used a cerebellar cortex reference and were partial volume corrected.

Statistical analyses

Statistical analyses were completed using R (v4.1.0). To characterize the reliability of ARC, descriptive statistics were examined for all ARC and conventional measures. Correlations were used to examine whether ARC captured age-related cognitive declines comparable to conventional cognitive measures. ARC test-retest reliability was assessed based on participants who completed follow-up testing ∼6 months (“visit 2”; on average 6.07 +/- 1.23 months between assessments) and ∼1 year later (“visit 3”; on average 11.84 +/- 0.84 months between assessments). Pearson correlation coefficients with an r of 0.80 to 0.90 were considered “good” reliability (Price et al., 2015). Intraclass correlations (ICCs), which show how strongly units within the same group resemble each other, were computed to examine test-retest reliability and between-person reliability such that ICCs between 0.75 and 0.90 indicate “good” reliability (Bruton et al., 2000). ARC and conventional cognitive measure correlations were used to examine construct validity. Finally, feasibility and tolerability were assessed by examining: (1) adherence and drop-out rates; (2) correlations between technology familiarity measures and ARC performance; and (3) descriptive statistics from an ARC user experience survey.
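As an illustration of what an intraclass correlation measures, the sketch below computes a one-way random-effects ICC, often written ICC(1,1), from repeated scores per participant. The analyses in this study were run in R, and the specific ICC form used is not stated, so both the choice of ICC(1,1) and the function name `icc_oneway` are assumptions made for the example.

```python
from statistics import mean

def icc_oneway(scores):
    """One-way random-effects ICC(1,1); hypothetical illustration.

    scores: list of per-participant lists, each holding k repeated
    measurements (e.g., [visit1, visit2]).
    """
    n, k = len(scores), len(scores[0])
    grand = mean([x for row in scores for x in row])
    row_means = [mean(row) for row in scores]

    # Between-person and within-person mean squares from a one-way ANOVA.
    ms_between = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(scores, row_means) for x in row
    ) / (n * (k - 1))

    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
```

The value approaches 1 when almost all score variance lies between participants rather than between a participant's repeated visits, which is the sense in which a high ICC indicates good test-retest reliability.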

Results

Participant characteristics

Of the 316 participants who completed at least one ARC session, 26 were removed due to either low-quality data or unacceptable rates of missing data (>75% missingness) resulting in a sample size of 290 participants (268 CDR 0s and 22 CDR 0.5s) ranging from 61 to 97 years of age. As shown in Table 1, all three ARC tasks showed good discrimination between CDR 0 and CDR 0.5 participants³. Additionally, ARC performance, as indexed by the ARC composite score, did not differ as a function of gender, t(181.46) = 0.63, p = 0.53, or race, t(28.096) = 1.92, p = 0.06, and was modestly associated with education, r = −0.18, p = 0.01.

Descriptive statistics

Table 2 shows the descriptive statistics for the ARC and conventional cognitive measures as well as adherence and drop-out rates. The t-tests comparing ARC task performance of CDR 0 and 0.5 individuals (significant ts 2.12–3.52) were comparable to comparisons with conventional cognitive measures (significant ts 2.17–4.96). CDR 0s and 0.5s in this sample did not differ on Number Span Forward, Number Span Backward, or the MINT. Adherence and drop-out rates did not differ as a function of CDR status (ts < 0.38).

Table 2. Descriptive statistics at ARC baseline

Note.

* Indicates p-value < 0.05.

** Indicates p-value < 0.01.

*** Indicates p-value < 0.001.

Between-person reliability

As mentioned above, aggregation of EMA scores across sessions boosts reliability compared to conventional “one-shot” approaches (Shiffman et al., 2008). Unconditional multilevel mixed models using restricted maximum likelihood were employed for each ARC task to compute between-person reliability scores (Raykov & Marcoulides, 2006; Sliwinski et al., 2018). The reliabilities of scores aggregated across ARC sessions were quite high: 0.81 for Prices, 0.90 for Grids, and 0.98 for Symbols (see Table 3). These reliabilities are based on 21 (75%) sessions of ARC assessments, which reflects the average number of sessions participants completed.

Table 3. ARC reliabilities for individual tasks

Note. ARC participants received 4 sessions/day for 7 days.

Next, we conducted follow-up analyses to determine how many sessions would be required to obtain reliabilities of aggregated scores that ranged from 0.80 to 0.90. Following Sliwinski et al. (2018), we fit a series of unconditional multilevel mixed models and calculated reliabilities. These results indicated that 19 sessions (or ∼5 days) of Prices, 9 sessions (or ∼2 days) of Grids, and 2 sessions (or ∼1 day) of Symbols are required to attain reliabilities greater than 0.80 (see Table 3 and Figure 2).
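The relationship between the number of aggregated sessions and between-person reliability can be illustrated with the standard variance-components formula, reliability = between-person variance / (between-person variance + within-person variance / number of sessions). The sketch below assumes this formula and uses made-up variance components; it does not use the study's estimates, and the function names are hypothetical.

```python
def aggregate_reliability(var_between, var_within, n_sessions):
    """Reliability of a mean score aggregated over n_sessions."""
    return var_between / (var_between + var_within / n_sessions)

def sessions_needed(var_between, var_within, target, max_sessions=1000):
    """Smallest number of sessions whose aggregate reaches the target.

    Returns None if the target is not reached within max_sessions.
    """
    for n in range(1, max_sessions + 1):
        if aggregate_reliability(var_between, var_within, n) >= target:
            return n
    return None
```

With illustrative variances of 1.0 between persons and 4.0 within persons, a single session has a reliability of 0.2, and `sessions_needed(1.0, 4.0, 0.85)` returns 23. Tasks with proportionally less within-person noise reach a given threshold in fewer sessions, which is the pattern described for Symbols relative to Prices.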

Figure 2. Between-person reliabilities for ARC tasks. Note. Between-person reliabilities for each ARC cognitive task. Following Sliwinski et al. (2018), a series of unconditional multilevel mixed models were fit to determine how many sessions would be required to obtain good reliability. Blue line indicates 0.85 reliability threshold.

Test-retest reliability

As of manuscript preparation, a subset of participants also completed testing ∼6 months (N = 185) and ∼1 year (N = 83) after their initial visit. Figure 3 displays test-retest reliability for the 6-month and 1-year follow-ups for the individual tasks and ARC composite score. ARC demonstrated high test-retest reliability for individual ARC tasks as well as the ARC composite score at both follow-ups (all ICCs > 0.85). Considering retest effects (Table 4), there were small but significant improvements from visit 1 to visit 2 on Prices, Symbols, and the ARC Composite, but not on Grids. There were no practice effects evident between visits 2 and 3, suggesting that practice effects diminish after completion of the first testing cycle. A detailed analysis of practice effects will be considered in future studies.
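As an illustration of how a test-retest ICC can be obtained from visit-level scores, the sketch below computes two standard single-measure forms from two-way ANOVA mean squares: a consistency form, ICC(3,1), and an absolute-agreement form, ICC(2,1). The text does not state which ICC form was used, so both are shown as assumptions. The function name and the example data are hypothetical.

```python
from statistics import mean


def icc_two_way(scores: list[list[float]]) -> tuple[float, float]:
    """Return (ICC(3,1) consistency, ICC(2,1) absolute agreement).

    scores[i][j] is the score of participant i at visit j.
    """
    n = len(scores)       # number of participants
    k = len(scores[0])    # number of visits
    grand = mean(x for row in scores for x in row)
    row_means = [mean(row) for row in scores]
    col_means = [mean(row[j] for row in scores) for j in range(k)]

    # Mean squares for participants (rows), visits (columns), and residual.
    ms_rows = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_cols = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)
    ss_err = sum(
        (scores[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n)
        for j in range(k)
    )
    ms_err = ss_err / ((n - 1) * (k - 1))

    icc_consistency = (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)
    icc_agreement = (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )
    return icc_consistency, icc_agreement


if __name__ == "__main__":
    # Hypothetical data: every participant shifts by the same amount at visit 2.
    example = [[1.0, 2.0], [2.0, 3.0], [3.0, 4.0], [4.0, 5.0]]
    print(icc_two_way(example))
```

The two forms differ in how they treat a uniform shift between visits, such as a practice effect. The consistency form ignores it, whereas the agreement form is lowered by it.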

Figure 3. ARC test-retest reliabilities at 6-month (top) and 1-year (bottom) follow-up.

Table 4. ARC test-retest

^a Mean (SD).

^b Welch two sample t-test.

Note. Values represent participants’ mean scores for that visit; values in parentheses represent standard deviations. Significant p-values indicate the presence of a practice effect.

Construct validity

As shown in Figure 4 (right), the ARC composite score was correlated with the global composite score created from the conventional measures (r = −0.53; this was also the case in the CDR 0 sample, r = −0.47), indicating good construct validity. Additionally, Figure 4 (left) displays correlations between ARC and conventional cognitive measures (raw scores), and the top row shows the correlations with age. ARC tasks showed similar correlations with age as the conventional cognitive measures and exhibited convergent validity such that measures were correlated within the same domains. Note that correlations between the conventional and ARC measures are negative because higher scores on the ARC tasks indicate worse performance, whereas higher scores on the conventional cognitive measures indicate better performance (except for the Trailmaking Test Parts A & B); thus, the negative correlations displayed in Figure 4 (left) are in the hypothesized direction. Specifically, the Prices task was correlated with conventional memory measures (WMS Associates Recall: r = −0.24, FCSRT free recall: r = −0.32, Craft Story immediate recall: r = −0.22, Craft Story delayed recall: r = −0.27), the Grids task was correlated with all of the conventional cognitive measures (r’s = −0.15 to −0.36), and the Symbols task was correlated with all the conventional cognitive measures but particularly the fluency tasks and the Number Symbol test (Category Fluency Animals: r = −0.36, Category Fluency Vegetables: r = −0.40, Verbal Fluency: r = −0.36, Number Symbol test: r = −0.57).

Figure 4. ARC, conventional, and AD biomarker correlations. Note. Correlations amongst ARC and conventional measures (raw scores) shown on the left (N = 282). Correlations of the ARC composite score (higher = worse) and global composite score (higher = better), and AD-related biomarkers are shown on the right (Ns = 146 for CSF measures, 212 for amyloid PET, 173 for tau PET, 175 for AD ROI cortical thickness, and 290 for hippocampal volume). Significant correlations (p < 0.05) are displayed with colored circles, non-significant correlations are blank. Because in-clinic and ARC measures have opposing directionality, the negative correlations amongst the conventional and ARC measures are in the hypothesized direction.

Criterion validity

Criterion validity of ARC was examined by comparing ARC and global composite score correlations with AD biomarkers. As shown in Figure 4 (right), the ARC composite score was correlated in the predicted directions with all AD biomarkers. All correlations remained significant after controlling for age, rs > 0.20, ps < 0.02, except for the relationships with the neurodegeneration and tauopathy measures, ps > 0.18. We also examined correlations between the ARC composite score and AD biomarkers with only CDR 0 participants. Correlations in the cognitively normal subsample (CDR 0 individuals) were weaker than in the full sample (see Supplemental Materials Figure 1), as expected, but were consistent with the magnitude of values seen in other studies which have explored such relationships (for example, see Papp et al., 2021 among others). Additionally, the correlations were comparable to, though slightly weaker than, correlations between the global composite score and AD biomarkers. Specifically, Fisher’s Z test indicated that, compared to the global composite score, all correlations with the ARC composite score were not significantly different except for the correlations with CSF pTau:Aβ42 (Z = −1.96, p = 0.049), Hippocampal Volume (Z = −1.99, p = 0.045), and PET Tau (Z = −2.20, p = 0.03), which were only marginally to slightly weaker. There were no significant differences in correlations between AD biomarkers and the two composite scores in the CDR 0 subsample.
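The general idea behind comparing two correlations with Fisher's r-to-z transformation can be sketched as follows. This is the textbook version for two independent samples. Because both composite scores here come from the same participants, a variant for dependent, overlapping correlations would typically apply, and the text does not specify which variant was used. The function names and input values are hypothetical and the sketch is not intended to reproduce the reported Z statistics.

```python
import math


def two_sided_p(z: float) -> float:
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def compare_independent_correlations(r1: float, n1: int, r2: float, n2: int) -> tuple[float, float]:
    """Fisher r-to-z comparison of two correlations from independent samples.

    Returns (z statistic, two-sided p-value). Assumes independence, which is
    a simplification relative to comparing correlations within one sample.
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)          # Fisher transformation
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))  # standard error of z1 - z2
    z = (z1 - z2) / se
    return z, two_sided_p(z)


if __name__ == "__main__":
    # Hypothetical correlations and sample sizes (assumption), not study values.
    z, p = compare_independent_correlations(0.30, 200, 0.50, 200)
    print(f"z = {z:.2f}, p = {p:.3f}")
```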

Feasibility and tolerability

Of the 290 participants included in the present analyses, a subset (N = 220) completed the technology familiarity survey. Figure 5 displays the correlations among age, adherence, the technology familiarity measures, and ARC performance. Greater technology-related icon recognition was associated with better performance on Grids (r = −0.16) and Symbols (r = −0.14), but not on Prices (r = −0.02). Self-reported frequency performing smartphone tasks was unrelated to ARC performance, but perceived difficulty performing technology tasks was related to worse performance on all ARC measures (r’s 0.17–0.24). Adherence was correlated with performance on all three ARC measures (though only weakly for Prices) and the ARC composite score, such that participants who completed more sessions tended to perform better on ARC.

Figure 5. Age, technology familiarity, and ARC performance correlations. Note. Of the 290 participants included in the present analyses, 220 completed the technology familiarity survey (see Nicosia et al., 2021), which assessed the frequency with which participants perform smartphone-related tasks, how difficult participants find various technology-related tasks, and how well participants could recognize technology-related icons. Significant correlations (p < 0.05) are displayed with colored circles whereas non-significant relationships are blank.

A subset of participants (N = 228) also completed a user experience survey after their first ARC visit⁴. As shown in Figure 6, participants reported an overall positive experience with the ARC application, and most reported that they preferred ARC over conventional assessments. Participants reported little difficulty installing the ARC app, were generally unconcerned about privacy, and indicated that completing 2 weeks of ARC testing per year would not be difficult.

Figure 6. ARC user experience survey results. Note. Of the 290 participants included in the present analyses, 228 completed the ARC user experience survey, which assessed participants’ attitudes towards their experience with the ARC application after their first week using it.

Finally, as shown in Tables 1 and 2, adherence rates were quite high at 81% and 79% for CDR 0 and 0.5 participants, respectively. Drop-out rates were low for both groups as well: 4.9% for CDR 0s and 4.5% for CDR 0.5s. The high adherence and low drop-out rates suggest that ARC was well tolerated by older adults, even those with very mild dementia.

Discussion

The present study demonstrates that EMA cognitive assessments conducted on individuals’ personal smartphones can be reliable and sensitive to age and AD biomarkers, and are well tolerated by older adults regardless of technology experience. There were several main findings: first, between-person reliability of the ARC tasks across the 7-day protocol all exceeded 0.85. Second, individual ARC tasks and the ARC composite score showed exceptionally good test-retest reliabilities at 6-month and 1-year follow-ups (ICCs > 0.85). Third, both the individual ARC tasks and the ARC composite score were correlated with conventional measures of the same domain (r’s = −0.22 to −0.57). The composite scores from ARC and conventional measures were also highly correlated (r = −0.53). Fourth, the ARC composite score showed similar validity to the global composite in predicting AD biomarkers. Finally, both cognitively normal older adults and individuals with very mild AD successfully participated in the ARC study remotely, without supervision, and had extremely low drop-out rates. Overall, the results of the present study suggest that high-frequency smartphone-based assessments are promising tools for assessing cognition in clinical studies of aging and neurodegenerative diseases.

Although classic neuropsychological tests, such as episodic memory and executive functioning tests, are regarded as the most sensitive to AD pathology, they were not designed for frequent assessment and can have poor reliability (Calamia et al., 2013). Using measures with suboptimal reliability can impact statistical power and necessitate larger sample sizes or increased measurement frequency. Our results suggest that a high-frequency EMA approach to cognitive assessments may help overcome these challenges. When averaged across sessions, all three ARC tests had excellent between-subject reliability (r’s > 0.85), consistent with Sliwinski et al. (2018). The results also demonstrated that good between-person reliabilities can be achieved with < 7 days of assessments (averaging across 5 days produced reliabilities > 0.80 for all ARC tasks). The Symbols test achieved excellent reliability in just 3–5 sessions, which is remarkable considering that each session requires ∼30–40 s to complete. Although conventional cognitive measures would also receive a boost in reliability if averaged across repeated assessments, it is impractical and burdensome to assess participants at a frequency sufficient to overcome suboptimal reliability. Using an EMA smartphone protocol, researchers can efficiently obtain repeated measurements to boost reliability.

Test-retest reliability studies in AD samples have indicated “adequate” to “excellent” reliability (e.g., Benedict et al., 1998; Woods et al., 2006) over intervals ranging from several days to several weeks. However, cohort studies are typically conducted annually and yield lower reliability estimates. Specifically, test-retest correlations for delayed memory tests, a cornerstone of AD clinical trials (Bateman et al., 2017; Donohue et al., 2014; Langbaum et al., 2014; Ritchie et al., 2017), can be particularly unsatisfactory, with reliabilities ranging from 0.50 to 0.75 (Calamia et al., 2013; Dikmen et al., 1999; Lo et al., 2012). The increased reliability demonstrated by high-frequency assessments like ARC could substantially reduce sample sizes needed in AD prevention RCTs (Dodge et al., 2015).

ARC demonstrated exceptionally high test-retest reliability for the individual ARC tasks and the ARC composite score at 6-month and 1-year follow-ups (all ICCs > 0.85). The Symbols test demonstrated exceptionally high test-retest reliability exceeding its paper and pencil equivalents (i.e., Wechsler Digit Symbol Substitution test and the Symbol-Digit Modalities test, which typically have good test-retest reliabilities; Calamia et al., 2013; Pereira et al., 2015). Test-retest reliability for the Prices test was also good but trailed behind the Symbols and Grids tests. Relatedly, a version of the Prices test demonstrated good validity and reliability in a recent EMA study of older adults (Thompson et al., 2022), but was also rated the most difficult and the least enjoyable of three cognitive tasks, reflecting the challenges of designing repeatable episodic memory measures that are reliable, feasible, and tolerable.

Our results also support the construct and predictive validity of ARC. ARC tasks exhibited convergent validity as evidenced by correlations with conventional cognitive measures (r’s −0.22 to −0.57). Similarly, the ARC composite score was correlated with the global composite score (r = −0.53). Albeit smaller than anticipated, the correlations observed here were comparable to, if not stronger than, correlations observed in other digital assessment studies including the Cambridge Neuropsychological Test Automated Battery (CANTAB; rs 0.14 to 0.39; Dorociak et al., 2021; Gills et al., 2019; Smith et al., 2013). Additionally, the individual ARC tasks and the ARC composite score showed comparable correlations with age as the conventional measures and global composite score. Given well-known associations between age and cognitive performance, these relationships provide evidence that ARC is a valid measure of cognitive aging.

ARC also demonstrated good predictive validity when assessing sensitivity to AD biomarkers. Worse ARC performance was associated with reduced cortical thickness and hippocampal volume (r’s = −0.18 and −0.19, respectively) and increased levels of amyloid and tau (as indexed by both PET and CSF measures; r’s = 0.11 to 0.29). These relationships were comparable, though smaller in magnitude, to AD biomarker correlations with conventional measures, suggesting that ARC captures biomarker burden similarly to conventional measures. Correlations in the cognitively normal subsample (CDR 0 individuals) were on par with other studies which have examined such relationships (Braak & Braak, 1991; Papp et al., 2021; Snitz et al., 2020; Van Strien et al., 2009).

Evaluation of feasibility and tolerability of a smartphone application for use in older adults is critical, and especially so for applications like ARC that require unsupervised daily interactions. Overall, adherence was excellent at 80.42%, exceeding that seen in many remote studies (Pratap et al., 2020) and similar to rates observed in other cognitive EMA studies (Sliwinski et al., 2018). A common concern regarding technology use in older adults is that of technology familiarity. Our results demonstrate that greater technology knowledge was associated with better processing speed and visual working memory task performance, but not memory performance. Interestingly, self-reported frequency of smartphone interactions was not related to ARC performance, but those who reported more difficulty interacting with technology tended to perform worse on all ARC measures. However, when the familiarity assessment results were compared to conventional cognitive measures (see Supplemental Materials Figure 2), similar patterns emerged even on nontechnology-related measures like story recall, number span, confrontation naming, and verbal fluency, suggesting that difficulty with technology may also reflect, to some extent, overall cognitive ability⁵. Finally, considering the high adherence rates, and the overall favorable ratings from the user experience survey, it appears that with adequate instruction and support, older adults are capable and motivated participants in smartphone studies of cognition.

Limitations and future considerations

The findings of this study should be considered in light of several limitations which may be addressed in future studies. First, although the benefits of EMA smartphone studies are clear, it can be unclear whether participants are fully engaging with the assigned tasks. To address this, participants are asked at the end of each session whether they were interrupted during the session. In the analyses presented here, sessions where participants reported being interrupted were removed. Similarly, many ambulatory assessments are limited when researchers do not collect additional contextual information. Participants were asked a battery of environmental questions at the end of each session, and future studies will investigate the impact of these factors on participants’ performance. Second, as noted in the Methods section, if an individual did not have a device which met study criteria, they were supplied a device. Since it is possible this could have introduced bias, several follow-up analyses were run to test for differences in age, technology familiarity, and ARC performance/adherence. As shown in Supplementary Materials Table 2, even though individuals who were supplied with a device were slightly older and less familiar with technology, there were no differences in CDR, ARC task performance, adherence, or AD biomarkers. Third, it is important to note that the Prices task lagged behind the Symbols and Grids tasks in terms of participants’ performance and the between-subjects reliability (possibly due to the difficulty and task demands). Nevertheless, the Prices task showed good reliability and was correlated with age and conventional memory measures. Finally, Knight ADRC participants consist of highly educated and primarily White older adults motivated to engage in extensive imaging and fluid biomarker studies. Future work is needed to determine the feasibility of ARC in more diverse populations.

Supplementary material

The supplementary material for this article can be found at https://doi.org/10.1017/S135561772200042X

Author contributions

AA, DAB, MJS, JCM, & JH conceptualized the study and acquired funding. AA, MT, SSS, & HW administered the project and supervised data collection. JN, AA, CX, & JH curated the data and conducted statistical analyses. All authors provided critical feedback on manuscript preparation and editing of revisions.

Funding statement

This work was supported by the National Institutes of Health Grants P30AG066444, P01AG03991, and P01AG026276 (PI Morris) and R01AG057840 (PI Hassenstab) and a grant from the BrightFocus Foundation A2018202S (PI Hassenstab). We would also like to thank the Shepard Family Foundation for their financial support.

Conflicts of interest

None.

Footnotes

¹ A computerized task developed and validated at the Knight ADRC that assesses similar constructs as the Wechsler Digit Symbol Substitution task.

² Two versions of the Grids task are included in the present analyses which differed slightly in their retrieval phase instructions. In the original version, participants were asked to tap the locations of the items from encoding. In the new version, participants are shown the items from encoding one at a time and asked to tap the location of that item from encoding. We used a scoring procedure that was agnostic to item such that scores reflect the shortest Euclidean distance between participants’ taps at retrieval and the encoded locations regardless of which item they were placing. Nevertheless, several t-tests were run to determine whether this change in task administration affected participants’ performance. Participants’ scores for the old and new versions did not significantly differ at visit 1, p = .07, or visit 2, p = .14.

³ See Supplemental Table 1 for information on intraindividual variability for the three ARC tasks.

⁴ Because participants completed the user experience survey voluntarily and a subset of 62 participants (21.38%) chose not to complete the survey, it is possible that the survey results may be influenced by selection bias. To test this possibility, we examined ARC task performance and adherence as a function of whether participants completed the user experience survey. These analyses indicated that there were no significant differences in either ARC task performance, ps > 0.24, or adherence, p = 0.82.

⁵ We explored the extent to which “overall cognitive ability,” as indexed by the conventional composite score, may be associated with ARC adherence. As shown in Supplemental Materials Figure 3, individuals who performed better on the conventional measures also showed better ARC adherence.

References

Barthélemy, N. R., Li, Y., Joseph-Mathurin, N., Gordon, B. A., Hassenstab, J., Benzinger, T. L., Buckles, V., Fagan, A. M., Perrin, R. J., Goate, A. M., Morris, J. C., Karch, C. M., Xiong, C., Allegri, R., Chrem Mendez, P., Berman, S. B., Ikeuchi, T., Mori, H., Shimada, H., . . . McDade, E., & the Dominantly Inherited Alzheimer Network. (2020). A soluble phosphorylated tau signature links tau, amyloid and the evolution of stages of dominantly inherited Alzheimer’s disease. Nature Medicine, 26, 398–407.

Bateman, R. J., Benzinger, T. L., Berry, S., Clifford, D. B., Duggan, C., Fagan, A. M., Fanning, K., Farlow, M. R., Hassenstab, J., McDade, E. M., Mills, S., Paumier, K., Quintana, M., Salloway, S. P., Santacruz, A., Schneider, L. S., Wang, G., & Xiong, C. (2017). The DIAN-TU next generation Alzheimer’s prevention trial: Adaptive design and disease progression model. Alzheimer’s & Dementia, 13, 8–19.

Benedict, R. H. B., Schretlen, D., Groninger, L., & Brandt, J. (1998). Hopkins verbal learning test – Revised: Normative data and analysis of inter-form and test-retest reliability. The Clinical Neuropsychologist (Neuropsychology, Development and Cognition: Section D), 12, 43–55.

Braak, H., & Braak, E. (1991). Neuropathological stageing of Alzheimer-related changes. Acta neuropathologica, 82, 239–259.

Bruton, A., Conway, J. H., & Holgate, S. T. (2000). Reliability: What is it, and how is it measured? Physiotherapy, 86, 94–99.

Calamia, M., Markon, K., & Tranel, D. (2012). Scoring higher the second time around: Meta-analyses of practice effects in neuropsychological assessment. The Clinical Neuropsychologist (Neuropsychology, Development and Cognition: Section D), 26, 543–570.

Calamia, M., Markon, K., & Tranel, D. (2013). The robust reliability of neuropsychological measures: Meta-analyses of test–retest correlations. The Clinical Neuropsychologist (Neuropsychology, Development and Cognition: Section D), 27, 1077–1105.

Desikan, R. S., Ségonne, F., Fischl, B., Quinn, B. T., Dickerson, B. C., Blacker, D., Buckner, R. L., Dale, A. M., Maguire, P., Hyman, B. T., Albert, M. S., & Killiany, R. J. (2006). An automated labeling system for subdividing the human cerebral cortex on MRI scans into gyral based regions of interest. Neuroimage, 31, 968–980.

Dikmen, S. S., Heaton, R. K., Grant, I., & Temkin, N. R. (1999). Test–retest reliability and practice effects of expanded Halstead–Reitan neuropsychological test battery. Journal of the International Neuropsychological Society, 5, 346–356.

Dodge, H. H., Zhu, J., Mattek, N. C., Austin, D., Kornfeld, J., & Kaye, J. A. (2015). Use of high-frequency in-home monitoring data may reduce sample sizes needed in clinical trials. PLoS One, 10, e0138095.

Donohue, M. C., Sperling, R. A., Salmon, D. P., Rentz, D. M., Raman, R., Thomas, R. G., Weiner, M., & Aisen, P. S. (2014). The preclinical Alzheimer cognitive composite: measuring amyloid-related decline. JAMA Neurology, 71, 961–970.

Dorociak, K. E., Mattek, N., Lee, J., Leese, M. I., Bouranis, N., Imtiaz, D., Doane, B. M., Bernstein, J. P. K., Kaye, J. A., & Hughes, A. M. (2021). The survey for memory, attention, and reaction time (SMART): Development and validation of a brief web-based measure of cognition for older adults. Gerontology, 67, 740–752.

Edgar, C. J., Vradenburg, G., & Hassenstab, J. (2019). The 2018 revised FDA guidance for early Alzheimer’s disease: Establishing the meaningfulness of treatment effects. The Journal of Prevention of Alzheimer’s Disease, 6, 223–227.

Fagan, A. M., Mintun, M. A., Mach, R. H., Lee, S.-Y., Dence, C. S., Shah, A. R., LaRossa, G. N., Spinner, M. L., Klunk, W. E., Mathis, C. A., DeKosky, S. T., Morris, J. C., & Holtzman, D. M. (2006). Inverse relation between in vivo amyloid imaging load and cerebrospinal fluid Aβ42 in humans. Annals of Neurology, 59, 512–519.

Fischl, B., & Dale, A. M. (2000). Measuring the thickness of the human cerebral cortex from magnetic resonance images. Proceedings of the National Academy of Sciences, 97, 11050–11055.

Fischl, B., Van Der Kouwe, A., Destrieux, C., Halgren, E., Ségonne, F., Salat, D. H., Busa, E., Seidmann, L. J., Goldstein, J., Kennedy, D., Caviness, D., Makris, N., Rosen, B., & Dale, A. M. (2004). Automatically parcellating the human cerebral cortex. Cerebral Cortex, 14, 11–22.

Food and Drug Administration. (2018). Early Alzheimer’s disease: Developing drugs for treatment: guidance for industry. Food and Drug Administration.

Gills, J. L., Glenn, J. M., Madero, E. N., Bott, N. T., & Gray, M. (2019). Validation of a digitally delivered visual paired comparison task: Reliability and convergent validity with established cognitive tests. GeroScience, 41, 441–454.

Güsten, J., Ziegler, G., Düzel, E., & Berron, D. (2021). Age impairs mnemonic discrimination of objects more than scenes: A web-based, large-scale approach across the lifespan. Cortex, 137, 138–148.

Hassenstab, J., Aschenbrenner, A. J., Balota, D. A., McDade, E., Lim, Y. Y., Fagan, A. M., Benzinger, T. L. S., Cruchaga, C., Goate, A. M., Morris, J. C., Bateman, R. J., & the Dominantly Inherited Alzheimer Network. (2020). Remote cognitive assessment approaches in the dominantly inherited Alzheimer network (DIAN) using digital technology to drive clinical innovation in brain-behavior relationships: A new era in neuropsychology. Alzheimer’s & Dementia, 16, e038144.

Hassenstab, J., Chasse, R., Grabow, P., Benzinger, T. L. S., Fagan, A. M., Xiong, C., Jasielec, M., Grant, E., & Morris, J. C. (2016). Certified normal: Alzheimer’s disease biomarkers and normative estimates of cognitive functioning. Neurobiology of Aging, 43, 23–33.

Hassenstab, J., Nicosia, J., LaRose, M., Aschenbrenner, A. J., Gordon, B. A., Benzinger, T. L., Xiong, C., & Morris, J. C. (2021). Is comprehensiveness critical? Comparing short and long format cognitive assessments in preclinical Alzheimer disease. Alzheimer’s Research & Therapy, 13, 1–14.

Lancaster, C., Koychev, I., Blane, J., Chinner, A., Chatham, C., Taylor, K., & Hinds, C. (2020). Gallery game: Smartphone-based assessment of long-term memory in adults at risk of Alzheimer’s disease. Journal of Clinical and Experimental Neuropsychology, 42, 329–343.

Langbaum, J. B., Hendrix, S. B., Ayutyanont, N., Chen, K., Fleisher, A. S., Shah, R. C., Barnes, L. L., Bennett, D. A., Tariot, P. N., & Reiman, E. M. (2014). An empirically derived composite cognitive test score with improved power to track and evaluate treatments for preclinical Alzheimer’s disease. Alzheimer’s & Dementia, 10, 666–674.

Lo, A. H. Y., Humphreys, M., Byrne, G. J., & Pachana, N. A. (2012). Test-retest reliability and practice effects of the Wechsler memory scale-III. Journal of Neuropsychology, 6, 212–231.

Mackin, R. S., Insel, P. S., Truran, D., Finley, S., Flenniken, D., Nosheny, R., Ulbright, A., Comacho, M., Harel, B., Maruff, P., & Weiner, M. W. (2018). Unsupervised online neuropsychological test performance for individuals with mild cognitive impairment and dementia: Results from the Brain Health Registry. Alzheimer’s & Dementia: Diagnosis, Assessment & Disease Monitoring, 10, 573–582.

Matar, E., Shine, J. M., Halliday, G. M., & Lewis, S. J. (2020). Cognitive fluctuations in Lewy body dementia: Towards a pathophysiological framework. Brain, 143, 31–46.

Mishra, S., Gordon, B. A., Su, Y., Christensen, J., Friedrichsen, K., Jackson, K., Hornbeck, R., Balota, D. A., Cairns, N. J., Morris, J. C., Ances, B. M., & Benzinger, T. L. S. (2017). AV-1451 PET imaging of tau pathology in preclinical Alzheimer disease: Defining a summary measure. Neuroimage, 161, 171–178.

Morris, J. C. (1993). The clinical dementia rating (CDR): Current version and scoring rules. Neurology, 43, 2412–2414.

Nicosia, J., Aschenbrenner, A. J., Adams, S., Tahan, M., Stout, S. H., Wilks, H., Balls-Berry, J. E., Morris, J. C., & Hassenstab, J. (2021). Bridging the technological divide: Stigmas and challenges with technology in clinical studies of older adults. Frontiers in Digital Health, 4, e880055.

Öhman, F., Hassenstab, J., Berron, D., Schöll, M., & Papp, K. V. (2021). Current advances in digital cognitive assessment for preclinical Alzheimer’s disease. Alzheimer’s & Dementia: Diagnosis, Assessment & Disease Monitoring, 13, e12217.

Papp, K. V., Rentz, D. M., Orlovsky, I., Sperling, R. A., & Mormino, E. C. (2017). Optimizing the preclinical Alzheimer’s cognitive composite with semantic processing: The PACC5. Alzheimer’s & Dementia: Translational Research & Clinical Interventions, 3, 668–677.

Papp, K. V., Samaroo, A., Chou, H. C., Buckley, R., Schneider, O. R., Hsieh, S., Soberanes, D., Quiroz, Y., Properzi, M., Schultz, A., García-Magariño, I., Marshall, G. A., Burke, J. G., Kumar, R., Snyder, N., Johnson, K., Rentz, D. M., Sperling, R. A., & Amariglio, R. E. (2021). Unsupervised mobile cognitive testing for use in preclinical Alzheimer’s disease. Alzheimer’s & Dementia: Diagnosis, Assessment & Disease Monitoring, 13, e12243.

Pereira, D. R., Costa, P., & Cerqueira, J. J. (2015). Repeated assessment and practice effects of the written symbol digit modalities test using a short inter-test interval. Archives of Clinical Neuropsychology, 30, 424–434.

Pratap, A., Neto, E. C., Snyder, P., Stepnowsky, C., Elhadad, N., Grant, D., Mohebbi, M. H., Mooney, S., Suver, C., Wilbanks, J., Mangravite, L., Heagerty, P. J., Areán, P., & Omberg, L. (2020). Indicators of retention in remote digital health studies: A cross-study evaluation of 100,000 participants. NPJ Digital Medicine, 3, 1–10.

Price, J. L., McKeel, D. W. Jr., Buckles, V. D., Roe, C. M., Xiong, C., Grundman, M., Hansen, L. A., Petersen, R. C., Parisi, J. E., Dickson, D. W., Smith, C. D., Davis, D. G., Schmitt, F. A., Markesbery, W. R., Kaye, J., Kurlan, R., Hulette, C., Kurland, B. F., & Morris, J. C. (2009). Neuropathology of nondemented aging: Presumptive evidence for preclinical Alzheimer disease. Neurobiology of Aging, 30, 1026–1036.

Price, P. C., Jhangiani, R. S., & Chiang, I. C. A. (2015). Reliability and validity of measurement. Research Methods in Psychology-2nd Canadian Edition.

Raykov, T., & Marcoulides, G. A. (2006). On multilevel model reliability estimation from the perspective of structural equation modeling. Structural Equation Modeling, 13, 130–141.

Raz, N., Lindenberger, U., Ghisletta, P., Rodrigue, K. M., Kennedy, K. M., & Acker, J. D. (2008). Neuroanatomical correlates of fluid intelligence in healthy adults and persons with vascular risk factors. Cerebral Cortex, 18, 718–726.

Ritchie, K., Ropacki, M., Albala, B., Harrison, J., Kaye, J., Kramer, J., Randolph, C., & Ritchie, C. W. (2017). Recommended cognitive outcomes in preclinical Alzheimer’s disease: Consensus statement from the European prevention of Alzheimer’s dementia project. Alzheimer’s & Dementia, 13, 186–195.

Sheehan, B. (2012). Assessment scales in dementia. Therapeutic Advances in Neurological Disorders, 5, 349–358.

Shiffman, S., Stone, A. A., & Hufford, M. R. (2008). Ecological momentary assessment. Annual Review of Clinical Psychology, 4, 1–32.

Singh, V., Chertkow, H., Lerch, J. P., Evans, A. C., Dorr, A. E., & Kabani, N. J. (2006). Spatial patterns of cortical thinning in mild cognitive impairment and Alzheimer’s disease. Brain, 129, 2885–2893.

Sliwinski, M. J. (2008). Measurement-burst designs for social health research. Social and Personality Psychology Compass, 2, 245–261.

Sliwinski, M. J., Mogle, J. A., Hyun, J., Munoz, E., Smyth, J. M., & Lipton, R. B. (2018). Reliability and validity of ambulatory cognitive assessments. Assessment, 25, 14–30.

Smith, P. J., Need, A. C., Cirulli, E. T., Chiba-Falek, O., & Attix, D. K. (2013). A comparison of the Cambridge automated neuropsychological test battery (CANTAB) with “traditional” neuropsychological testing instruments. Journal of Clinical and Experimental Neuropsychology, 35, 319–328.

Smyth, J. M., & Stone, A. A. (2003). Ecological momentary assessment research in behavioral medicine. Journal of Happiness Studies, 4, 35–52.

Snitz, B. E., Tudorascu, D. L., Yu, Z., Campbell, E., Lopresti, B. J., Laymon, C. M., Minhas, D. S., Nadkarni, N. K., Aizenstein, H. J., Klunk, W. E., Weintraub, S., Gershon, R. C., & Cohen, A. D. (2020). Associations between NIH Toolbox Cognition Battery and in vivo brain amyloid and tau pathology in non-demented older adults. Alzheimer’s & Dementia: Diagnosis, Assessment & Disease Monitoring, 12, e12018.

Sperling, R. A., Aisen, P. S., Beckett, L. A., Bennett, D. A., Craft, S., Fagan, A. M., Iwatsubo, T., Jack, C. R., Kaye, J., Montine, T. J., Park, D. C., Reiman, E. M., Rowe, C. C., Siemers, E., Stern, Y., Yaffe, K., Carrillo, M. C., Thies, B., Morrison-Bogorad, M., . . . Phelps, C. H. (2011). Toward defining the preclinical stages of Alzheimer’s disease: Recommendations from the National Institute on Aging-Alzheimer’s Association workgroups on diagnostic guidelines for Alzheimer’s disease. Alzheimer’s & Dementia, 7, 280–292.

Su, Y., D’Angelo, G. M., Vlassenko, A. G., Zhou, G., Snyder, A. Z., Marcus, D. S., Blazey, T. M., Christensen, J. J., Vora, S., Morris, J. C., Mintun, M. A., & Benzinger, T. L. (2013). Quantitative analysis of PiB-PET with FreeSurfer ROIs. PloS One, 8, e73377.

Su, Y., Flores, S., Hornbeck, R. C., Speidel, B., Vlassenko, A. G., Gordon, B. A., Koeppe, R. A., Klunk, W. E., Xiong, C., Morris, J. C., & Benzinger, T. L. (2018). Utilizing the Centiloid scale in cross-sectional and longitudinal PiB PET studies. NeuroImage: Clinical, 19, 406–416.

Su, Y., Flores, S., Wang, G., Hornbeck, R. C., Speidel, B., Joseph-Mathurin, N., Vlassenko, A. G., Gordon, B. A., Koeppe, R. A., Klunk, W. E., Jack, C. R., Farlow, M. R., Salloway, S., Snider, B. J., Berman, S. B., Roberson, E. D., Brosch, J., Jimenez-Velazques, I., van Dyck, C. H., . . . Benzinger, T. L. (2019). Comparison of Pittsburgh compound B and florbetapir in cross-sectional and longitudinal studies. Alzheimer’s & Dementia: Diagnosis, Assessment & Disease Monitoring, 11, 180–190.

Thompson, L., Harrington, K., Roque, N., Strenger, J., Correia, S., Jones, R., Salloway, S., & Sliwinski, M. (2022). A highly feasible, reliable, and fully remote protocol for mobile app-based cognitive assessment in cognitively healthy older adults. Alzheimer’s & Dementia: Diagnosis, Assessment & Disease Monitoring, 14, e12283. https://doi.org/10.1002/dad2.12283

Van Strien, N. M., Cappaert, N. L. M., & Witter, M. P. (2009). The anatomy of memory: An interactive overview of the parahippocampal–hippocampal network. Nature Reviews Neuroscience, 10, 272–282.

Weintraub, S., Besser, L., Dodge, H. H., Teylan, M., Ferris, S., Goldstein, F. C., Giordani, B., Kramer, J., Loewenstein, D., Marson, D., Mungas, D., Salmon, D., Welsh-Bohmer, K., Zhou, X-H., Shirk, S. D., Atri, A., Kukull, W. A., Phelps, C., & Morris, J. C. (2018). Version 3 of the Alzheimer disease centers’ neuropsychological test battery in the uniform data set (UDS). Alzheimer Disease and Associated Disorders, 32, 10.

Weintraub, S., Salmon, D., Mercaldo, N., Ferris, S., Graff-Radford, N. R., Chui, H., Cummings, J., DeCarli, C., Foster, N. L., Galasko, D., Peskind, E., Dietrich, W., Beekly, D. L., Kukull, W. A., & Morris, J. C. (2009). The Alzheimer’s disease centers’ uniform data set (UDS): The neuropsychological test battery. Alzheimer Disease and Associated Disorders, 23, 91.

Wilks, H. M., Aschenbrenner, A. J., Gordon, B. A., Balota, D. A., Fagan, A. M., Musiek, E., Balls-Berry, J., Benzinger, T. L. S., Cruchaga, C., Morris, J. C., & Hassenstab, J. (2021). Sharper in the morning: Cognitive time of day effects revealed with high-frequency smartphone testing. Journal of Clinical and Experimental Neuropsychology, 43, 825–837. https://doi.org/10.1080/13803395.2021.2009447

Woodford, H. J., & George, J. (2007). Cognitive assessment in the elderly: A review of clinical methods. QJM: An International Journal of Medicine, 100, 469–484.

Woods, S., Delis, D., Scott, J., Kramer, J., & Holdnack, J. (2006). The California verbal learning test – second edition: Test-retest reliability, practice effects, and reliable change indices for the standard and alternate forms. Archives of Clinical Neuropsychology, 21, 413–420.

Figure 1. ARC design and cognitive tasks. Note. The top panel shows that a participant who reported waking up at 7 am and going to bed at 10 pm would receive four test session notifications between 7 am and 10 pm, separated by at least 2 hr. The ARC cognitive tasks (Grids, Prices, and Symbols) are displayed on the bottom.

Table 1. Demographic data

Table 2. Descriptive statistics at ARC baseline

Table 3. ARC reliabilities for individual tasks

Figure 2. Between-person reliabilities for ARC tasks. Note. Between-person reliabilities are shown for each ARC cognitive task. Following Sliwinski et al. (2018), a series of unconditional multilevel mixed models was fit to determine how many sessions would be required to obtain good reliability. The blue line indicates the 0.85 reliability threshold.
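As a rough illustration of the calculation behind Figure 2, the sketch below assumes the standard variance-components formula for the reliability of a mean over n sessions, reliability = between-person variance / (between-person variance + within-person variance / n). The function names and the variance values in the usage note are hypothetical and are not taken from the ARC data or from the authors' analysis code; in practice the two variance components would come from an unconditional multilevel model.

```python
def between_person_reliability(var_between, var_within, n_sessions):
    """Reliability of a person-level mean computed over n_sessions sessions.

    Assumes the standard variance-components formula:
        var_between / (var_between + var_within / n_sessions)
    """
    if var_between <= 0 or var_within < 0 or n_sessions < 1:
        raise ValueError("need var_between > 0, var_within >= 0, n_sessions >= 1")
    return var_between / (var_between + var_within / n_sessions)


def sessions_needed(var_between, var_within, threshold=0.85, max_sessions=10_000):
    """Smallest number of sessions whose averaged score reaches the threshold."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must be strictly between 0 and 1")
    for n in range(1, max_sessions + 1):
        if between_person_reliability(var_between, var_within, n) >= threshold:
            return n
    raise ValueError("threshold not reached within max_sessions")


if __name__ == "__main__":
    # Hypothetical variance components, chosen only for illustration.
    vb, vw = 1.0, 1.0
    for n in (1, 4, 8):
        print(n, round(between_person_reliability(vb, vw, n), 3))
    print("sessions needed for 0.85:", sessions_needed(vb, vw))
```

With equal between- and within-person variance (a made-up case), a single session gives a reliability of 0.5, and averaging over more sessions shrinks the within-person term until the 0.85 threshold is crossed.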

Figure 3. ARC test-retest reliabilities at 6-month (top) and 1-year (bottom) follow-up.

Table 4. ARC test-retest

Figure 4. ARC, conventional, and AD biomarker correlations. Note. Correlations amongst ARC and conventional measures (raw scores) are shown on the left (N = 282). Correlations of the ARC composite score (higher = worse) and the global composite score (higher = better) with AD-related biomarkers are shown on the right (Ns = 146 for CSF measures, 212 for amyloid PET, 173 for tau PET, 175 for AD ROI cortical thickness, and 290 for hippocampal volume). Significant correlations (p < 0.05) are displayed with colored circles; non-significant correlations are blank. Because in-clinic and ARC measures have opposing directionality, the negative correlations amongst the conventional and ARC measures are in the hypothesized direction.

Figure 5. Age, technology familiarity, and ARC performance correlations. Note. Of the 290 participants included in the present analyses, 220 completed the technology familiarity survey (see Nicosia et al., 2021), which assessed how frequently participants perform smartphone-related tasks, how difficult participants find various technology-related tasks, and how well participants could recognize technology-related icons. Significant correlations (p < 0.05) are displayed with colored circles, whereas non-significant relationships are blank.

Figure 6. ARC user experience survey results. Note. Of the 290 participants included in the present analyses, 228 completed the ARC user experience survey, which assessed participants’ attitudes towards their experience with the ARC application after their first week using it.

Nicosia et al. supplementary material

Tables S1-S2 and Figures S1-S3

