Introduction
Ecological validity is typically conceptualized as a test’s ability to predict various aspects of daily functioning (Long, 1996; Sbordone, 1996), which is often, although by no means exclusively, operationalized as the ability to engage in basic and instrumental activities of daily living (IADLs). From a neurocognitive standpoint, IADLs rely on executive functioning (EF; for review, see Overdorp et al., 2016), that is, a set of higher-order neurocognitive processes necessary for the execution of goal-directed and future-oriented behavior (e.g., Lezak et al., 2012; Suchy, 2015). However, traditional neuropsychological tests of EF have been criticized for having poor ecological validity (e.g., Allain et al., 2014; Chevignard et al., 2008; Jovanovski et al., 2012; Longaud-Valès et al., 2016; Renison et al., 2012; Rosetti et al., 2018; Shimoni et al., 2012; Torralva et al., 2012; Valls-Serrano et al., 2018; Werner et al., 2009). This criticism is surprising, given that such tests have repeatedly shown effectiveness in predicting IADLs (e.g., Bell-McGinty et al., 2002; Boyle et al., 2004; Cahn-Weiner et al., 2002; Johnson et al., 2007; Karzmark et al., 2012; Nguyen et al., 2020; Perna et al., 2012; Putcha & Tremont, 2016; Sudo et al., 2015).
The apparently unwarranted criticism of traditional EF tests is likely related to inconsistent conceptualizations of the term “ecological validity.” Specifically, in addition to the notion that tests that predict real-world functioning are ecologically valid, ecological validity is sometimes conceptualized as a combination of both the test’s ability to predict functioning and the test’s resemblance to daily life (Franzen & Wilhelm, 1996). Because this latter characteristic is conspicuously lacking in traditional neuropsychological tests, there have been calls for the development of new tests that resemble the “real world” (e.g., Burgess et al., 2006; Spooner & Pachana, 2006). These calls led to the introduction of many face-valid measures, ranging from paper-and-pencil tests (e.g., Kenworthy et al., 2020; Torralva et al., 2012; Wilson, 1993; Zartman et al., 2013) to tests performed in real (e.g., Shallice & Burgess, 1991), mock (e.g., Chevignard et al., 2010; Lamberts et al., 2010; Rosenblum et al., 2015; Schmitter-Edgecombe et al., 2021), and virtual environments (e.g., Chicchi Giglioli et al., 2021; Josman et al., 2009; Jovanovski et al., 2012). However, translation of these measures into clinical practice has been limited, with only one such instrument, the Behavioural Assessment of the Dysexecutive Syndrome (BADS) battery (Wilson et al., 1998), currently in clinical use (Rabin et al., 2007). The BADS comprises six paper-and-pencil tasks designed to mimic daily life. Of these, the Modified Six Elements Test (MSET) is often considered the most sensitive to cognitive deficits (Burgess et al., 1998; Burgess et al., 2006; Emmanouel et al., 2014).
The MSET is modeled after the Multiple Errands Test (Shallice & Burgess, 1991) and is intended to approximate the demands of daily life. It has been shown to detect cognitive deficits in persons with mild cognitive impairment and Alzheimer’s dementia (Canali et al., 2007; Espinosa et al., 2009; Esposito et al., 2010; da Costa et al., 2022), Parkinson’s disease (Perfetti et al., 2010), brain injury (Emmanouel et al., 2014; Gilboa et al., 2019; Norris & Tate, 2000; Wilson et al., 1998), autism spectrum disorder (Hill & Bird, 2006; White et al., 2009), schizophrenia (Liu et al., 2011; Wilson et al., 1998), and substance use disorders (Valls-Serrano et al., 2018; Verdejo-García & Pérez-García, 2007). However, findings regarding the MSET’s ability to predict daily functioning have been mixed. Specifically, while some studies have demonstrated associations of the MSET with behavioral measures (Alderman et al., 2003; Chevignard et al., 2008; Conti & Brucki, 2018; Frisch et al., 2012) and with rating scales of IADLs or daily EF lapses (Allain et al., 2014; Burgess et al., 1998; Clark et al., 2000; Emmanouel et al., 2014; Jovanovski et al., 2012; Lamberts et al., 2010; Renison et al., 2012; Rochat et al., 2009), others have yielded null results (e.g., Bertens et al., 2016; Gilboa et al., 2014; Jovanovski et al., 2012; Norris & Tate, 2000; Romundstad et al., 2022; Roy et al., 2015; Schaeffer et al., 2022).
Despite this mixed evidence, the MSET is routinely described as “ecologically valid,” implying that its ability to predict daily life has been well documented (e.g., Espinosa et al., 2009; O’Shea et al., 2010; de Almeida et al., 2014; Spitoni et al., 2018; Verdejo-García & Pérez-García, 2007; Wilson et al., 1998).
In summary, while traditional EF tests have a large body of evidence supporting their ability to predict various functional outcomes, they are nevertheless criticized for having poor ecological validity, ostensibly due to their lack of face validity. In contrast, the MSET is routinely, if not universally, described as an ecologically valid measure, even though support for its ability to predict functional outcomes is mixed. In addition, by virtue of being widely deemed ecologically valid, the MSET is also deemed inherently superior to traditional EF tests (Burgess et al., 2006). The purpose of the present study was twofold. First, given the inconsistent findings, we aimed to comprehensively test the ability of the MSET to predict daily functioning, using three different modalities of IADL assessment: self-report, lab-based behavioral assessment, and independent performance at home. Second, given the common impression that tests described as “ecologically valid” are inherently superior to traditional tests of EF, we compared the MSET to a traditional EF measure as a predictor of IADLs. To these ends, we administered the MSET and subtests from the Delis-Kaplan Executive Function System (D-KEFS) battery to community-dwelling older adults. Participants also completed the Lawton IADL questionnaire (Lawton & Brody, 1969), the Timed Instrumental Activities of Daily Living test (TIADL; Owsley et al., 2002), and a three-week protocol of IADL tasks completed independently at home (Brothers & Suchy, 2021; Suchy et al., 2022). Given that we have previously shown that face validity in and of itself does not improve a test’s ability to predict IADLs (Suchy et al., 2022; Ziemnik & Suchy, 2019), we hypothesized that the MSET would predict IADLs in all three modalities, but that the D-KEFS would account for IADL variance beyond the MSET.
Method
Participants
Participants were 100 older adults recruited into the DAILIES study, which examines the impact of contextual factors on daily functioning (see Brothers & Suchy, 2022). For inclusion, participants needed to be at least 60 years of age, living independently, and, per self-report, not previously diagnosed with dementia, mild cognitive impairment, or other significant neurological disorders (e.g., essential tremor, stroke). Participants were excluded if they self-reported color-blindness, uncorrected hearing or visual impairments that would preclude task performance, or fewer than eight years of formal education, or if they were not fluent and literate in English. Seven participants were excluded due to missing data on primary variables, for a final sample of 93 participants (69% female). Participants were primarily non-Hispanic White (89%), with 5.4% self-reporting being Hispanic/Latine and 5.4% declining to disclose ethnicity. Additionally, 84% were right-handed, 58% lived with a spouse/partner, and 80% were retired. See Table 1 for additional sample characteristics. Approximately 50 participants from the present sample were included in previous studies (Brothers & Suchy, 2021; Suchy et al., 2022), but the MSET was not examined in those studies.
Note: N = 93; DRS-2 = Dementia Rating Scale, Second Edition; GDS = Geriatric Depression Scale; SD = standard deviation. For three participants with missing GDS scores, the missing values were replaced with the sample mean.
Procedures
Participants were screened over the telephone. Eligible participants completed approximately four hours of baseline testing, including self-report and cognitive measures used for the larger study. At the end of testing, participants were given instructions and practice items for the at-home assessment of IADLs. After three weeks of completing at-home IADL tasks, participants returned for debriefing and, if interested, feedback about their overall cognitive/psychiatric functioning. Participants were reimbursed $10 per hour for the baseline visit, $20 for the feedback visit, and $4 for each at-home task. The study was approved by the University of Utah Institutional Review Board and was conducted in accordance with the Helsinki Declaration. P values < .05, two-tailed, were considered statistically significant.
Measures
Characterizing the sample
To characterize the participants’ general cognitive status and depressive symptoms, we used the Dementia Rating Scale, Second Edition (DRS-2; Jurica et al., 2001) and the 30-item version of the Geriatric Depression Scale (GDS; Yesavage, 1988), respectively. Three participants had missing GDS scores, which were replaced with the sample mean.
Modified six elements test (MSET)
The MSET (Wilson et al., 1998) is modeled after the Multiple Errands Test (Shallice & Burgess, 1991) and is designed to rely on cognitive processes needed in daily life, including meta-tasking, initiation, prospective memory, and self-monitoring. The MSET requires examinees to perform six tasks within 10 minutes while adhering to certain rules. The tasks include dictating responses to two story prompts, solving and recording answers to two sets of simple arithmetic problems, and recording answers to two sets of object-naming problems. Examinees are instructed that it is not possible to complete all six tasks within the allotted time but that they should (a) complete at least some portion of each task and (b) avoid completing two tasks of the same type in a row. Thus, examinees must spontaneously switch among tasks in accordance with the rules while avoiding running out of time. The total score is the number of tasks attempted (a maximum of six) minus (a) the number of rule breaks and (b) one point for inefficient use of time (defined as spending more than 271 s on any one task). Total possible scores range from zero to six. Prior research has reported low test-retest reliabilities, ranging from .43 to .48 (Bertens et al., 2016; Jelicic et al., 2001), as is common for many tests of EF (Calamia et al., 2013; Suchy & Brothers, 2022), but high interrater reliability (r = .88; Wilson et al., 1998).
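To make the scoring rule concrete, the arithmetic can be expressed as a brief computation. The sketch below is purely illustrative; the function and variable names are ours and are not taken from the BADS manual.

```python
# Illustrative sketch of the MSET scoring rule described above.
# Function and variable names are ours, not from the BADS manual.

def mset_total_score(tasks_attempted: int, rule_breaks: int,
                     max_seconds_on_one_task: float) -> int:
    """Tasks attempted (0-6) minus rule breaks, minus one point if any single
    task took more than 271 seconds; bounded to the stated 0-6 score range."""
    inefficiency_penalty = 1 if max_seconds_on_one_task > 271 else 0
    score = tasks_attempted - rule_breaks - inefficiency_penalty
    return max(0, min(6, score))

# Example: five tasks attempted, one rule break, longest task 300 s -> score of 3
print(mset_total_score(5, 1, 300))
```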
Delis-Kaplan executive function system (D-KEFS)
The D-KEFS is a battery of traditional EF tasks with low face validity. We generated a composite from four timed subtests (Trail Making Test, Verbal Fluency, Design Fluency, and Color-Word Interference; Delis et al., 2001), consistent with prior research (e.g., Franchow & Suchy, 2015, 2017). The composite was generated from scores designated as “primary” in the test manual. First, raw subtest scores were converted to scaled scores based on published norms (Delis et al., 2001). Next, we generated a single score for each subtest by averaging across the scores from the relevant executive conditions within that subtest: one Trail Making Test condition (number-letter switching completion time), three Design Fluency conditions (number correct in filled dots, empty dots, and switching), three Verbal Fluency conditions (number correct in letter, category, and category switching), and two Color-Word Interference conditions (interference and interference-switching completion times). We then averaged across the four subtests to generate the final D-KEFS composite. Cronbach’s alpha in this sample was .78. Test-retest reliability was not assessed in the present sample but has previously been reported at .90 (Suchy & Brothers, 2022).
Because performance on timed EF measures is influenced by lower-order processes (e.g., graphomotor speed, visual scanning; Stuss & Knight, 2002; Suchy, 2015), we also generated a lower-order process composite. Specifically, we averaged the scaled scores of subtest conditions designed to isolate lower-order processes, as defined by the D-KEFS manual (Delis et al., 2001): four Trail Making Test conditions (visual scanning, number sequencing, letter sequencing, and motor speed) and two Color-Word Interference conditions (color naming and word reading completion times). This composite was used as a covariate to help isolate the EF construct, as done in prior research (e.g., Franchow & Suchy, 2015, 2017; Kraybill & Suchy, 2011; Kraybill et al., 2013). Cronbach’s alpha in this sample was .76. Because of the heavy speed demands of these tasks, we refer to this variable as “Processing Speed” below.
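As a minimal sketch of how the two composites described in the preceding paragraphs can be assembled (illustrative only; the condition abbreviations and scaled scores below are hypothetical, not study data):

```python
import numpy as np

# Hypothetical scaled scores (normative mean of 10) for one examinee.
# Dictionary keys are our own abbreviations for the conditions listed above.
executive_conditions = {
    "TMT": [9],            # number-letter switching
    "DF":  [11, 10, 12],   # filled dots, empty dots, switching
    "VF":  [10, 9, 11],    # letter, category, category switching
    "CWI": [8, 10],        # interference, interference/switching
}
speed_conditions = [10, 11, 9, 12, 10, 11]  # TMT visual scanning, number seq.,
                                            # letter seq., motor speed;
                                            # CWI color naming, word reading

# One score per subtest (average of that subtest's executive conditions),
# then average the four subtest scores into the D-KEFS EF composite.
subtest_means = [np.mean(scores) for scores in executive_conditions.values()]
dkefs_composite = np.mean(subtest_means)

# Processing Speed composite: simple average of the lower-order conditions.
processing_speed = np.mean(speed_conditions)

print(round(dkefs_composite, 2), round(processing_speed, 2))
```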
Self-reported IADLs
Self-reported IADLs were assessed using the Lawton IADL scale (Lawton & Brody, 1969). Individuals rate their level of independence (on a three-point scale) in seven IADL domains. The scale has been extensively validated (e.g., Mariani et al., 2008; Ng et al., 2006), with a test-retest reliability reported at .85 (Lawton & Brody, 1969). Internal consistency could not be calculated in this sample, as some items lacked variability, as can be expected in high-functioning samples. Higher scores on this scale indicate fewer problems. Hereafter, we call this variable “IADLs-Report.”
Lab-based IADLs
Participants completed the performance-based Timed Instrumental Activities of Daily Living test (TIADL; Owsley et al., 2002), which comprises tasks related to communication (e.g., finding a telephone number in a phone book), finances (e.g., making change), food (e.g., reading ingredients on cans of food), shopping (e.g., finding food items on a shelf), and medication management (e.g., reading instructions on medicine bottles). Completion times for the five tasks were converted to z-scores based on the current sample and then averaged to create a speed composite. Errors across the five tasks were summed and then also converted to a z-score based on the current sample. The speed and error composites were then averaged to generate an overall performance score for the TIADL. Cronbach’s alpha in this sample was .73. Test-retest reliability was not available for this sample but was previously reported at .85 (Owsley et al., 2002). Higher scores on this composite indicate poorer performance (i.e., more time spent and/or a greater number of errors). Hereafter, we call this composite “IADLs-Lab.”
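The sketch below illustrates the composite computation just described, using hypothetical completion times and error counts rather than study data:

```python
import numpy as np

# Hypothetical completion times (seconds) for the five TIADL tasks;
# rows = participants, columns = tasks. Values are made up for illustration.
times = np.array([
    [30, 45, 20, 60, 25],
    [40, 50, 22, 75, 30],
    [28, 40, 18, 55, 24],
])
errors = np.array([0, 2, 1])  # errors summed across the five tasks, per person

# z-score each task's completion time within the sample, then average across tasks.
speed_z = (times - times.mean(axis=0)) / times.std(axis=0, ddof=1)
speed_composite = speed_z.mean(axis=1)

# z-score the summed errors within the sample.
error_z = (errors - errors.mean()) / errors.std(ddof=1)

# Average the two composites; higher values = slower and/or more errors.
iadls_lab = (speed_composite + error_z) / 2
print(np.round(iadls_lab, 2))
```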
Home-based IADLs
To assess participants’ IADLs at home, we used the Daily Assessment of Independent Living and Executive Skills (DAILIES) protocol (Brothers & Suchy, 2022). The DAILIES asks participants to complete brief tasks that resemble typical IADLs (e.g., paying utility bills, canceling a doctor’s appointment, filling out a rebate form) six days a week for three weeks. Participants must complete the tasks during specified timeframes (e.g., 9:00 to 11:00 AM) that vary each day to resemble real-world demands, and they communicate about task completion with the researchers via email, telephone, or postal mail (again varied daily to mimic typical real-life demands). Tasks are scored for timeliness (one point if a response is provided on the correct day and one point if the response is provided during the allotted timeframe, for a total of two possible points) and accuracy (scores ranging from one to seven depending on task complexity). The scores from each task are summed to generate the total DAILIES score (possible maximum of 93 points). Higher scores indicate better performance. Internal consistency was not calculated because IADLs-Home is a “formative” variable intended to provide a sum total of correctly completed tasks during the given timeframe, in contrast to “reflective” variables, which are intended to “reflect” a construct (Kievit et al., 2011). Hereafter, we call this variable “IADLs-Home.”
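For illustration, the scoring of a single DAILIES task can be expressed as follows; the function and argument names are ours and are not part of the protocol materials:

```python
# Illustrative scoring of a single DAILIES task, following the rules above.
# Function and argument names are ours, not from the DAILIES protocol.

def dailies_task_score(on_correct_day: bool, within_timeframe: bool,
                       accuracy_points: int) -> int:
    """Timeliness: 1 point for responding on the correct day plus 1 point for
    responding within the allotted timeframe; accuracy: 1-7 points by complexity."""
    timeliness = int(on_correct_day) + int(within_timeframe)
    return timeliness + accuracy_points

# Example: completed on the right day but outside the window, accuracy 5 of 7.
print(dailies_task_score(True, False, 5))  # -> 6
# The total DAILIES score is the sum of such task scores (maximum of 93 points).
```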
Results
Preliminary analyses
Score distribution
All variables were examined for outliers and normality. IADLs-Home had one outlier and IADLs-Lab had two outliers, which were remedied via Winsorization. IADLs-Report, MSET, GDS, and DRS-2 exhibited skew that was remedied via log-transformation. After these procedures, all variables were normally distributed (all skewness values < 1), except for the MSET, which still evidenced slight skew (skewness = 1.53). Thus, we conducted supplementary non-parametric analyses with the MSET.
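The sketch below illustrates the two remedies in generic form, with made-up values and arbitrary Winsorization limits; it is not meant to reproduce the study’s actual parameters:

```python
import numpy as np
from scipy import stats
from scipy.stats import mstats

# Hypothetical vector with one extreme high outlier.
x = np.array([3., 4., 4., 5., 5., 6., 25.])

# Winsorization: pull extreme high values in toward the rest of the distribution
# (here, roughly the top 15% of observations; limits are illustrative only).
x_winsorized = np.asarray(mstats.winsorize(x, limits=[0, 0.15]))

# Log-transformation to reduce positive skew (a constant is added in case of zeros).
x_logged = np.log(x + 1)

# Compare skewness before and after each remedy.
print(stats.skew(x), stats.skew(x_winsorized), stats.skew(x_logged))
```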
Debriefing
Debriefing forms were available for 81 participants. The majority of participants (91.3%) felt the DAILIES tasks were similar to typical tasks they complete in daily life (i.e., responding ‘agree’ or ‘strongly agree’ to this item).
Descriptives and zero-order correlations
Descriptives for all dependent and independent variables are presented in Table 2, and zero-order correlations of dependent and independent variables with potential confounds are presented in Table 3. As seen there, age, education, general cognitive status, and processing speed were all associated with at least some of the dependent or independent variables. Additionally, we examined correlations among the three IADL variables. Interestingly, while IADLs-Lab and IADLs-Report were correlated (p = .023), IADLs-Home was unrelated to either lab-based or self-reported IADLs (p values > .200). The three IADL variables were therefore examined individually in all analyses.
Note: N = 93. For variables that were normalized via transformation or log-transformation, the transformed scores are presented in the table, as indicated in variable names. D-KEFS = Delis-Kaplan Executive Function System composite score; IADLs-Home = Daily Assessment of Independent Living and Executive Skills (DAILIES) total score; IADLs-Lab = Timed Instrumental Activities of Daily Living (TIADLs) total score; IADLs-Report = Lawton Instrumental Activities of Daily Living raw score; MSET = Modified Six Elements Test.
Note: N = 93. For variables that were normalized via transformation or log-transformation, the normalized scores were used in analyses, as indicated in variable names. DRS-2 = Dementia Rating Scale, Second Edition, raw score; GDS = Geriatric Depression Scale; D-KEFS = Delis-Kaplan Executive Function System composite score; IADLs-Report = Lawton Instrumental Activities of Daily Living raw score; IADLs-Lab = Timed Instrumental Activities of Daily Living (TIADLs) total score; IADLs-Home = Daily Assessment of Independent Living and Executive Skills (DAILIES) total score; MSET = Modified Six Elements Test. Non-parametric correlations (Spearman’s rho) for the MSET, which exhibited a slight skew, are presented in parentheses. Sex was coded 1 = female, 0 = male (thus, women reported fewer IADL problems on self-report).
* p < .05; ** p < .01, *** p < .001.
Principal analyses
Univariate associations
Zero-order correlations between the dependent and independent variables are presented in Table 4. As seen, D-KEFS was associated with all three IADL variables; contrary to expectation, MSET was associated only with IADLs-Home.
Note: N = 93. For variables that were normalized via transformation or log-transformation, the normalized scores were used in analyses, as indicated in variable names. D-KEFS = Delis-Kaplan Executive Function System composite score; IADLs-Report = Lawton Instrumental Activities of Daily Living raw score; IADLs-Lab = Timed Instrumental Activities of Daily Living (TIADLs) total score; IADLs-Home = Daily Assessment of Independent Living and Executive Skills (DAILIES) total score; MSET = Modified Six Elements Test. Non-parametric correlations (Spearman’s rho) for the MSET, which exhibited a slight skew, are presented in parentheses.
* p < .05; **p < .01.
Pitting D-KEFS and MSET against each other
Because both the D-KEFS and the MSET showed univariate associations with IADLs-Home, we examined whether each predicted IADLs-Home beyond the other. Additionally, even though the MSET was unrelated to IADLs-Report and IADLs-Lab, we nevertheless examined whether the D-KEFS predicted these variables beyond the MSET. Thus, we ran three linear regressions, using IADLs-Home, IADLs-Report, and IADLs-Lab as dependent variables and the MSET and D-KEFS as simultaneous predictors. As seen in Table 5, the D-KEFS predicted all three IADL variables beyond the MSET, whereas the MSET did not contribute variance beyond the D-KEFS.
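As an illustration of this type of model comparison (simultaneous predictors plus a hierarchical check of incremental variance), the sketch below uses a hypothetical data frame; the column names and values are ours, not the study data:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data frame; column names and values are illustrative only.
df = pd.DataFrame({
    "iadls_home": [70, 82, 65, 90, 78, 85, 60, 88],
    "mset":       [4, 6, 3, 6, 5, 6, 2, 5],
    "dkefs":      [9.5, 11.2, 8.1, 12.0, 10.3, 11.8, 7.4, 10.9],
})

# Simultaneous model with both predictors, as in the regressions described above.
full = smf.ols("iadls_home ~ mset + dkefs", data=df).fit()

# Hierarchical comparison: variance accounted for by D-KEFS beyond the MSET.
reduced = smf.ols("iadls_home ~ mset", data=df).fit()
delta_r2 = full.rsquared - reduced.rsquared

print(full.params)
print(round(delta_r2, 3))
```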
Note: N = 93. IADL variables used in analyses were normalized as indicated in variable names. MSET = Modified Six Elements Test (log-transformed variable used in analyses); D-KEFS = Delis-Kaplan Executive Function System composite score; IADLs-Home = Home-based performance of IADLs; IADLs-Report = Lawton Instrumental Activities of Daily Living; IADLs-Lab = Timed Instrumental Activities of Daily Living (TIADLs) total score. In corresponding hierarchical models for IADLs-Home, IADLs-Lab, and IADLs-Report, the D-KEFS accounted for 12%, 25%, and 9% of variance beyond the MSET, respectively.
Effects of covariates
We next examined whether the D-KEFS still predicted the IADL variables when the confounds identified in Table 3 were included as covariates. We ran three general linear regressions, again using the three IADL variables as dependent variables, the D-KEFS as the predictor, and age, education, GDS, DRS-2, and Processing Speed as covariates. As seen in Table 6, the D-KEFS again emerged as a unique predictor of IADLs across all three modalities.
Note: N = 93. IADL variables used in analyses were normalized as indicated in variable names. DRS-2 = Dementia Rating Scale, Second Edition, raw score; GDS = Geriatric Depression Scale; D-KEFS = Delis-Kaplan Executive Function System composite score; IADLs-Report = Lawton Instrumental Activities of Daily Living; IADLs-Lab = Timed Instrumental Activities of Daily Living (TIADLs) total score; IADLs-Home = Daily Assessment of Independent Living and Executive Skills (DAILIES) total score.
Supplementary analyses
Individual D-KEFS subtests
Because the MSET variable was based on a single test, whereas the D-KEFS was a composite of four subtests, one could argue that the D-KEFS had an unfair advantage over the MSET due to a broader range of sampled processes and higher reliability. To address this issue, we examined partial correlations of the four individual D-KEFS subtests with the three IADL variables, controlling for the MSET. As seen in Table 7, all but three of these correlations were statistically significant, illustrating that even individual traditional EF tests, with narrower scope and lower reliability than a composite, tend to outperform the MSET.
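The sketch below illustrates the partial-correlation approach (residualizing both the subtest and the IADL variable on the MSET before correlating them), using randomly generated data rather than study data; all names are ours:

```python
import numpy as np
import pandas as pd

# Randomly generated stand-in data; column names are ours for illustration only.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(93, 3)), columns=["tmt", "iadls_home", "mset"])

def partial_corr(x: pd.Series, y: pd.Series, covariate: pd.Series) -> float:
    """Correlate the residuals of x and y after regressing each on the covariate."""
    def residuals(v, c):
        beta = np.polyfit(c, v, deg=1)       # simple regression of v on c
        return v - np.polyval(beta, c)       # residualized scores
    rx = residuals(x.to_numpy(), covariate.to_numpy())
    ry = residuals(y.to_numpy(), covariate.to_numpy())
    return np.corrcoef(rx, ry)[0, 1]

# Partial correlation of a "subtest" with an "IADL" variable, controlling for "MSET".
print(round(partial_corr(df["tmt"], df["iadls_home"], df["mset"]), 3))
```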
Note: N = 93; N = 90 for GDS. IADL variables used in analyses were normalized as indicated in variable names. D-KEFS = Delis-Kaplan Executive Function System; DF = Design Fluency; CWI = Color-Word Interference; TMT = Trail Making Test; VF = Verbal Fluency; IADLs-Report = Lawton Instrumental Activities of Daily Living; IADLs-Lab = Timed Instrumental Activities of Daily Living (TIADLs) total score; IADLs-Home = Daily Assessment of Independent Living and Executive Skills (DAILIES) total score; MSET = Modified Six Elements Test.
* p < .05; **p < .01; ***p < .001.
Homogenizing the sample
In the present sample, two participants’ DRS-2 scores fell below 123, the cutoff for normal performance (Jurica et al., 2001). To ensure that the results were not driven by these two participants, we reran all principal analyses with them removed. The correlation between the MSET and IADLs-Home was no longer significant, Spearman’s rho = −.201, p = .055. In contrast, the D-KEFS maintained all significant results reported in the prior analyses (all p values < .05).
Discussion
The aims of the present study were to empirically examine the widely held assumptions that MSET performance predicts daily IADL functioning and that the MSET’s clinical utility is superior to that of traditional EF measures. To these ends, we administered the MSET, four subtests from the D-KEFS battery, and three measures of IADLs to a sample of 93 community-dwelling older adults. IADLs were assessed via three modalities: self-report, lab-based behavioral tasks, and home-based tasks completed over three weeks. The key findings are that (a) the MSET predicted performance of IADL tasks at home, (b) the D-KEFS was associated with IADLs in all three assessment modalities, and (c) the D-KEFS accounted for variance in IADLs beyond the MSET, as well as beyond potential demographic, cognitive, and psychiatric confounds, whereas the MSET did not contribute beyond the D-KEFS.
MSET and ecological validity
The present results are consistent with prior research in that they provide somewhat equivocal, or “soft,” evidence of MSET’s ability to predict daily functioning. Specifically, while the MSET did predict how participants performed IADLs at home, it was not associated with either of the other two IADL measures. Since lab-based and self-reported IADL measures were not related to IADLs performed at home, it is likely that they reflected different aspects of functioning, suggesting that MSET may only tap into a subset of IADL capacity. For example, the home-based IADL protocol was less structured and required greater use of prospective memory than the other IADL measures; similarly, MSET is intended to be less structured and rely more heavily on prospective memory, possibly explaining its association with the home-based IADLs and the lack of association with the other two IADL measures.
Importantly, contrary to widely held beliefs about the superiority of tests with high face validity, the MSET did not evidence any advantage over the D-KEFS. Instead, the D-KEFS predicted IADLs well beyond the MSET. It is thus likely that the D-KEFS taps a broader range of EF processes than the MSET. Indeed, traditional EF tests have been shown to predict occupational functioning (for reviews, see Gilbert & Marwaha, 2013; Ownsworth & McKenna, 2004), whereas the MSET has not (Moriyama et al., 2002), further suggesting that the MSET may tap a narrower range of processes. While it could be argued that our D-KEFS composite taps a broader range of processes simply because it is based on four different subtests, it is noteworthy that the D-KEFS subtests outperformed the MSET even when examined individually. Lastly, given that the MSET was no longer associated with IADLs once two participants with mildly impaired cognition were removed, it appears that the MSET is vulnerable to ceiling effects and, as such, is not sensitive to subtle deficits. Together, the test’s somewhat narrow range of sensitivity, combined with a potentially narrow scope of the IADL capacities to which it relates, may explain the inconsistent findings in prior research.
Alternatively, methodological limitations may also explain the inconsistencies in the prior literature. Specifically, prior ecological validations of the MSET reviewed in the introduction utilized only between 24 and 120 participants (median = 47.5). Because about one half of the reviewed studies attempted MSET validation on samples smaller than 50, their results may be unstable (e.g., Harris, 1985; Van Voorhis & Morgan, 2007) and vulnerable to non-replication. Poor reliability is yet another possible explanation. Regardless of the sources of inconsistency, the present study suggests that the MSET does not incrementally improve upon the D-KEFS in predicting functional outcomes, at least not among community-dwelling older adults.
The importance of outcome variables
Past research examining the associations between EF tests and daily functioning has been criticized for relying predominantly on participant or collateral reports about IADLs (for review, see Robertson & Schmitter-Edgecombe, 2016). In contrast, the present study utilized the DAILIES, which (per participant endorsement) closely mimics typical daily tasks. The DAILIES has several advantages over typically used methods. First, it reflects IADL performance within the context of daily life, with participants completing the DAILIES while also attending to other daily demands, responsibilities, and distractions. Thus, participants needed to independently plan and problem-solve how to interleave the DAILIES tasks within their daily routines while also engaging prospective memory to complete the tasks during the correct timeframes.
Second, the DAILIES assesses participants’ performance of IADLs over a somewhat extended period, unlike typical behavioral assessments that examine a single “snapshot” in time. An extended assessment period is critical because EF is known to fluctuate due to a variety of contextual factors (Berryman et al., 2014; Franchow & Suchy, 2015, 2017; Suchy et al., 2022; Tinajero et al., 2018), leading to lapses in IADLs that are intermittent and thus cannot be readily captured in a single assessment session. Importantly, because individuals with even mild EF weaknesses are more vulnerable to such fluctuations (Killgore et al., 2009; Williams et al., 2009), predictors of daily functioning need to be sensitive to subtle EF weaknesses.
Lastly, the DAILIES allowed us to examine whether our tests can predict IADLs prospectively, generalizing from performance assessed at one timepoint to future behavior at home. In contrast, most research examines concurrent associations between EF measures and IADL tasks (e.g., Alderman et al., 2003; Conti & Brucki, 2018; Frisch et al., 2012; Suchy et al., 2019), potentially confounding results with a third variable shared in space and time, such as experiencing pain (Attridge et al., 2015; Heyer et al., 2000) or not having slept well the night before testing (Fortier-Brochu et al., 2012; Holding et al., 2021; Miyata et al., 2013). Indeed, such contextual factors affect both EF (Berryman et al., 2014; Niermeyer & Suchy, 2020; Tinajero et al., 2018) and IADLs (Hicks et al., 2008; Stamm et al., 2016; Webb et al., 2018), potentially confounding concurrently observed associations.
Traditional tests of EF, ecological validity, and a call to action
Despite the fact that the present results offer only somewhat “soft” support for the MSET’s ability to predict daily functioning, they nevertheless do, at least technically, support the MSET’s ecological validity, in that the MSET possesses face validity and relates (albeit weakly) to daily IADL performance. Interpretation is less straightforward for the D-KEFS. On the one hand, if we define ecological validity as a test’s ability to predict functional outcomes, then the D-KEFS certainly appears to be more ecologically valid than the MSET. On the other hand, if we define ecological validity as requiring that a test have face validity, then the D-KEFS cannot be deemed ecologically valid regardless of how well it predicts daily functioning. This latter perspective undermines any clinical utility of the term ecological validity. It is our position that the term ecological validity “muddies the waters,” misleading clinicians and mischaracterizing the clinical utility of tests. Use of the term often leads to the erroneous impressions that (a) traditional EF tests cannot possibly predict daily functioning because they lack face validity, and (b) tests with high face validity are inherently able to predict daily functioning and, as such, are superior to traditional measures. The term ecological validity has been criticized for similar reasons in other areas of psychology as well (Holleman et al., 2020; Kihlstrom, 2021). We therefore call on our field to retire the term ecological validity in favor of more concrete terminology and/or concrete descriptions of what a given test can accomplish in a given population. Indeed, depending on the specific study design, the terms predictive, criterion, and concurrent validity communicate clearly what a given test can or cannot accomplish, making them more informative and useful in both clinical and research contexts.
Limitations
The present study needs to be interpreted within the context of some limitations. First, the sample was predominantly non-Hispanic White, highly educated, and composed of individuals who were high functioning and cognitively healthy, which may have affected the results. Indeed, the MSET was skewed, suffering from a ceiling effect, and the MSET results were driven by two mildly impaired participants. Thus, while it appears that the D-KEFS is more sensitive to subtle EF deficits than the MSET, it is not known whether the MSET would outperform the D-KEFS in another, more impaired sample. Additionally, it is unclear whether cultural or linguistic factors would impact performances on the current measures unevenly, further affecting the results. Thus, we must remind ourselves that validity is specific not only to a given test, but also to the population in which validation occurred.
Second, the present study pitted the D-KEFS composite against a single test. It is possible that a composite of all BADS subtests would perform better than the MSET alone, and possibly even better than the D-KEFS. Relatedly, it is possible that the weakness of the MSET relative to the D-KEFS stems not from a poorer ability to tap relevant neurocognitive processes (i.e., the measure’s content), but rather from poorer psychometric properties, namely poorer reliability. Future research should examine these questions. Meanwhile, although the present results technically support the ecological validity of the MSET, they do not support its use in place of, or in addition to, traditional EF measures.
Conclusions
The present study offers some weak support for the predictive validity of the MSET. However, this support is considerably tempered by the fact that the D-KEFS accounted for variance in IADLs beyond the MSET, while the MSET failed to contribute incrementally to the prediction. Additionally, while the D-KEFS was related to the two other measures of IADLs (self-report and lab-based performance), the MSET was not related to either. Thus, at least among community-dwelling older adults, the D-KEFS appears to have greater clinical utility than the MSET. Despite these findings, which favor the D-KEFS over the MSET, the term “ecological validity” can be applied more confidently to the MSET than to the D-KEFS, due to the MSET’s greater face validity. These conclusions demonstrate the lack of clinical utility of the term ecological validity. Thus, we argue that “ecological validity” should be avoided in assessment contexts and, as appropriate, replaced with more descriptive terms such as criterion, predictive, or concurrent validity.
Funding statement
The study was funded by the senior author’s development fund awarded by the University of Utah.
Competing interests
None to declare.