Introduction
When Tim Shallice first introduced the Tower of London (TOL) planning task to measure frontal brain functions (Shallice, 1982), it marked the starting point for a series of further developments of variants and versions of the TOL and other so-called disk-transfer tasks (Berg & Byrd, 2002). One reason for these diverse developments was the rather unsatisfactory reliability of the original TOL version in adults (Cronbach’s α = .25, split-half reliability r = .19, Humes, Welsh, & Retzlaff, 1997; see also Michalec et al., 2017; test–retest reliability, r = .60; Lowe & Rabbitt, 1998) and in children (Cronbach’s α could not be determined; test–retest reliability was r = .23; Syväoja et al., 2015). Today, there are several versions and variants of the TOL task that feature acceptable psychometric properties (e.g., Culbertson & Zillmer, 2005: test–retest r = .75 for total moves, r = .59 for total correct in patients with Parkinson’s disease; Schnirman, Welsh, & Retzlaff, 1998: test–retest r = .70; Tucha & Lange, 2004: test–retest r = .85; Unterrainer et al., 2019: internal consistency glb = .76).
In the context of cognitive tasks, reliability indexes the stability of measurements and comprises two main aspects: (i) internal consistency and split-half reliability, which reflect the degree to which all items of a task measure the same underlying construct, and (ii) the consistency between repeated measurements with identical (test–retest reliability) or highly similar versions (parallel test–retest reliability) of the task. In the present study, we focus on the latter aspect by studying the test–retest and parallel test–retest reliability of the TOL. Previous studies have mainly reported the Pearson correlation coefficient. However, it is no longer considered an ideal measure of identical and parallel test–retest reliability, as it only captures the relative consistency but not the absolute agreement of test scores over repeated measurements. For absolute agreement, the total score variance is taken into account, including not only the variance between the two measurements but also the variance within the sample (McGraw & Wong, 1996). There is a growing consensus toward using different forms of the intraclass correlation coefficient (ICC; see Shrout & Fleiss, 1979; McGraw & Wong, 1996), as these inform about both the relative consistency [ICC(3,1)] and the absolute agreement [ICC(2,1)] between repeated measurements.
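For reference, both coefficients can be written in the notation of Shrout and Fleiss (1979) in terms of the mean squares of a two-way ANOVA decomposition with n subjects and k sessions (MS_R: between subjects; MS_C: between sessions; MS_E: residual):

    ICC(3,1) = (MS_R - MS_E) / (MS_R + (k - 1) * MS_E)

    ICC(2,1) = (MS_R - MS_E) / (MS_R + (k - 1) * MS_E + k * (MS_C - MS_E) / n)

The additional term in the denominator of ICC(2,1) penalizes systematic mean differences between sessions (e.g., practice effects), which is why absolute agreement is typically lower than relative consistency when scores improve at retest.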
Tyburski, Kerestey, Kerestey, Radoń, and Mueller (2021) have recently provided a comprehensive overview of identical and parallel test–retest reliability studies of TOL versions, which also lists four studies on adults that reported ICCs. It is noticeable that, with the exception of one study (Köstering, Nitschke, Schumacher, Weiller, & Kaller, 2015; ICC(2,1) = 0.69 for accuracy in terms of total optimal solutions), the ICCs for the performance parameters remained considerably below the desired threshold of at least 0.5, indicating poor reliability (Portney & Watkins, 2000). More specifically, for outcomes based on the number of problems solved, Lemay, Bédard, Rouleau, and Tremblay (2004) report an ICC(2,1) of 0.33, Tunstall, O’Gorman, and Shum (2016) an ICC(2,1) of 0.45, and Tyburski et al. (2021) even a negative ICC(3,1) of –0.44.
This observation of low replicability of TOL measurements is neither new nor surprising when seen in the wider context of similar findings for other tasks measuring higher-order executive functions (Burgess, 1997; Rabbitt, 1997). Planning, as a prototypical higher-order executive function, reflects the mental generation and evaluation of potential solution alternatives in novel problem situations. This novelty aspect particularly hampers the test–retest reliability assessment of planning tasks, as novelty is no longer given in a second measurement with identical problem items, and practice effects are likely to occur (Rabbitt, 1997; Strauss, Sherman, & Spreen, 2006). One way to avoid using identical items at the second measurement is to use an alternative or parallel-test version. Accordingly, Calamia, Markon, and Tranel (2012) observed that the use of alternate forms helps to decrease the size of practice effects, although it does not necessarily increase reliability. In a meta-analysis of test–retest correlations of instruments typically used in neuropsychological assessment (Calamia, Markon, & Tranel, 2013), for the majority of tests the administration of a parallel form at retest was associated with a lower test–retest correlation than retesting with the identical form. Although the magnitude of these differences was generally Δr = .1 or less, according to the authors, psychometric properties such as difficulty can differ significantly between test versions.
In this respect, it is an open question whether the test–retest reliability of an identical problem set differs from the parallel test–retest reliability of an alternative but highly similar problem set. One instrument that can be used to systematically address this question is the TOL-Freiburg version (Kaller, Unterrainer, Kaiser, Weisbrod, & Aschenbrenner, 2012a). This planning test has a sufficiently high internal consistency (glb = .73 and .76; Kaller et al., 2016, and Unterrainer et al., 2019, respectively) and hence fulfills basic psychometric requirements. It was developed in three parallel-test versions (A, B, and C) whose problems are identical in structure but differ in physical appearance due to permutations of the ball colors. These versions should thus impose identical cognitive demands, since structural problem parameters such as search depth and goal hierarchy were kept completely identical (Kaller, Rahm, Köstering, & Unterrainer, 2011; Kaller, Unterrainer, & Stahl, 2012b). Köstering et al. (2015) already assessed test–retest reliability of the TOL-Freiburg using version A at the first and version B at the second session over a 1-week interval. The Pearson correlation (r = .739) and the ICC for absolute agreement (ICC(2,1) = .690) indicated adequate test–retest reliability. As participants in that study performed two different versions and the sample size was rather small (n = 27), here we addressed the question of whether the test–retest reliability of an identical problem set (versions A-A and B-B) differs from the parallel test–retest reliability (versions A-B and B-A) on the basis of two larger samples.
Methods
Participants
The study comprised two separate, non-overlapping samples, each including only participants who had no previous experience with the TOL task.
For the parallel test–retest sample, we originally recruited 103 young adults, most of whom held a high-school degree or were enrolled in university studies. Inclusion criteria were sufficient German language skills to ensure comprehension of the task instructions, age between 18 and 26 years, and normal or corrected-to-normal vision. Exclusion criteria were current or past psychiatric or neurological disease, psychotropic medication, and color blindness. Depressive symptoms as well as crystallized and fluid intelligence were assessed with the Beck Depression Inventory-II (BDI-II; Beck, Steer, & Brown, 1996), a German vocabulary test (Mehrfachwahl-Wortschatz-Intelligenztest, MWT-B; Lehrl, 2005), and the Advanced Progressive Matrices (short version, German adaptation and norming; Bulheller & Häcker, 1998), respectively. Ten subjects had to be excluded due to increased depression scores (BDI score above 14). The final parallel test–retest sample (N = 93; 49 females) had a mean age of 21.9 years (SD = 1.95; range 18.33–25.92). Participants were compensated with €20 for both sessions. In the parallel test–retest sample, half of the participants completed version A at the first session and version B at the second session, while the other half started with version B at the first session and continued with version A at the second session.
For the identical test–retest sample, 93 young participants of comparable educational background were recruited, applying the same inclusion/exclusion criteria, screening for depressive symptoms, tests of crystallized and fluid intelligence, and compensation as for the parallel test–retest sample. Seven participants had to be excluded, resulting in a final identical test–retest sample of 86 participants (48 female) with a mean age of 22.01 years (SD = 2.32; range 18.08–26.42). In the identical test–retest sample, half of the participants performed version A at both the first and the second session, while the other half went through the same procedure with version B. Table 1 provides a comparative overview of both samples.
The study was approved by the local ethics authorities (EK-Freiburg no. 479/19). Data acquisition complied with local institutional standards for human research and was completed in accordance with the Declaration of Helsinki.
Tower of London – Freiburg Version (TOL-F)
All participants were tested individually in a quiet room with the TOL-F (Kaller et al., 2012a). The TOL-F is a computerized, pseudo-realistic representation of the TOL’s original wooden configuration and is implemented in the Vienna Test System (https://marketplace.schuhfried.com/de/tol). Individual problem items consist of a start state and a goal state that are presented in the lower and upper halves of the computer screen, respectively. To transfer the start state into the goal state, the TOL-F is operated via touch screen: a ball is picked up simply by touching it with a finger. The selected ball is then encircled by a transparent whitish corona and can be moved by touching the target rod.
Subjects are instructed to transform the start state into the goal state in the minimum number of moves, which is shown to the left of the start state. Written instructions state that only one ball may be moved at a time, that balls cannot be placed beside the rods, that only the top-most ball can be moved when several balls are stacked on a rod, and that the rods differ in their capacity to accommodate at most one, two, or three balls. The computer program does not allow these rules to be broken, but records any attempts to do so. Instructions further emphasize that problems have to be solved in the minimum number of moves and that participants should always plan the solution ahead before starting to execute moves.
For the assessment of individual planning ability with the TOL-F, overall planning accuracy, defined as the number of problems correctly solved in the minimum number of moves, is regarded as the primary outcome variable of interest. The TOL-F comprises three levels of minimum moves (four-, five-, and six-move problems, eight of each, presented in increasing order of minimum moves), resulting in a maximum overall planning accuracy of 24 problems (corresponding to 100 percent). A one-minute time limit per trial was implemented, as in the original study by Shallice (1982).
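To make the scoring rule explicit, the following minimal sketch (in Python, with purely illustrative field names; the Vienna Test System computes this score internally) counts a problem as correct only if it was solved in the minimum number of moves within the one-minute limit:

    # Hypothetical per-trial records; field names are illustrative only.
    trials = [
        {"min_moves": 4, "moves_used": 4, "solved": True, "time_s": 38.2},
        {"min_moves": 5, "moves_used": 7, "solved": True, "time_s": 52.0},
        {"min_moves": 6, "moves_used": 6, "solved": True, "time_s": 59.5},
    ]

    # Overall planning accuracy: problems solved optimally within the time limit.
    accuracy = sum(
        t["solved"] and t["moves_used"] == t["min_moves"] and t["time_s"] <= 60
        for t in trials
    )
    accuracy_pct = 100 * accuracy / 24  # 24 problems in the full TOL-F set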
The TOL-F features three parallel-test versions, A, B, and C, consisting of the same set of problems in color-permuted form; that is, the ball colors are interchanged (cf. Berg & Byrd, 2002). Across parallel-test versions, problems are thus structurally identical while differing in physical appearance. As described in the Participants section, version order was counterbalanced: in the parallel test–retest sample, half of the participants completed version A at the first session and version B at the second session, and the other half vice versa; in the identical test–retest sample, half of the participants performed version A at both sessions, and the other half version B.
Analyses
First, to compare planning accuracy between the two samples and to assess changes over the two time points, a repeated-measures ANOVA (RM-ANOVA) was calculated with the within-subject factor session (1 versus 2) and the between-subjects factor group (parallel test–retest versus identical test–retest). For assessing parallel and identical test–retest reliabilities, ICCs for absolute agreement [ICC(2,1), two-way random-effects model] and relative consistency [ICC(3,1), two-way mixed-effects model] were computed following Shrout and Fleiss (1979). For comparability with previous studies, we also report Pearson product–moment correlations as indices of identical/parallel test–retest reliability as well as the greatest lower bound (glb) as an index of internal consistency.
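For transparency, the following minimal sketch implements the two ICC point estimates according to the Shrout and Fleiss (1979) formulas given in the Introduction. It is not the software actually used for the analyses, and the arrays s1 and s2 are hypothetical placeholders for per-participant accuracy scores at the two sessions:

    import numpy as np

    def icc_sf(s1, s2):
        # Single-measure ICCs after Shrout & Fleiss (1979), n subjects x k sessions.
        x = np.column_stack([s1, s2]).astype(float)
        n, k = x.shape
        grand = x.mean()
        ss_rows = k * np.sum((x.mean(axis=1) - grand) ** 2)    # between subjects
        ss_cols = n * np.sum((x.mean(axis=0) - grand) ** 2)    # between sessions
        ss_err = np.sum((x - grand) ** 2) - ss_rows - ss_cols  # residual
        ms_r = ss_rows / (n - 1)
        ms_c = ss_cols / (k - 1)
        ms_e = ss_err / ((n - 1) * (k - 1))
        icc31 = (ms_r - ms_e) / (ms_r + (k - 1) * ms_e)  # relative consistency
        icc21 = (ms_r - ms_e) / (
            ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n
        )  # absolute agreement
        return icc21, icc31

    # Example with made-up scores (0-24 problems solved optimally):
    s1 = np.array([14, 17, 12, 20, 16, 18])
    s2 = np.array([16, 18, 13, 22, 17, 21])
    icc21, icc31 = icc_sf(s1, s2)
    pearson_r = np.corrcoef(s1, s2)[0, 1]  # for comparability with older studies

Equivalent estimates, including confidence intervals, are available from standard statistics packages (e.g., the intraclass_corr function in the Python package pingouin); the sketch above merely makes the point estimates transparent.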
Results
Overall planning performance
RM-ANOVA with the within-subject factor session (1 versus 2) and the between-subjects factor group (parallel versus identical test–retest) on planning accuracy revealed significant main effects of session (F(1, 177) = 70.010, p < .001; partial η² = .283) and group (F(1, 177) = 6.076, p = .015; partial η² = .033), but no interaction effect (F(1, 177) = 1.175, p = .280; partial η² = .007). As evident from Figure 1 and the descriptive data in Table 2, participants increased their planning performance by about 6.5% on average (i.e., about 1.5 of the 24 problem items) from the first to the second session. In addition, the parallel test–retest group performed about 5% better than the identical test–retest group across both sessions.
Note. Min = minimum; Max = maximum; SD = standard deviation; Difference score in accuracy is computed by subtracting Session 1 from Session 2.
To additionally check whether the version administered first was associated with a different learning process, the analysis above was carried out separately for both samples, supplemented with the between-subjects factor start (A versus B).
For the parallel test–retest sample, there was again a significant main effect of session (F(1, 91) = 24.056, p < .001; partial η² = .209), but neither a main effect of start (F(1, 91) = 0.877, p = .351; partial η² = .010) nor an interaction effect (F(1, 91) = 1.714, p = .194; partial η² = .018).
In the identical test–retest sample, there was likewise a significant main effect of session (F(1, 84) = 50.595, p < .001; partial η² = .376), but neither a main effect of start (F(1, 84) = 0.013, p = .911; partial η² = .000) nor an interaction effect (F(1, 84) = 1.298, p = .258; partial η² = .015). In both samples, performance increased from the first to the second session, but there was no difference between starting with version A versus B and no interaction with learning across the repeated assessments.
Internal consistency (glbs)
The greatest lower bound estimates for the parallel test–retest sample were 0.765 and 0.854 for sessions 1 and 2, respectively. In the identical test–retest sample, glbs were 0.806 and 0.817 for sessions 1 and 2, respectively.
Parallel test–retest and identical test–retest reliability
In the parallel test–retest sample, overall planning accuracy for repeated assessments with different versions (A-B and B-A) showed a Pearson correlation of r = .559 (95% confidence interval [.400, .684], p = .001), an ICC(3,1) for relative consistency of .556 (95% CI [.398, .682], p = .001), and an ICC(2,1) for absolute agreement of .501 (95% CI [.268, .664], p = .001).
In the identical test–retest sample, overall planning accuracy for repeated assessments with identical versions (A-A and B-B) revealed a Pearson correlation of r = .708 (95% CI [.584, .800], p = .001), an ICC(3,1) for relative consistency of .708 (95% CI [.585, .800], p = .001), and an ICC(2,1) for absolute agreement of .605 (95% CI [.204, .791], p = .001).
To check whether the numerically higher reliability in the identical test–retest sample might be related to differences in variance, we also compared the variance of overall performance between the groups with Levene’s test. The group variances did not differ significantly in either session, in line with the assumption of equal variances (Session 1: F(1, 177) = 0.378, p = .539; Session 2: F(1, 177) = 0.000, p = .991). The same held for the between-session difference scores: according to Levene’s test, there was no significant difference between the group variances (F(1, 177) = 2.119, p = .147).
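As a brief sketch of this check (variable names are hypothetical; the arrays hold each group’s session-2 minus session-1 accuracy scores):

    import numpy as np
    from scipy.stats import levene

    # Placeholder difference scores for the two groups (illustrative values).
    diff_parallel = np.array([2, 1, 0, 3, 1, 2])
    diff_identical = np.array([1, 2, 2, 0, 1, 3])

    # Classic Levene's test centers on the group means; note that scipy's
    # default center='median' corresponds to the Brown-Forsythe variant.
    stat, p = levene(diff_parallel, diff_identical, center='mean')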
Discussion
This study examined parallel and identical test–retest reliability of the TOL-F and revealed the following results: On the one hand, reliability was numerically higher for repeated assessment with the identical version than with the parallel-test version. On the other hand, we found higher ICC absolute agreement values than most previous TOL studies. Except for the study by Köstering et al. (2015), no ICC(2,1) values for absolute agreement above .45 have been published so far for any TOL version (Tyburski et al., 2021). With ICCs(2,1) of .501 and .605 for parallel and identical test–retest reliability, respectively, the present values clearly exceed this mark. For both results, however, it must be noted that the confidence intervals in the current study are rather large. Thus, the plausible ranges of the true reliability values overlap both between the parallel and the identical test–retest versions and in comparison with previous studies, indicating no significant differences.
The main question of this study was the comparison of retesting with an identical versus an alternative version. In line with the results of Calamia et al. (2013), the identical versions achieved numerically higher reliability than the alternative versions. Calamia et al. call for alternative versions that are ideally psychometrically identical, although this is not the case for many instruments (Calamia et al., 2012). TOL-F versions A and B consist of the same set of problems with only the ball colors interchanged (cf. Berg & Byrd, 2002). We therefore reasoned that the color permutation should come close to the idea of an ideal parallel version and at least reduce item-specific learning; general task learning remains, but that is the case for any retest. Numerically, it seems that even the exchange of colors can lead to different reliabilities, although this conclusion is qualified by the overall performance difference between the two groups. It was confirmed again that the TOL-F problem set, which is balanced according to known structural problem parameters (Kaller et al., 2011), can exhibit satisfactory psychometric properties and even exceed previously established internal consistencies (Kaller et al., 2012b, 2016; Unterrainer et al., 2019). However, the present ICC values can only be rated as “moderate” (between .5 and .75) according to the criteria of Portney and Watkins (2000). This probably reflects the double-edged relationship between executive functions and reliability. In their meta-analysis of some of the most widely used neuropsychological tests, Calamia et al. (2013) demonstrated that EF tests had poorer test–retest reliabilities (r < .70) than other measures of cognitive performance. One explanation is the assumption that complex EF tasks involve multiple cognitive processes, which makes them more susceptible to performance variability in repeated testing (Delis, Kramer, Kaplan, & Holdnak, 2004). In other words, the intended measurements of higher-order cognitive functions such as planning can also be strongly influenced by ongoing basic processes such as attention or working memory. Another explanation for lower reliability could be a learning effect that affects the second measurement: According to Strauss et al. (2006), practice effects on EF tests can lead to a restriction of range in test scores, which in turn results in lower test–retest correlations. However, this assumption is only partly consistent with the present data and the analyses of Calamia et al. (2012, 2013). It is probably not the size of the practice effect but individual changes in the rankings between the two measurement points that explain the different reliabilities (individual change of position in the second measurement; Duff, 2012). In very homogeneous samples such as ours, the range of test scores is more restricted than in representative samples of the population (Strauss et al., 2006).
Participants of the same age with similar cognitive abilities show less variance in performance than a more heterogeneous group with large differences in age and education. Lower variance renders preservation of the same ranking at the second measurement less likely and may thus also lead to lower reliability estimates.
But how can the noticeably higher ICCs (by about Δr = .2) in the study by Köstering et al. (2015) be explained? After all, that study used the same TOL versions as the current parallel test–retest sample (A-B versus B-A), the test interval was identical, and the participants were students of the same age with similar intelligence scores, recruited using the same inclusion and exclusion criteria. Apart from random sampling variance, the outlier adjustment in Köstering et al. may offer an explanation. Since they studied a small sample (n = 29), they had to omit two cases deviating more than 2.5 standard deviations from the mean z-standardized between-session difference score to obtain reasonably normally distributed data. The two outliers were at the negative end of the distribution, meaning that their performance change at the second measurement ran opposite to that of the whole group, which on average performed better at the second measurement. Duff (2012) has described how markedly test–retest reliabilities decrease when second measurements go in the contrary direction. The sample in the present study, which was more than three times larger, yielded an acceptable normal distribution of the data without such adjustments, so that all values at both ends of the scale were included.
Limitations
A clear constraint is the rather homogeneous sample. A broader sample in terms of age and education would presumably allow the reliabilities to increase even further and would offer better generalizability to the population. In addition, the inclusion of patient groups would be desirable. Although a total of more than 180 subjects were tested across both studies, the sample size is still below Watson’s (2004) recommendation of at least 300 participants. To better quantify learning effects, several retests with different time intervals should be conducted. The overall performance difference between the two groups was an undesired outcome and might be related to the period of data collection: While the parallel test–retest assessment was finalized before the COVID-19 pandemic, the identical test–retest measurements took place during the pandemic. Testing conditions therefore differed slightly due to the need to wear face masks and to keep a greater interpersonal distance. In addition, one may speculate that, owing to reduced social contact and suspended face-to-face teaching, students may have been in a generally poorer mental and emotional condition during this time.
Conclusion
Even though the reliabilities obtained were only moderate, as is commonly observed for EF measures, the present study reports some of the highest reliability estimates for the TOL task so far. The small difference in reliability values between identical and parallel versions speaks in favor of using the same version, as this allows more stable results to be expected over two measurement points.
Acknowledgements
This study received no funding from third-party public funders, foundations, or companies; it was financed exclusively from in-house resources.
Conflicts of interest
JMU declares receiving a small proportion of the license fees for the Freiburg version of the Tower of London task (TOL-F) from SCHUHFRIED GmbH due to authorship of the published test materials (Kaller et al., 2012a).