Working memory refers to the mental system responsible for maintaining, processing, and using information in service of complex cognition (e.g., Baddeley, Reference Baddeley2017; see Cowan, Reference Cowan2017 for a taxonomy of definitions). Particularly among theoretical accounts of second language acquisition (SLA) that emphasize the role of attention (e.g., Leow, Reference Leow2015; Schmidt, Reference Schmidt1993; Skehan, Reference Skehan2018; VanPatten, Reference VanPatten2004), it is broadly assumed that the cognitive processes that support successful L2 development and performance rely heavily on the effective deployment of limited working memory resources. Kormos (Reference Kormos, Granena and Long2013), for instance, identifies a central role for working memory across all language learning processes, including input processing, noticing, knowledge integration, and automatization. In line with this view, individual differences in working memory capacity (WMC) have been shown to predict a wide array of L2 outcomes, including vocabulary and grammar development (e.g., Li, Reference Li2015; Li, Ellis, & Zhu, Reference Li, Ellis and Zhu2019; Martin & Ellis, Reference Martin and Ellis2012; Serafini & Sanz, Reference Serafini and Sanz2016; Zalbidea & Sanz, Reference Zalbidea and Sanz2020; for a review, see McCormick & Sanz, Reference McCormick, Sanz, Schwieter and Wen2022), reading and listening comprehension (e.g., Abu-Rabia, Reference Abu-Rabia2003; In’nami, Hijikata, & Koizumi, Reference In’nami, Hijikata and Koizumi2022; Sagarra, Reference Sagarra2017), and accuracy and complexity in L2 production (e.g., Vasylets & Marín, Reference Vasylets and Marín2021; Zalbidea, Reference Zalbidea2017).
Research has also documented positive, albeit variable, links between WMC and dimensions of explicit L2 learning aptitude (e.g., Robinson, Reference Robinson and Robinson2002; Roehr & Gánem-Gutiérrez, Reference Roehr and Gánem-Gutiérrez2009; Sáfár & Kormos, Reference Sáfár and Kormos2008; Yoshimura, Reference Yoshimura, Bonch-Bruevich, Crawford, Hellermann, Higgins and Nguyen2001), a construct shown to predict L2 achievement across domains (e.g., Granena, Reference Granena2014; Saito, Reference Saito2017; Saito, Suzukida, & Sun, Reference Saito, Suzukida and Sun2019; Sparks, Humbach, Patton, & Ganschow, Reference Sparks, Humbach, Patton and Ganschow2011; Sparks, Patton, & Luebbers, Reference Sparks, Patton and Luebbers2019; Yalçin & Spada, Reference Yalçin and Spada2016; Yilmaz & Granena, Reference Yilmaz and Granena2016; for a review, see Li, Reference Li, Wen, Skehan, Biedroń, Li and Sparks2019). As discussed further below, a key observation is that these prior L2 studies have measured WMC using complex span tasks, such as reading or operation span tasks, which combine a memory storage requirement (e.g., remembering letters) with an extraneous processing requirement (e.g., solving math equations).
Although decades of research have documented the theoretical and empirical importance of working memory as a construct in SLA, methodological discussions surrounding the measurement of WMC have only recently burgeoned (see Shin & Hu, Reference Shin, Hu, Schwieter and Wen2022; Wen, Reference Wen2016; Wen, Juffs, & Winke, Reference Wen, Juffs, Winke, Winke and Brunfaut2021). Such discussions have centered mostly around the validity of complex span tasks, such as debates about how manipulating different task parameters impacts their psychometric properties (e.g., Alptekin, Erçetin, & Özemir, Reference Alptekin, Erçetin and Özemir2014; In’nami et al., Reference In’nami, Hijikata and Koizumi2022; Linck, Osthus, Koeth, & Bunting, Reference Linck, Osthus, Koeth and Bunting2014; Shin, Reference Shin2020) or how certain measurement procedures may lead to confounding effects (e.g., Wen, Reference Wen2016; Wen et al., Reference Wen, Juffs, Winke, Winke and Brunfaut2021). For example, Alptekin et al. (Reference Alptekin, Erçetin and Özemir2014) showed that the demands of the processing requirement in the reading span test (judging if a sentence contains a morphosyntactic vs. semantic anomaly) modulate the relationship between L2 reading achievement and the processing component of this complex span measure. These studies have important implications for the administration and assessment of complex span tasks, particularly as they relate to the widely popular reading span test, and have aided in identifying factors that impact the criterion validity of these tasks in L2 research. Interestingly, however, there has been less consideration of how WMC is operationalized across different memory tests, and what the implications of researchers’ test choices can be for the theoretical interpretations of the links between WMC and L2 outcomes.
These issues warrant attention, as addressing them can lead to more grounded rationales for the selection of WMC tasks that combine storage and processing demands and, ultimately, can help increase the explanatory and predictive utility of WMC in SLA research.
Here, we sought to contribute toward this broader goal by bringing attention to the potential of content-embedded WMC tasks. While both complex span and content-embedded tasks implement a dual-task paradigm that requires processing and maintenance of information, they differ critically in that the former demand maintenance of extraneous (i.e., task-irrelevant) memory elements during processing (e.g., memorize letters while solving math equations unrelated to said letters), while the latter demand processing and maintenance of the same (i.e., task-relevant) memory elements (see Zamary, Rawson, & Was, Reference Zamary, Rawson and Was2019). Since cognitive accounts of SLA posit that processing information that is stored in working memory is critical for successful linguistic processing and development to take place (e.g., Leow, Reference Leow2015; Schmidt, Reference Schmidt1993; Skehan, Reference Skehan2018; VanPatten, Reference VanPatten2004), it is reasonable to expect that performance in content-embedded tasks, relative to complex span tasks, would be more predictive of certain L2 outcomes.
We begin this methods forum by describing the functional differences between the two types of WMC tasks in further detail, outlining the rationale for investigating the value of content-embedded tasks for informing SLA research and theory. Next, we report and discuss initial empirical evidence on the potential of content-embedded and complex span tasks for predicting performance in tests of explicit L2 aptitude and L2 reading comprehension, two measures for which links with WMC have been documented in the literature.
Measuring working memory capacity: Complex and content-embedded tasks
Complex span tasks comprise a type of WMC task that reflects concurrent storage and processing demands (e.g., Conway, Kane, Bunting, Hambrick, Wilhelm, & Engle, Reference Conway, Kane, Bunting, Hambrick, Wilhelm and Engle2005; Engle et al., Reference Engle, Tuholski, Laughlin and Conway1999; Kane, Bleckley, Conway, & Engle, Reference Kane, Bleckley, Conway and Engle2001). Of relevance to the present work, a key characteristic of most complex span tasks—such as reading, listening, or operation span tasks—is that they require the maintenance of information that is extraneous or independent from the information being processed (see Miyake, Reference Miyake2001). In other words, such complex span tasks require storing elements in memory that are not relevant to the processing task. For instance, recent versions of the operation span task ask participants to remember letters presented one at a time in serial order (i.e., the storage component) while they judge the accuracy of a solution for a math equation (i.e., the processing component). Reading or listening span tasks are conceptually similar in design, requiring participants to remember letters, digits, or unrelated words while judging the well-formedness or plausibility of a series of sentences. As illustrated, these complex span measures tap into “the ability to maintain information that is irrelevant to the information being processed in the working memory system” (Zamary et al., Reference Zamary, Rawson and Was2019, p. 2547).
Researchers in cognitive psychology have also employed another category of WMC measures, which we refer to here as content-embedded tasks, following Was, Rawson, Bailey, and Dunlosky (Reference Was, Rawson, Bailey and Dunlosky2011) and Zamary et al. (Reference Zamary, Rawson and Was2019). Like the aforementioned complex span tasks, content-embedded tasks are dual tasks involving processing and storage requirements (Ackerman, Beier, & Boyle, Reference Ackerman, Beier and Boyle2002; Kyllonen & Christal, Reference Kyllonen and Christal1990). However, content-embedded tasks are functionally different from complex span tasks in that they require the maintenance and processing of the same memory elements. In this sense, content-embedded tasks involve maintaining “task-relevant” information. For example, in the ABCD task, participants are asked to consider three premises about the order of the letters A, B, C, and D with the goal of creating a string in which the letters are arranged in a particular order. They read about the ordering of A and B first (e.g., “A comes after B”), followed by the ordering of C and D (e.g., “D comes before C”), and then the ordering of the two sets of letters (e.g., “Set 1 comes after Set 2”). Participants are instructed to indicate the solution afterward (in this example, DCBA). As shown, the information being maintained for later recall (e.g., DCBA) is dependent on the processed information (i.e., premises about letter and set ordering). Therefore, unlike the functional requirements of complex span tasks, content-embedded tasks ask participants to maintain information that is relevant, rather than extraneous, to processing.Footnote 1
To our knowledge, no empirical L2 studies to date have relied on content-embedded tasks to measure participants’ WMC. SLA research examining the contributions of WMC to L2 outcomes, including research on L2 aptitude and L2 reading comprehension, has prioritized the use of complex span tasks. Among them, the reading span test remains the most commonly administered measure, although some scholars have cautioned against using it as a single WMC measure given its language-reliant processing component (see, e.g., Juffs & Harrington, Reference Juffs and Harrington2011; Shin & Hu, Reference Shin, Hu, Schwieter and Wen2022; Wen, Reference Wen2016). This state of affairs is well illustrated, for instance, with a frequency analysis of the WMC tests considered in the studies in In’nami et al.’s (Reference In’nami, Hijikata and Koizumi2022) meta-analysis on the link between WMC and L2 reading achievement. As summarized in Table 1, out of the 68 empirical studies included in their sample, around 69% of them administered a reading span test. With a notable difference, the next most common complex span measure was the operation span test, administered in almost 15% of the studies.
Table 1. Distribution of WMC tests in studies included in In’nami et al. (Reference In’nami, Hijikata and Koizumi2022)

Of interest to SLA, research in cognitive psychology indicates that both complex span and content-embedded tasks are positively associated with higher-order cognitive skills, such as reading comprehension (e.g., Unsworth & Engle, Reference Unsworth and Engle2007; Was & Woltz, Reference Was and Woltz2007) or reasoning ability (e.g., Kyllonen & Christal, Reference Kyllonen and Christal1990). Two key empirical studies have directly compared the predictive value of complex span and content-embedded tasks: Was et al. (Reference Was, Rawson, Bailey and Dunlosky2011) predicted that content-embedded tasks (ABCD, digit, and alphabet tasks) would explain a greater amount of variance in L1 reading comprehension, as assessed by a series of standardized tests, relative to complex span tasks (reading, operation, and counting span tasks). They found that the ABCD and digit tasks loaded most strongly onto a content-embedded latent factor and that the reading span and operation span tasks did so onto a complex span latent factor. As expected, content-embedded tasks were more strongly linked to L1 reading comprehension scores than complex span tasks, highlighting the utility of the former for predicting cognitive activities that require processing of information maintained in memory.
In a more recent study using the same WMC measures, Zamary et al. (Reference Zamary, Rawson and Was2019) hypothesized that content-embedded tasks would also explain more variance in inductive reasoning than complex span tasks. Their prediction was driven by the fact that inductive reasoning requires finding common connections among stimulus elements through explicit thought processes, such as hypothesis formation and testing (e.g., Johnson-Laird & Khemlani, Reference Johnson-Laird, Khemlani and Ross2013; Klauer, Willmes, & Phye, Reference Klauer, Willmes and Phye2002). In line with Was et al. (Reference Was, Rawson, Bailey and Dunlosky2011), they found that the ABCD and digit tasks loaded most strongly onto a content-embedded latent factor, whereas the reading span and operation span tasks loaded most strongly onto a complex span latent factor. Their results also highlighted the advantages of content-embedded tasks relative to complex span tasks for explaining variance in inductive reasoning skills. Together, both studies provide evidence indicating that cognitive processes that require the maintenance of task-relevant information can be more strongly predicted by content-embedded tasks than complex span tasks.
Building on this argument and considering prior empirical findings in cognitive psychology, the present study begins exploring the utility of content-embedded tasks in L2 research relative to complex span tasks. As noted, it is uncontroversial to claim that working memory plays a central role in regulating and orchestrating the attentional resources that are implicated in key acquisitional processes, particularly at the beginning stages of SLA in intentional learning contexts, such as those typically found in classroom instruction. Such processes include input processing, conscious noticing of forms, the initial creation of form-meaning links, or the identification of patterns and rules, among others (e.g., Kormos, Reference Kormos, Granena and Long2013; McCormick & Sanz, Reference McCormick, Sanz, Schwieter and Wen2022; Wen, Reference Wen2016). WMC as measured by both complex span and content-embedded tasks is expected to positively relate to linguistic outcomes that are subserved by such processes. However, because manipulating information that is stored in working memory is considered particularly important for successful L2 processing and development, it is reasonable to expect content-embedded tasks to show certain predictive advantages in some cases.
To begin exploring this hypothesis, we consider the predictive value of the most popular complex span tasks (reading span and operation span tests) and established content-embedded tasks (ABCD and digit tests) in connection with L2 learners’ performance in explicit L2 aptitude and L2 reading comprehension tests. We chose to focus on explicit L2 aptitude—specifically, as it relates to language analytic and associative learning skills—and L2 reading comprehension here as a first step because their associations with WMC have been long theorized and continue to receive ample empirical interest. Regarding aptitude, both language analysis and associative learning involve cognitive abilities that require intentionally searching and working out relations and rules in datasets, and thus one’s ability to coordinate the processing and maintenance of task-specific memory elements, as gauged by content-embedded tasks, should serve an important role. Similarly, performance in content-embedded tasks should be particularly predictive of L2 reading comprehension, as readers must maintain relevant textual material in memory while simultaneously integrating it with new incoming material and their long-term prior knowledge to build a coherent mental representation. Effective storage and processing of task-specific textual elements should therefore be critical in managing key reading processes, such as making inferences or updating interpretations.
In sum, while both theory and research indicate connections between WMC and performance on explicit L2 aptitude and L2 reading comprehension tests, we explore the extent to which WMC is linked to individual variability in these criterion measures when operationalized using content-embedded and complex span tasks. In what follows, we provide some conceptual and empirical background to contextualize our results within the broader literature on WMC and its connections to explicit L2 aptitude and L2 reading comprehension.
Working memory capacity and explicit L2 aptitude
Language learning aptitude refers to a “conglomerate of different abilities that can assist in the different stages and processes of language learning” (Kormos, Reference Kormos, Granena and Long2013, p. 141). L2 aptitude has been shown to predict successful L2 learning processes and outcomes, particularly in instructed contexts (e.g., Bylund, Abrahamsson, & Hyltenstam, Reference Bylund, Abrahamsson and Hyltenstam2010; Granena, Reference Granena, Granena and Long2013; Li, Reference Li2015, Reference Li2016, Reference Li, Wen, Skehan, Biedroń, Li and Sparks2019; Kormos & Sáfár, Reference Kormos and Sáfár2008; Linck et al., Reference Linck, Hughes, Campbell, Silbert, Tare, Jackson and Doughty2013; Saito, Reference Saito2017; Saito et al., Reference Saito, Suzukida and Sun2019; Sparks et al., Reference Sparks, Humbach, Patton and Ganschow2011, Reference Sparks, Patton and Luebbers2019; Yalçin & Spada, Reference Yalçin and Spada2016; Yilmaz, Reference Yilmaz2013; Yilmaz & Granena, Reference Yilmaz and Granena2016). Explicit L2 learning aptitude specifically has been operationalized using tests such as the Modern Language Aptitude Test (MLAT; Carroll & Sapon, Reference Carroll and Sapon1959), the Pimsleur Language Aptitude Battery (PLAB; Pimsleur, Reed, & Stansfield, Reference Pimsleur, Reed and Stansfield2004), or the LLAMA tests (Meara, Reference Meara2005). As noted by Li and Zhao (Reference Li and Zhao2021), these tests are considered explicit to the extent that they “require learners to engage in conscious, effortful information processing” (p. 39).
Researchers have also been interested in examining the cognitive abilities that underlie explicit L2 aptitude (see Li, Reference Li, Li, Hiver and Papi2022). Since working memory is the cognitive system responsible for the temporary storage and manipulation of information, and thus the central site for conscious processing (e.g., Baddeley, Reference Baddeley2017; see Cowan, Reference Cowan2017 for an overview of definitions), it is reasonable to posit a strong link between an individual’s WMC and their aptitude for explicit L2 learning (e.g., Robinson, Reference Robinson and Robinson2002, Reference Robinson2005; Sawyer & Ranta, Reference Sawyer, Ranta and Robinson2001; Skehan, Reference Skehan2018; Wen, Reference Wen2016). More specifically, the conscious and effortful mental processes that underlie successful performance on explicit associative learning or inductive grammatical learning tests, where learners are asked to work out or infer relationships between elements, are assumed to depend heavily on one’s capacity for the temporary storage and manipulation of information during online mental operations.
In line with this view, positive links have been reported between WMC and explicit L2 aptitude, although these connections have often been weaker than anticipated. Yoshimura (Reference Yoshimura, Bonch-Bruevich, Crawford, Hellermann, Higgins and Nguyen2001) found that L1 and L2 reading span tests were positively related to the language analytic and word association subtests of the Language Aptitude Battery for Japanese (LABJ; Sasaki, Reference Sasaki1996). Robinson (Reference Robinson and Robinson2002) also reported a moderate positive relationship between the LABJ and L1 reading span scores. Similarly, Sáfár and Kormos (Reference Sáfár and Kormos2008) found a medium-sized positive correlation between L1 Hungarian-L2 English learners’ WMC, measured by a backward digit span testFootnote 2, and a language analysis subtest adapted from the PLAB, as well as total scores in a Hungarian L2 aptitude test. Likewise, Yalçin, Çeçen, and Erçetin (Reference Yalçin, Çeçen and Erçetin2016) reported positive weak-to-moderate correlations between L1 Turkish-L2 English learners’ L1 and L2 reading span scores (but not operation span scores) and the LLAMA-F as well as the LLAMA total score. Roehr and Gánem-Gutiérrez (Reference Roehr and Gánem-Gutiérrez2009), conversely, found no significant correlations between L1 and L2 reading span tests and MLAT scores among L1 English learners of L2 German and Spanish. Lastly, Li’s (Reference Li2016) meta-analysis reported moderate to weak positive correlations between WMC and L2 aptitude components but also called for a more nuanced investigation of the links between WMC and different aspects of L2 aptitude.
In all, findings to date have been variable, leading to interpretations that suggest a reduced role of working memory skills in relation to some L2 aptitude measures (see, e.g., Yalçin et al., Reference Yalçin, Çeçen and Erçetin2016). As noted by Li (Reference Li, Li, Hiver and Papi2022), although WMC and L2 aptitude are related to one another, the constructs appear to be differentiated. Moreover, of relevance to this study, one important methodological observation is that prior research has relied only on complex span measures (primarily the reading span test) to assess learners’ WMC.
Working memory capacity and L2 reading comprehension
Reading is a complex endeavor that draws on various sources of linguistic and nonlinguistic knowledge, as well as substantial attentional resources. Working memory is hypothesized to serve a central role in a wide range of cognitive processes involved in skilled L1 and L2 reading, such as the encoding of sentence-level information and the inference and integration of messages across the text (e.g., Daneman & Carpenter, Reference Daneman and Carpenter1980; Harrington & Sawyer, Reference Harrington and Sawyer1992; Koda, Reference Koda2005). More specifically, Li and D’Angelo (Reference Li, D’Angelo, Chen, Dronjic and Helms-Park2016) explain that working memory “serves as a buffer for recently read text, enabling the reader to make meaningful connections within and between sentences, while retrieving information from long-term memory to facilitate comprehension through the integration of background knowledge” (p. 162). Indeed, decades of research have documented that WMC is an important determinant of L1 reading comprehension (for meta-analytic reviews, see, e.g., Carretti, Borella, Cornoldi, & De Beni, Reference Carretti, Borella, Cornoldi and De Beni2009; Peng et al., Reference Peng, Barnes, Wang, Wang, Li, Swanson and Tao2018).
Working memory is also expected to be strongly implicated in L2 reading comprehension, even more so than in the L1. Since L2 knowledge tends to be more limited and less automated, the type of processing involved in L2 reading is typically more effortful and places heavier demands on learners’ limited working memory resources (e.g., Alptekin & Erçetin, Reference Alptekin and Erçetin2009; Rai, Loschky, & Harris, Reference Rai, Loschky and Harris2015). While extensive research has investigated the role of WMC in L2 reading (e.g., Leeser, Reference Leeser2007; Sagarra, Reference Sagarra2017), findings to date have also been far from consistent, as highlighted in In’nami et al.’s (Reference In’nami, Hijikata and Koizumi2022) most recent meta-analysis on the relationship between WMC and L2 reading comprehension. Their results revealed a small synthetic correlation (r = .30) between the two constructs, which the authors interpreted as indicative of a “weak relationship” (p. 16). Additionally, they found that correlations were larger when WMC measures were administered in the L2 rather than the L1, which can be explained by the fact that L2 WMC tests, unlike L1 or language-neutral ones, “are subject to learners’ L2 proficiency” (Li, Reference Li2023, p. 668). Conversely, correlations were not moderated by factors such as memory task modality (verbal, nonverbal, combined), L2 proficiency, or L2 reading comprehension test standardization. These observations largely align with and expand prior meta-analytic findings (e.g., Jeon & Yamashita, Reference Jeon and Yamashita2014; Linck et al., Reference Linck, Osthus, Koeth and Bunting2014; Shin, Reference Shin2020). Similar to what was observed for explicit L2 aptitude research, it is notable that the reading span test remains the most commonly used test in this line of inquiry, while content-embedded tasks have remained unexplored.
Building on earlier research, the present study begins to examine the utility of complex span and content-embedded tasks for predicting measures of explicit L2 aptitude and L2 reading comprehension among L1 English learners of L2 Spanish. We expected complex span and content-embedded tasks to show similarities since both types of tasks tap into the maintenance and processing of information. Nonetheless, considering that content-embedded tasks are designed to measure the ability to maintain and process task-relevant information—which is deemed critical for successful performance in explicit L2 aptitude and L2 reading comprehension tests—we hypothesized that they would show certain advantages over complex span tasks.
Before describing the study methodology, it is relevant to acknowledge that, while the present study treats explicit L2 aptitude and L2 reading comprehension as criterion variables to assess the predictive utility of complex span and content-embedded tasks, the theoretical connections between WMC and these constructs continue to be a matter of discussion, particularly the directionality and nature of the working memory-aptitude link (see, e.g., Kormos, Reference Kormos, Granena and Long2013; Jackson, Reference Jackson2020; Robinson, Reference Robinson2005). Highlighting that both play important roles in conscious L2 learning, Li (Reference Li, Li, Hiver and Papi2022) notes that WMC and explicit L2 aptitude “are significantly correlated with, and yet dissociable from” each other (p. 39), as suggested by the meta-analyzed findings in Li (Reference Li2016). We recognize, in line with recent proposals and as brought up by an anonymous reviewer, that the interconnections between WMC and L2 aptitude, among other cognitive variables, can be conceptualized as dynamic and reciprocal (see Jackson, Reference Jackson2020; Serafini, Reference Serafini, Kersten and Winsler2022).
That said, we wish to emphasize that our research was not aimed at contributing to theoretical debates regarding the working memory-aptitude link, but rather sought to provide an initial testing ground for advancing methodological discussions surrounding the functionality of different types of WMC tests. Since we did not seek to maximize explained variance in the study’s L2 outcomes, but rather to consider how the two types of WMC tasks behaved, the complex span and content-embedded tasks were treated as the sole cognitive predictors of both L2 aptitude and L2 reading comprehension. Investigating the connection between L2 aptitude and L2 reading comprehension was outside the scope of this work.
Method
Participants
The final sample included 42 undergraduate L1 English learners of L2 Spanish from two public universities in the United States.Footnote 3 Participants, whose mean age was 19.79 years (SD = 2.16; 32 female, 10 male), were enrolled in Spanish language courses and reported studying various majors (e.g., political science, biology, Spanish, social work, neuroscience). They had begun learning Spanish as a foreign language at around 12.69 years of age (SD = 3.95) and indicated a mean total of 5.88 years (SD = 3.50) of formal Spanish instruction, combining classroom experience at university (M = 1.45, SD = 1.59), high school (M = 2.54, SD = 1.34), middle school (M = 1.32, SD = 1.16), and elementary school (M = .82, SD = 1.68). Eight participants reported having some working knowledge of an additional foreign language (e.g., German, Japanese). Four participants indicated having studied abroad in a Spanish-speaking country in programs ranging from 2 weeks to 4.5 months in duration.
Materials
As part of a larger project, the study comprised two data collection sessions, each lasting approximately 1.5 hours. All data were collected individually in a research lab and participants received a $50 Amazon gift card.
Complex Span Tasks
We employed the short versions of the reading span (RSpan) and operation span (OSpan) tasks developed and validated by Oswald, McAbee, Redick, and Hambrick (Reference Oswald, McAbee, Redick and Hambrick2015). Both tasks are available at https://englelab.gatech.edu/taskdownloads and were administered in participants’ L1 (English) using E-Prime 2.0 (Psychology Software Tools Inc., Pittsburgh, PA). We opted for the abridged automated versions of both complex span tasks because their shorter administration time better aligned with the administration time of the content-embedded tasks of this study. Since limited testing time is a common logistic concern for SLA researchers, and our aim here was to compare the predictive utility of each task type, we found it pertinent to ensure that test administration time remained as similar as possible.
Reading span task
In the RSpan task, participants read a set of sentences of approximately 10-15 words in length and are instructed to judge whether or not each sentence is semantically sensible. After each sentence, participants are shown a letter to be recalled at the end of the set. Set sizes range from four to six sentences, and there are two administrations for each set size. The time limit for each processing-storage trial is automatically set to be equivalent to 2.5 standard deviations above the mean time for the processing-only answers that participants provide during the practice trials.
Operation span task
In the automated OSpan task, participants read a set of arithmetic operations and are instructed to judge whether a proposed solution to each math equation is true or false. Following each operation, participants are shown a letter to be recalled at the end of the set. Set sizes range from four to six operations, and there are two administrations for each set size, as in the RSpan. The time limit for each processing-storage trial is also set to be equivalent to 2.5 standard deviations above the mean time for the processing-only answers that participants provide during the practice trials.
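For concreteness, the response-deadline rule shared by both automated span tasks (mean practice processing time plus 2.5 standard deviations) can be sketched as follows. The function name and the practice times are our own illustration, not part of Oswald et al.'s implementation.

```python
import statistics

def trial_time_limit(practice_times_ms, sd_multiplier=2.5):
    # Deadline for each processing-storage trial: the mean of the
    # processing-only response times from the practice trials,
    # plus 2.5 sample standard deviations.
    mean = statistics.mean(practice_times_ms)
    sd = statistics.stdev(practice_times_ms)
    return mean + sd_multiplier * sd

# Hypothetical practice response times in milliseconds
limit = trial_time_limit([1800, 2100, 1950, 2300, 2050])
```

Responses slower than this per-participant limit are counted as processing errors, which keeps the deadline calibrated to each individual's baseline speed.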
Content-embedded tasks
The ABCD and digit tasks were adapted from Zamary et al. (Reference Zamary, Rawson and Was2019) (see also Was et al., Reference Was, Rawson, Bailey and Dunlosky2011) and also administered in participants’ L1 (English) using E-Prime 2.0. Variants of both content-embedded tasks have been employed in prior research (e.g., Ackerman et al., Reference Ackerman, Beier and Boyle2002; Kyllonen & Christal, Reference Kyllonen and Christal1990; Shipstead, Harrison, & Engle, Reference Shipstead, Harrison and Engle2016; Was & Woltz, Reference Was and Woltz2007). The E-Run files for the ABCD and digit tasks can be accessed in the IRIS repository.
ABCD task
The ABCD task asks participants to attend to three premises to establish the order of the letters A, B, C, and D. Participants are first provided with information about the ordering of the letters A and B (e.g., “B comes after A”). Next, participants click a button to replace the first premise with one about the ordering of the letters C and D (e.g., “D comes before C”). This process is repeated once more, replacing the second premise with one about the ordering of the two letter sets (e.g., “Set 2 comes before Set 1”). After clicking a button, participants are shown the eight possible orderings of the four letters and are instructed to select the correct answer (see Figure 1). This cycle is repeated for each of the 23 trials. All trials are self-paced, although participants are asked to respond as quickly as possible. The task includes three practice trials.

Figure 1. Sample response screen from the ABCD task.
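The premise-integration logic of the ABCD task can be sketched in a few lines. The function names and the boolean encoding of the premises below are our own illustration, not part of the original E-Prime implementation:

```python
from itertools import product

def candidate_orders():
    """The eight orderings offered on the response screen: each letter pair in
    either internal order, and the two sets in either order."""
    orders = []
    for set1, set2, set1_first in product([("A", "B"), ("B", "A")],
                                          [("C", "D"), ("D", "C")],
                                          [True, False]):
        pair = set1 + set2 if set1_first else set2 + set1
        orders.append("".join(pair))
    return orders

def solve(a_before_b: bool, c_before_d: bool, set1_first: bool) -> str:
    """Apply the three premises to recover the unique letter order."""
    set1 = ("A", "B") if a_before_b else ("B", "A")
    set2 = ("C", "D") if c_before_d else ("D", "C")
    pair = set1 + set2 if set1_first else set2 + set1
    return "".join(pair)

# Premises from the example: "B comes after A", "D comes before C",
# "Set 2 comes before Set 1"
print(solve(a_before_b=True, c_before_d=False, set1_first=False))  # DCAB
```

Note that the three premises jointly determine exactly one of the eight candidate orders, which is what makes a single correct response possible on each trial.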
Digit task
In the digit task, participants are shown six single-digit numbers one at a time for 2 seconds each. Afterward, participants respond to one or two questions about the number string (e.g., “How many even numbers are there?,” “What is the sum of the first two numbers?”). Participants are told to hold the numbers in memory to answer the questions accurately. All responses are numeric, and participants indicate their answers using the keyboard (see Figure 2). The response slide is untimed, but participants are instructed to respond as quickly and accurately as they can. The task includes a block of 12 single-question trials followed by an additional block of 12 double-question trials. Participants first complete four practice trials that include single questions similar to those in the critical items.

Figure 2. Sample response screen from the digit task.
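The two example question types can be illustrated with a short sketch; the function names and the sample digit string are hypothetical, not drawn from the task materials:

```python
def count_evens(digits):
    """'How many even numbers are there?'"""
    return sum(1 for d in digits if d % 2 == 0)

def sum_first_two(digits):
    """'What is the sum of the first two numbers?'"""
    return digits[0] + digits[1]

digits = [3, 8, 5, 2, 9, 4]  # a hypothetical six-digit string
print(count_evens(digits), sum_first_two(digits))  # 3 11
```

Because the digits are no longer on screen when the questions appear, answering correctly requires holding the full string in memory while performing the requested computation, which is the content-embedded storage-plus-processing demand.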
Explicit L2 aptitude test
Explicit L2 aptitude measures included two tests of language analytic ability and two tests of associative learning skills. The first measure of language analytic ability was Part 4 (Language Analysis) of the PLAB (Pimsleur et al., Reference Pimsleur, Reed and Stansfield2004), which gauges participants’ ability to reason logically about a foreign language by asking them to figure out an artificial language grammar. The test includes directions with sample items and 15 critical items in which participants have to select the correct translation in the artificial language (out of four possible options) for each English phrase or sentence provided.Footnote 4 The second measure was subtest F of the computerized LLAMA (v. 2) aptitude battery (LLAMA-F; Meara, Reference Meara2005), which also tests participants’ ability to infer the grammatical system of an unknown language. Participants see a set of 20 pictures and accompanying sentences as they try to deduce the rules that govern the language. A 5-minute study phase is followed by a self-paced test phase.
For associative learning, the first measure was Part V (Paired Associates) of the MLAT (MLAT-5; Carroll & Sapon, Reference Carroll and Sapon1959), which tests rote learning ability. Participants are instructed to learn 24 English-Kurdish word pairs in 2 minutes. Afterward, there is a 2-minute practice phase where they are given the chance to write down the English equivalents, with the option to look at the study page if needed. Lastly, they are given 4 minutes to complete a multiple-choice test where they have to select the English equivalent of each Kurdish word (out of four possible options). The second measure was subtest B of the LLAMA (v. 2) aptitude battery. Participants have to learn as many associations as possible from a set of 20 novel words from a Central American language and their corresponding images. They have a 2-minute study phase, followed by a self-paced test phase.
We selected these four tests because they operationalize language analytic ability and associative learning skills in a highly comparable manner, and thus anticipated that they would effectively capture common aspects of explicit L2 aptitude. Both the PLAB-4 and the LLAMA-F require participants to infer and apply grammatical rules, such as rules about word-internal morpheme order, in an unknown artificial language. The LLAMA-B and MLAT-5 also gauge participants’ ability to learn novel words similarly, by asking them to form and memorize connections with L1 translations or pictures.
L2 reading comprehension test
The L2 reading comprehension test comprised two subtests at the levels A1 and A2 from the Common European Framework of Reference for Languages, adapted from the standardized Diploma de Español como Lengua Extranjera by the Instituto Cervantes (see Appendix S1 online for materials). Both subtests followed the same format, instructing participants to read a short text and then answer five multiple-choice questions, for a total of 10 items. Since our participants were recruited from various curricular levels, they were told to complete the reading comprehension test at their own pace, taking as much time as needed within reasonable limits. Time on task was recorded and then employed as a covariate in the statistical analyses, as we report subsequently.
To ensure the test’s suitability for the US-based Spanish foreign language context, we sought the expertise of four native speakers representing various Latin-American Spanish varieties (Mexico, Puerto Rico, Colombia, Ecuador), all of whom had experience teaching Spanish in the US at the university level. They were tasked with reviewing the texts and identifying any linguistic elements that they found challenging to understand or were not commonly used in their respective varieties. Additionally, we requested feedback from the basic Spanish language program director of a large public US university to also ascertain the test’s appropriateness for students at lower curricular levels. Based on their input, we made any necessary changes to ensure that the texts and questions were in line with the linguistic expectations of US-based L2 Spanish learners. For instance, we replaced the word “piso” with “apartamento,” since the former is primarily used in Spain to refer to an apartment and may not be as familiar to learners in the US.
Scoring and reliability analyses
Complex span and content-embedded tasks
Descriptive statistics for all WMC measures are summarized in Table 2. Performance in the complex span tasks was represented as the total score, following Conway et al. (Reference Conway, Kane, Bunting, Hambrick, Wilhelm and Engle2005), which reflects the number of correctly recalled elements across all trials regardless of their order (see Oswald et al., Reference Oswald, McAbee, Redick and Hambrick2015).Footnote 5 All participants scored above the 50% accuracy threshold in the processing portion of both tasks, as advised by Richmond et al. (Reference Richmond, Burnett, Morrison and Ball2022).Footnote 6 The mean processing accuracy score was 28.20 (SD = 1.29; Min-Max = 25-30) and 28.00 (SD = 1.77, Min-Max = 23-30) for the RSpan and OSpan tasks, respectively. Cronbach’s alpha was .61 and .73 for the RSpan and OSpan tasks, respectively.
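For illustration, Cronbach’s alpha for a trial-level score matrix can be computed as below. The score matrix here is invented purely for demonstration; it is not the study’s data:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_participants, n_trials) score matrix."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

# Hypothetical trial-level recall scores (participants x trials), illustration only
scores = np.array([
    [3, 4, 4, 3],
    [5, 5, 6, 5],
    [2, 3, 2, 2],
    [4, 4, 5, 4],
    [3, 3, 4, 3],
], dtype=float)
print(round(cronbach_alpha(scores), 2))
```

Alpha approaches 1 as trials covary strongly relative to their individual variances, which is why the short RSpan (fewer, noisier set sizes) can plausibly yield lower reliability than the longer content-embedded tasks.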
Table 2. Descriptive statistics for the WMC tests

Performance in the content-embedded tasks was calculated as the number of correct responses per minute, following Zamary et al. (Reference Zamary, Rawson and Was2019).Footnote 7 For the digit task, only time spent on the response screen was considered, as the stimulus presentation time for each digit was fixed.Footnote 8 Cronbach’s alpha was .90 and .75 for the ABCD and digit tasks, respectively.Footnote 9
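The rate-based scoring can be sketched as follows; the counts and timings are hypothetical, and only response-screen time enters the denominator, as described above:

```python
def rate_score(n_correct: int, response_time_seconds: float) -> float:
    """Correct responses per minute, counting only time on the response screens
    (stimulus presentation time is fixed and therefore excluded)."""
    return n_correct / (response_time_seconds / 60.0)

# Hypothetical digit-task totals: 20 correct answers across response screens
# totaling 5 minutes
print(rate_score(20, 300.0))  # 4.0 correct responses per minute
```

This scoring rewards both accuracy and speed, so two participants with the same accuracy can receive different scores if one responds more quickly.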
A principal component analysis with oblimin rotation and Kaiser’s criterion of eigenvalue > 1.0 (Bartlett’s sphericity test, χ2 (6) = 67.07, p < .001, KMO = .70) revealed that all four WMC measures loaded on a single component (RSpan = .82; OSpan = .74; ABCD = .84; digit = .85), as expected given that all tests were designed to tap into the maintenance and processing of information. Forcing a two-component structure led to the RSpan (.75) and OSpan (.97) tests loading strongly onto one component, and the ABCD (.96) and digit (.91) tests loading strongly onto another, indicating that the two complex span tasks and the two content-embedded tasks each pattern together (see Appendix S2 online).
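The Kaiser-criterion step can be sketched minimally as follows, with a made-up correlation matrix standing in for the real data (and omitting the oblimin rotation used to obtain the loadings):

```python
import numpy as np

def kaiser_retained(corr: np.ndarray):
    """Eigenvalues of a correlation matrix (descending) and the number of
    components retained under Kaiser's eigenvalue > 1.0 criterion."""
    eigvals = np.linalg.eigvalsh(corr)[::-1]  # eigvalsh returns ascending order
    return eigvals, int((eigvals > 1.0).sum())

# Hypothetical 4x4 correlation matrix with one dominant shared component,
# standing in for the four WMC measures (illustration only)
corr = np.array([
    [1.0, 0.6, 0.5, 0.5],
    [0.6, 1.0, 0.5, 0.5],
    [0.5, 0.5, 1.0, 0.6],
    [0.5, 0.5, 0.6, 1.0],
])
eigvals, n_components = kaiser_retained(corr)
print(n_components)  # only one eigenvalue exceeds 1.0 here
```

With the hypothetical matrix above, the first eigenvalue (2.6) dominates and the rest fall below 1.0, mirroring the single-component outcome reported for the actual data.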
Explicit L2 aptitude
Descriptive statistics for the aptitude measures are displayed in Table 3. Participants received 1 point for each accurate response to the 15 and 24 multiple-choice items on the PLAB-4 and MLAT-5, respectively. The LLAMA-F and LLAMA-B were automatically scored by the program on a 0-100 scale. Cronbach’s alpha for the PLAB-4 and MLAT-5 tests was .77 and .87, respectively. Reliability statistics could not be computed for the LLAMA (v.2) tests because the software did not record item-level data.
Table 3. Descriptive statistics for the explicit L2 aptitude tests

A principal component analysis with oblimin rotation and Kaiser’s criterion of eigenvalue > 1.0 (Bartlett’s sphericity test, χ2 (6) = 37.19, p < .001, KMO = .73) revealed that all four tests loaded on a single component indicative of explicit L2 aptitude (PLAB-4 = .75; LLAMA-F = .71; MLAT-5 = .83; LLAMA-B = .73) (see Appendix S3 online).
L2 reading comprehension
Participants received 1 point for each correct response to the 10 multiple-choice questions in the reading comprehension test. The average score was 6.40 points (SD = 2.18; Min-Max = 2-10). Cronbach’s alpha for this test was .65, indicating adequate instrument reliability given the relatively small item set. Participants took an average of 8.04 minutes (SD = 3.10; Min-Max = 4.40-17.40) to complete the test.Footnote 10 While we provide the continuous mean score for this test here as part of the descriptive statistics, the dichotomous (correct/incorrect) item-level data were employed in the analyses, as we describe subsequently.
Analysis
We first computed a correlation matrix of all the standardized measures of the study to explore connections at the task level, using all available data. Next, to examine the utility of complex and content-embedded tasks for predicting explicit L2 aptitude and L2 reading comprehension, a series of mixed-effects models were computed. To limit the number of tests and coefficients in the models, we derived composite variables for Complex Span (mean of z-OSpan and z-RSpan scores) and Content-embedded (mean of z-ABCD and z-digit scores) WMC estimates. For explicit L2 aptitude, we fitted two separate linear mixed-effects models with proportion correct as the dependent variable, Complex Span and Content-embedded WMC as fixed effects, and Participant and Test as random effects, since individual accuracy data were considered across four aptitude tests. For L2 reading comprehension, we fitted two separate logit mixed-effects models with accuracy (1 = correct, 0 = incorrect) on the L2 reading comprehension test as the dependent variable, Complex Span and Content-embedded WMC as fixed effects, and Participant and Item as random effects. We also included time on task (in minutes, standardized) as a fixed-effect covariate in the baseline model. Random slopes were not retained in either set of models due to singular fits. For both dependent variables, nested models were compared using chi-squared tests to examine whether the inclusion of fixed effects improved model fit. Descriptive and inferential analyses were performed in jamovi (v.2.3.28.0), with the exception of the mixed-effects models, which were fitted and compared in R (v.4.1.1) using the lme4 (Bates, Mächler, Bolker, & Walker, Reference Bates, Mächler, Bolker and Walker2015), lmerTest (Kuznetsova, Brockhoff, & Christensen, Reference Kuznetsova, Brockhoff and Christensen2017), and jtools (Long, Reference Long2022) packages. Models were estimated using maximum likelihood, with listwise deletion of observations, and optimized with BOBYQA.
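The composite construction can be sketched as follows, with invented scores standing in for the real data (the actual analyses were run in jamovi and R):

```python
import numpy as np

def zscore(x: np.ndarray) -> np.ndarray:
    """Standardize a score vector (mean 0, SD 1, sample SD)."""
    return (x - x.mean()) / x.std(ddof=1)

def composite(scores_a: np.ndarray, scores_b: np.ndarray) -> np.ndarray:
    """Mean of two standardized score vectors, as for the Complex Span
    and Content-embedded WMC composites."""
    return (zscore(scores_a) + zscore(scores_b)) / 2

# Hypothetical RSpan and OSpan total scores for five participants
rspan = np.array([30.0, 42.0, 25.0, 38.0, 35.0])
ospan = np.array([28.0, 45.0, 22.0, 40.0, 33.0])
complex_span_wmc = composite(rspan, ospan)
print(complex_span_wmc.round(2))
```

Standardizing before averaging ensures that neither task dominates the composite simply because its raw scale has more variance.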
Results
Table 4 displays the correlation matrix of all task-level measures. Regarding the complex span tasks, only the RSpan showed significant positive correlations with multiple measures of L2 aptitude and reading comprehension. Both content-embedded tasks evidenced significant positive associations with various measures. The Complex Span WMC and Content-embedded WMC composites were also significantly correlated, r = .57, p <.001. Of all the significant correlation coefficients, only the bootstrapped 95% CIs for the coefficient between the RSpan and the MLAT-5 crossed zero (see Appendix S4 online for full table). We now turn to describing results from the mixed-effects models.
Table 4. Correlation matrix: WMC, L2 aptitude, and L2 reading comprehension tests

Note: L2RCT = L2 Reading Comprehension Test.
*p < .05, **p < .01, ***p < .001
WMC and explicit L2 aptitude
Tables 5 and 6 display the estimates for the linear mixed-effects models containing Complex Span WMC and Content-embedded WMC as fixed effects, respectively. Both WMC variables were significant positive predictors of explicit L2 aptitude. The marginal R2, representing the variance explained by fixed effects, was slightly larger for the Content-embedded WMC model (.11) than for the Complex Span WMC model (.08).
Table 5. Complex span WMC and L2 aptitude: Estimates for linear model

Note: R2 marginal = .08; R2 conditional = .58.
Table 6. Content-embedded WMC and L2 aptitude: Estimates for linear model

Note: R2 marginal = .11; R2 conditional = .57.
To probe deeper into the contributions of each WMC predictor, a series of nested models (see Table 7) were compared using chi-square tests. A model with Complex Span WMC as a fixed effect (Model 1a) provided a significantly better fit than the random-effects baseline model (Model 0), with only participant- and test-level intercepts, χ2 (1) = 8.90, p = .003, as did a model with Content-embedded WMC as a fixed effect (Model 1b), χ2 (1) = 13.32, p <.001. Additionally, a more complex model that included both WMC measures (Model 2) provided a significantly better fit than a model with only Complex Span WMC as a fixed effect (Model 1a), χ2 (1) = 6.14, p = .013, but this was not the case when compared to a model with only Content-embedded WMC as a fixed effect (Model 1b), χ2 (1) = 1.71, p = .191. That is, while adding Content-embedded WMC into a model that included Complex Span WMC as a fixed effect significantly improved model fit, the opposite was not true, suggesting that Content-embedded WMC explains variance beyond Complex Span WMC.
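The chi-square comparisons reported here are likelihood-ratio tests between nested models. A minimal sketch for a one-parameter difference follows, with hypothetical log-likelihoods chosen only to reproduce the reported Model 0 vs. Model 1a statistic:

```python
from math import erfc, sqrt

def lr_test_1df(loglik_reduced: float, loglik_full: float):
    """Likelihood-ratio test for nested models differing by one parameter.
    Returns the chi-square statistic and its p-value (df = 1)."""
    stat = 2 * (loglik_full - loglik_reduced)
    p = erfc(sqrt(stat / 2))  # chi-square survival function with df = 1
    return stat, p

# Hypothetical log-likelihoods chosen to reproduce the reported comparison
# of Model 0 vs. Model 1a (chi-square (1) = 8.90)
stat, p = lr_test_1df(loglik_reduced=-512.30, loglik_full=-507.85)
print(round(stat, 2), round(p, 3))
```

The asymmetry in the comparisons above follows directly from this logic: adding a predictor is justified only when it raises the log-likelihood enough relative to the one extra parameter spent.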
Table 7. Explicit L2 aptitude: Summary of models

WMC and L2 reading comprehension
Tables 8 and 9 display the estimates for the logit mixed-effects models with Complex Span WMC and Content-embedded WMC as fixed effects, respectively. While only Content-embedded WMC emerged as a significant coefficient, the odds ratios associated with both WMC predictors followed the expected positive pattern of association. For each unit increase in Content-embedded WMC, the odds of answering an item correctly on the L2 reading comprehension test are multiplied by 1.63, compared to the slightly lower factor of 1.54 associated with each unit increase in Complex Span WMC. In line with the results for explicit L2 aptitude, the marginal R2 was slightly larger for the Content-embedded WMC model (.04) than for the Complex Span WMC model (.02).
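To aid interpretation of the odds ratios, a small sketch converts an odds ratio into a change in predicted accuracy, assuming a hypothetical 60% baseline probability of a correct response:

```python
def shift_probability(p_baseline: float, odds_ratio: float) -> float:
    """Apply an odds ratio to a baseline probability of a correct response."""
    odds = p_baseline / (1 - p_baseline)
    new_odds = odds * odds_ratio
    return new_odds / (1 + new_odds)

# With the reported odds ratio of 1.63 and a hypothetical 60% baseline,
# a one-unit increase in Content-embedded WMC lifts accuracy to about 71%
print(round(shift_probability(0.60, 1.63), 3))
```

Because the logistic link is nonlinear, the same odds ratio implies different absolute accuracy gains at different baselines, which is worth keeping in mind when comparing the 1.63 and 1.54 values.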
Table 8. Complex span WMC and L2 reading comprehension: Estimates for logit model

Note: R2 marginal = .02; R2 conditional = .40.
Table 9. Content-embedded WMC and L2 reading comprehension: Estimates for logit model

Note: R2 marginal = .04; R2 conditional = .40.
Nested models (see Table 10) were again compared using chi-square tests. A model with Complex Span WMC as a fixed effect (Model 1a) provided a descriptively better fit than the baseline model (Model 0), with only participant- and item-level random intercepts and time on task as a covariate, but this difference was not statistically significant, χ2 (1) = 3.28, p = .070. In contrast, a model with Content-embedded WMC as a fixed effect (Model 1b) provided a significantly better fit than the baseline, χ2 (1) = 4.79, p = .029. A model with both WMC predictors (Model 2) did not provide a significantly better fit than one with either only Complex Span WMC (Model 1a), χ2 (1) = 2.16, p = .141, or Content-embedded WMC as a fixed effect (Model 1b), χ2 (1) = .64, p = .422. In sum, unlike for explicit L2 aptitude, only a model with Content-embedded WMC as a fixed effect provided a significantly better fit to the data than the baseline model.
Table 10. L2 reading comprehension: Summary of models

Supplemental analyses
In line with Zamary et al.’s (Reference Zamary, Rawson and Was2019) and Was et al.’s (Reference Was, Rawson, Bailey and Dunlosky2011) reasoning, we interpret these results in connection with the functional differences previously described between content-embedded and complex span tasks. However, following Zamary et al. (Reference Zamary, Rawson and Was2019), we decided to conduct a series of supplemental analyses to discount alternative, artifact-related explanations for our results.
Supplemental model set 1
First, even though complex span tasks are typically scored using total accuracy on the storage component of the task (see Conway et al., Reference Conway, Kane, Bunting, Hambrick, Wilhelm and Engle2005), individuals’ performance on the processing component (i.e., judging sentences or solving math equations) can also vary nontrivially (see Richmond et al., Reference Richmond, Burnett, Morrison and Ball2022). It is possible that scoring complex span tasks by considering performance on both the storage and processing components increases their predictive value in relation to explicit L2 aptitude. To test this, we ran a series of parallel models with a composite Complex Span WMC variable derived from the mean of the standardized total span and processing accuracy scores in the RSpan and OSpan tasks (see Appendix S5 online). To summarize, the models produced the same qualitative pattern of results as the main analyses:
Regarding explicit L2 aptitude, a model with the new Complex Span WMC composite as a fixed effect provided a significantly better fit than the baseline model, χ2 (1) = 9.62, p = .002. A fuller model that included both WMC predictors also provided a significantly better fit than a model with only the new Complex Span WMC composite as a fixed effect, χ2 (1) = 5.28, p = .021, but this was not the case when compared to a model with only Content-embedded WMC as a fixed effect, χ2 (1) = 1.56, p = .211.
Regarding L2 reading comprehension, a model with Complex Span WMC as a fixed effect did not provide a significantly better fit than the baseline model, χ2 (1) = 2.86, p = .091. Additionally, a model with both cognitive predictors did not provide a significantly better fit than a model with either Complex Span WMC, χ2 (1) = 2.14, p = .143, or with Content-embedded WMC as a fixed effect, χ2 (1) = .21, p = .645.
Supplemental model set 2
Lastly, an additional set of alternative models was computed without considering OSpan performance, that is, with only the RSpan storage and processing accuracy scores included in the composite (see Appendix S6 online). While this supplemental analysis entailed excluding part of the data, we found it to be pertinent given that the RSpan remains the most widely employed WMC measure in SLA research, with usage frequencies far exceeding those of the OSpan.
For explicit L2 aptitude, the additional models also yielded the same pattern of results as our main analyses: a model with the RSpan estimate as a fixed effect provided a significantly better fit than the baseline model, χ2 (1) = 9.30, p = .002. A fuller model that included both cognitive predictors also provided a significantly better fit than a model with only RSpan as a fixed effect, χ2 (1) = 5.37, p = .021, but this was not the case when compared to a model with only Content-embedded WMC as a fixed effect, χ2 (1) = 1.34, p = .246.
For L2 reading comprehension, results departed slightly from the main models. In this case, a model with RSpan as a fixed effect did provide a significantly better fit than the baseline model, χ2 (1) = 4.17, p = .041. However, a model with both WMC predictors did not provide a significantly better fit than a model with either only RSpan as a fixed effect, χ2 (1) = 1.44, p = .230, or one with only Content-embedded WMC as a fixed effect, χ2 (1) = .82, p = .365.
Discussion
The main goal of this methods forum was to begin exploring the potential value of content-embedded tasks for advancing SLA research. To this end, we first discussed their functional differences relative to complex span tasks and then made a case to consider them in light of core theoretical assumptions in cognitive accounts of L2 processing and development. Next, we reported preliminary empirical evidence suggesting that both content-embedded tasks and complex span tasks can predict measures of explicit L2 aptitude and L2 reading comprehension, but that content-embedded tasks can show stronger links relative to complex span tasks in some cases. Specifically, we found that explicit L2 aptitude was significantly predicted by both task types, but also that the inclusion of content-embedded WMC improved model fit when complex span WMC was already considered. This suggests that content-embedded WMC tasks provide explanatory power beyond that offered by complex span WMC tasks with regard to explicit L2 aptitude. Regarding L2 reading comprehension, results were less definitive: comprehension was significantly predicted only by content-embedded WMC, and considering complex span WMC along with content-embedded WMC did not improve fit over a single-predictor model. While this can still suggest a stronger link between L2 reading comprehension and content-embedded WMC relative to complex span WMC, results should be considered more tentatively, as we further discuss later on.
Overall, this set of findings may be interpreted in connection with the functional requirements of each task type. It can be argued that, relative to complex span tasks, content-embedded tasks can offer “a more direct measure of an individual’s ability to maintain information in [working memory] that is relevant to the cognitive process being performed” (Was et al., Reference Was, Rawson, Bailey and Dunlosky2011, p. 914). Whereas complex span tasks require storing information (e.g., letters) active in working memory while completing an unrelated processing component (e.g., solving math equations), content-embedded tasks require continuous manipulation of task-relevant information that is maintained in memory. In this regard, it is possible to argue that the demands for successful performance in the ABCD and digit tasks more closely approximate the nature of simultaneous processing and storage required in common explicit L2 aptitude tests, as well as in other complex cognitive tasks that are ubiquitous in intentional learning conditions. Under such conditions, learners are typically prompted to consciously infer or figure out links between various linguistic elements, and such processes can be assumed to rely heavily on their capacity for the temporary storage and manipulation of task-relevant data (see Granena, Reference Granena, Granena and Long2013, Reference Granena2014; Kormos, Reference Kormos, Granena and Long2013; Wen, Reference Wen2016).
Study outcomes extend previous research investigating connections between learner individual differences in WMC and explicit L2 aptitude (e.g., Sáfár & Kormos, Reference Sáfár and Kormos2008; Roehr & Gánem-Gutiérrez, Reference Roehr and Gánem-Gutiérrez2009; Yalçin et al., Reference Yalçin, Çeçen and Erçetin2016) as well as L2 reading comprehension (e.g., Jeon & Yamashita, Reference Jeon and Yamashita2014; Leeser, Reference Leeser2007; Sagarra, Reference Sagarra2017), with methodological implications for future work beyond these lines of inquiry. Findings highlight that complex span tasks that tap into “the ability to maintain information that is irrelevant to the information being processed in the working memory system” (Zamary et al., Reference Zamary, Rawson and Was2019, p. 2547), which have been prioritized in L2 research to date, may be less predictive of explicit L2 aptitude outcomes than content-embedded tasks, and possibly of certain L2 reading comprehension measures. These results also add to the evidence supporting the benefits of content-embedded tasks over complex span tasks for predicting several higher-level cognitive functions, including L1 reading comprehension (Was et al., Reference Was, Rawson, Bailey and Dunlosky2011) and inductive reasoning (Zamary et al., Reference Zamary, Rawson and Was2019).
Following Was et al.’s proposal (2011), the more robust connection observed between the content-embedded WMC composite and explicit L2 aptitude—as well as L2 reading comprehension, albeit to a lesser degree—may also suggest that this variable is reflective of participants’ general cognitive capacity to a greater extent than the complex span WMC composite. The RSpan and the OSpan have highly comparable designs, with both tasks similarly structured to tap into individuals’ ability to store memory elements during interference or distraction. In contrast, the storage and processing demands of the ABCD and digit tasks are coordinated slightly differently across modalities. While both of these tasks require participants to encode, maintain, and manipulate information in their working memory, there are differences in the way verbal or numerical information is presented and how the processing demand is intertwined across trials in each task. In this respect, it is possible that content-embedded tasks can better complement each other methodologically, while still tapping into the same construct, potentially capturing a wider breadth of working memory performance.
From a methodological standpoint, the insights provided by the supplemental analyses, which aimed to rule out alternative artifact-related reasons to explain results, also warrant some discussion. Findings indicate that the stronger link between the criterion measures and content-embedded WMC remains even when individual differences in both storage and processing performance are considered when scoring the complex span tasks, in line with Zamary et al. (Reference Zamary, Rawson and Was2019). This suggests that the predictive strength observed here for content-embedded tasks, particularly with regard to explicit L2 aptitude, may not be accounted for by claiming that they are a better measure of an individual’s ability to store and process information during a cognitive task. The most plausible explanation comes back to the aforementioned functional differences across task types with respect to the maintenance of task-relevant information in the interdependent storage and processing components.
For L2 reading comprehension, a slightly different picture emerged, since the supplemental models showed that both the new RSpan and the content-embedded WMC composites were significant predictors. Since the RSpan requires some basic knowledge and skills in the verbal domain, it is reasonable to expect this test to predict verbal outcome measures better than tests with a numerical processing component, such as the OSpan, due to the shared dependence on general verbal ability (see, e.g., Juffs & Harrington, Reference Juffs and Harrington2011; Shin & Hu, Reference Shin, Hu, Schwieter and Wen2022). In the present study, the close alignment between the cognitive and linguistic demands of the L2 reading comprehension test and the verbal processing component of the RSpan may have contributed to some of its predictive effects observed in the supplemental analyses. Nonetheless, any inferences from the L2 reading comprehension data should be drawn carefully because, unlike the explicit L2 aptitude data, which were elicited using four distinct tests with a considerable number of items each, the reading comprehension data were derived from a single type of test containing a small number of items.
In sum, the findings of this study have ramifications that align with our general aim of exploring the potential value of content-embedded tasks for advancing cognitively-oriented SLA research and theory. From a methodological standpoint, our results suggest that for L2 researchers interested in understanding how WMC is associated with explicit L2 aptitude, and possibly L2 reading comprehension, considering content-embedded tasks as part of their design can be worthwhile—either as a way to understand participant profiles or within an individual-differences approach. Additionally, there are practical reasons that can make the inclusion of these tasks advantageous for researchers, such as their fast and uncomplicated administration, similar to the abridged versions of the complex span tasks. From a theoretical perspective, going forward, the use of these tasks in conjunction with complex span tasks may allow researchers to delve deeper into the cognitive mechanisms underlying intentional L2 learning. This, in turn, can contribute to the development of more comprehensive models of SLA that account for the interplay between WMC and different L2 learning processes and outcomes.
Lastly, we wish to highlight that our intention is not to discourage the continued use of complex span tasks, or to diminish their theoretical relevance and robustness for SLA research. Quite the contrary, their utility is undoubtedly evidenced in decades of empirical research, which have demonstrated the superiority of these tests over traditional short-term memory span tests (i.e., tests requiring only storage of target memory elements) for explaining linguistic outcomes (see, e.g., Linck et al., Reference Linck, Hughes, Campbell, Silbert, Tare, Jackson and Doughty2013). Certainly, both complex span and content-embedded tasks tap into individual differences in the limited-capacity attentional resources that are broadly relevant for engaging in intentional L2 learning. Initial evidence from this study suggests that considering learners’ performance in content-embedded tasks can provide additional unique value for cognitively-oriented L2 research, at least in certain cases, and that the increased adoption of these tasks alongside complex span tasks can, ultimately, contribute to enhancing the explanatory and predictive scope of WMC in SLA.
Limitations and conclusions
Findings from this study should be interpreted considering its limitations. While the initial evidence presented here highlights the potential of content-embedded tasks, further research utilizing a wide range of L1 and L2 measures is necessary to support more fine-grained claims about the functional significance of these tests for advancing SLA research and theory. As previously mentioned, the current study employed one type of L2 reading comprehension assessment which included a small set of test items; thus, future work may wish to consider more comprehensive test formats that can broadly capture the diverse processes involved in L2 reading comprehension. Additionally, our sample of university students was relatively modest in size. Future research studies with larger sample sizes would allow for a more complex examination of the relationships among cognitive and linguistic variables through advanced statistical techniques, such as structural equation modeling, which can also account for measurement error. Studies considering process-level data in conjunction with outcome-level data would also be fruitful in understanding the mechanisms underlying WMC effects as measured by different task types.
To conclude, this study contributes to recent discussions on the measurement of WMC in L2 studies (e.g., Shin & Hu, Reference Shin, Hu, Schwieter and Wen2022; Wen et al., Reference Wen, Juffs, Winke, Winke and Brunfaut2021). While primarily focused on methodology, the findings also hold theoretical relevance. The demands of both complex span and content-embedded tasks are in accordance with the notion of working memory as a workspace for the processing and storage of information. However, it may be argued that content-embedded tasks better align with the core requirements associated with certain intentional L2 learning processes. One possibility is that successful L2 outcomes in conscious and effortful learning and processing conditions are not just predicted by one’s overall WMC for managing and storing information; for a range of L2 outcomes, which need to be further investigated, the emphasis may lie on the coordination of task-relevant information within a limited capacity system as an important individual difference underpinning variability.
Supplementary material
The supplementary material for this article can be found at http://doi.org/10.1017/S0272263125000154.
Data availability statement
The experiment in this article earned an Open Materials badge for transparent practices. The data are available at https://url.avanan.click/v2/r02/ and https://www.iris-database.org/details/iv6nR-HD9NQ.
Acknowledgments
Financial support for this research was provided by the College of Liberal Arts at Temple University. We would like to extend our sincere gratitude to Christopher Was and Amanda Zamary for their invaluable assistance with the content-embedded tasks. We are grateful to Nick Pandža for statistical advice, and to Philip Hamrick and Christopher Was, again, for valuable discussions about this work. We also thank Alex Alarcón, Molly Clark, Alberto Fernández del Valle, Daniel Guarín, Georgia Kikis, Josh Pongan, and Coral Zayas-Colón for their help with various aspects of this research. Finally, we thank the editors and anonymous reviewers at SSLA for their helpful feedback on earlier versions of this manuscript. Any errors are our own.
Competing interest
The authors declare none.