INTRODUCTION
Understanding the relationship between implicit (unconscious) learning and knowledge is fundamental to second language acquisition (SLA) theory and pedagogy. In recent years, researchers have turned to measures of language aptitude (an individual’s ability to learn language) to better understand the nature of the different types of linguistic knowledge. Results have shown that explicit aptitude predicts the knowledge that results from explicit instruction (Li, 2015, 2016; Skehan, 2015); however, evidence for the effects of implicit-statistical learning aptitude on implicit knowledge has been limited in the field of SLA (compare Granena, 2013; Suzuki & DeKeyser, 2017). In this project, we address two questions related to implicit-statistical learning aptitude and second language (L2) knowledge: (1) whether implicit-statistical learning aptitude is a componential mechanism (convergent validity) and (2) the extent to which different types of implicit-statistical learning tasks predict implicit knowledge (predictive validity). We expand the number of implicit-statistical learning aptitude measures beyond serial reaction time to obtain a more comprehensive assessment of learners’ implicit-statistical aptitude. Alongside these measures, we administered a battery of linguistic knowledge tests designed to measure explicit and implicit L2 knowledge. This design allows us to examine how implicit-statistical learning aptitude predicts the development of implicit L2 knowledge.
IMPLICIT-STATISTICAL LEARNING APTITUDE
Implicit-statistical learning denotes one’s ability to pick up regularities in the environment (Frost et al., 2019). Learners with greater implicit-statistical learning aptitude, for instance, can segment word boundaries (statistical learning) and detect regularities in artificial languages (implicit language learning) better than those with lower implicit-statistical learning ability (for a comprehensive review of the unified framework of implicit-statistical learning, see Christiansen, 2019; Conway & Christiansen, 2006; Perruchet & Pacton, 2006; Rebuschat & Monaghan, 2019). This process of implicit-statistical learning is presumed to take place incidentally, without instructions to learn or a conscious intention on the part of the learner to do so.
Traditionally, implicit-statistical learning ability has been conceptualized as a unified construct: learning from different modes of input, such as vision, audition, and touch, is interrelated, and a common implicit-statistical learning mechanism governs the extraction of patterns across different modes of input. Recently, however, a growing body of research has shown that implicit-statistical learning may operate differently in different modalities and with different stimuli, yet still be subserved by domain-general computational principles (for reviews, see Arciuli, 2017; Frost et al., 2015; Siegelman et al., 2017a). In this view, implicit-statistical learning is modality and stimulus constrained (as the encoding of information in different modalities relies on different parts of the body and different cortices), but this modality-specific information is subject to domain-general processing principles that invoke shared brain regions. Implicit-statistical learning is thus modality specific at the level of encoding while also obeying domain-general computational principles at a more abstract level. If implicit-statistical learning is a componential ability, it follows that a more comprehensive approach to measurement is needed, one that brings together different tasks tapping into different components of implicit-statistical learning. Our first aim, accordingly, is to test the convergent validity of implicit-statistical learning measures by assessing the interrelationships between different measures of implicit-statistical learning. Doing so will inform measurement and help illuminate the theoretical construct of implicit-statistical learning.
In SLA, researchers have relied on different measures to capture implicit learning, statistical learning, and the related construct of procedural memory (see Appendix S1 in online Supplementary Materials). For instance, implicit learning aptitude has been measured with the LLAMA D test of phonemic coding ability (Granena, 2013, 2019; Yi, 2018), the serial reaction time task (Granena, 2013, 2019; Hamrick, 2015; Linck et al., 2013; Suzuki & DeKeyser, 2015, 2017; Tagarelli et al., 2016; Yi, 2018), and the alternating serial reaction time (ASRT) task (Faretta-Stutenberg & Morgan-Short, 2018; Tagarelli et al., 2016). The ASRT task doubles as a measure of procedural memory (Buffington & Morgan-Short, 2018; Buffington et al., 2021; Faretta-Stutenberg & Morgan-Short, 2018; Hamrick, 2015). Other measures of procedural memory are the Tower of London (TOL) (Antoniou et al., 2016; Ettlinger et al., 2014; Morgan-Short et al., 2014; Pili-Moss et al., 2019; Suzuki, 2017) and the Weather Prediction Task (Faretta-Stutenberg & Morgan-Short, 2018; Morgan-Short et al., 2014; Pili-Moss et al., 2019). Lastly, statistical learning has only been measured in the auditory modality in L2 research to date, with different tests of verbal auditory statistical learning (Brooks & Kempe, 2013; McDonough & Trofimovich, 2016; Misyak & Christiansen, 2012).
These different measures can provide insight into the nature of the learning processes that individuals draw on in different language learning tasks. Specifically, when performance on the linguistic task and the aptitude measure share variance, a common cognitive process (i.e., implicit-statistical learning or procedural memory) can be assumed to guide performance on both tasks. To illustrate, Yi (2018) found that native English speakers’ performance on a serial reaction time task predicted (i.e., shared variance with) their phrasal acceptability judgment speed. A similar association for L2 speakers between their explicit aptitude and phrasal acceptability judgment accuracy led the author to conclude that L1 speakers process collocations implicitly whereas L2 speakers process them more explicitly.
Although the use of implicit-statistical learning aptitude measures in L2 research is rising, stronger theoretical and psychometric justification is needed for these measures. The possibility that implicit-statistical learning may not be a unitary construct highlights the need to motivate the choice of specific aptitude measure(s) and examine their construct validity, with due consideration of the measures’ input modality (Frost et al., 2015). The questions of convergent validity (correlation with related measures) and divergent validity (dissociation from unrelated measures) have implications for measurement as well as SLA theory. Indeed, if implicit-statistical learning aptitude is to fulfill its promise as a cognitive variable that can explain the learning mechanisms that operate in different L2/foreign language contexts, for different target structures, and for learners of different L2 proficiency levels, valid and reliable measurement will be paramount.
In recent years, some researchers have begun to examine the construct validity of implicit-statistical learning aptitude measures by exploring their relationship to implicit memory (Granena, 2019), procedural memory (Buffington et al., 2021; Buffington & Morgan-Short, 2018), and working memory and explicit learning aptitude (Yi, 2018). For measures of implicit learning aptitude, Granena (2019) found that the serial reaction time task loaded onto a different factor than the LLAMA D in an exploratory factor analysis (EFA), suggesting the two measures did not converge. Similarly, Yi (2018) reported that the serial reaction time task and LLAMA D were uncorrelated and that the reliability of LLAMA D was low. In a study combining measures of implicit learning aptitude and procedural memory, Buffington et al. (2021) also observed a lack of convergent validity between the ASRT, the Weather Prediction Task, and the TOL. These results do not support a unitary view of implicit-statistical learning aptitude or procedural memory. Furthermore, this research has yet to include measures of statistical learning as another approach to the same phenomenon (Christiansen, 2019; Conway & Christiansen, 2006; Perruchet & Pacton, 2006; Reber, 2015; Monaghan & Rebuschat, 2019). More research is needed to advance our understanding of these important issues.
With this study, we aim to advance this research agenda. We consider multiple dimensions of implicit-statistical learning aptitude, their reliabilities, and their interrelationships (convergent validity). Of the various measures of implicit-statistical learning aptitude used in SLA and cognitive psychology, we included measures that represent different input streams: visual statistical learning (VSL) for visual input, auditory statistical learning (ASL) for aural input, and the ASRT for motor and visual input. In addition, we included the TOL task in recognition of its wide use in SLA research, alongside the ASRT task, as a measure of procedural memory.
IMPLICIT, AUTOMATIZED EXPLICIT, AND EXPLICIT KNOWLEDGE
It is widely believed that language users possess at least two types of linguistic knowledge: explicit and implicit. Explicit knowledge is conscious and verbalizable knowledge of forms and regularities in the language that can be acquired through instruction. Implicit knowledge is tacit and unconscious linguistic knowledge that is gained mainly through exposure to rich input and therefore cannot be easily taught. A third type of knowledge, automatized explicit knowledge, denotes explicit knowledge that language users are able to use rapidly, in time-pressured contexts, as a result of their extensive practice with the language. While the use of (nonautomatized) explicit knowledge tends to be slow and effortful, both implicit and automatized explicit knowledge can be deployed rapidly, with little or no conscious effort, during spontaneous communication (DeKeyser, 2003; Ellis, 2005). Consequently, it has been argued that implicit and automatized explicit knowledge are “functionally equivalent” (DeKeyser, 2003), in that it may be impossible to distinguish between the two in practice.
In a landmark study, Ellis (2005) proposed a set of criteria to guide the design of tests that could provide relatively separate measures of explicit and implicit knowledge. Using principal component analysis, Ellis showed that time-pressured grammar tests that invite a focus on meaning (content creation) or form (linguistic accuracy) loaded onto one component (i.e., an oral production [OP] task, elicited imitation [EI], and a timed grammaticality judgment test [GJT]), which Ellis termed implicit knowledge. Untimed grammar tests that focus learners’ attention on form (i.e., ungrammatical items on an untimed GJT and a metalinguistic knowledge test [MKT]) loaded onto a different component, which Ellis labeled explicit knowledge (see Ellis & Loewen, 2007, for a replication of these findings with confirmatory factor analysis). Subsequent studies using factor analysis on similar batteries of language tests also uncovered at least two dimensions of linguistic knowledge, termed explicit and implicit, which was largely consistent with Ellis’s initial results (e.g., Bowles, 2011; Kim & Nam, 2017; Spada et al., 2015; Zhang, 2015; but see Gutiérrez, 2013).
The advent of reaction-time measures, however, invited new scrutiny of the construct validity of traditional measures of implicit knowledge such as the EI task and the timed written GJT (compare Ellis, 2005; Suzuki & DeKeyser, 2015, 2017; Vafaee et al., 2017). The theoretical debate surrounding this issue centered on the distinction between implicit and automatized explicit knowledge, described previously, and on whether, aside from differences in neural representation, the two types of knowledge can be differentiated behaviorally, in L2 learners’ language use. Departing from Ellis (2005), researchers have hypothesized that timed, accuracy-based tests (e.g., EI) may be better suited to tap into learners’ automatized explicit knowledge because timed tests do not preclude learners from accessing their explicit knowledge, but merely make it more difficult for learners to do so (DeKeyser, 2003; Suzuki & DeKeyser, 2015). Reaction-time tests such as self-paced reading (SPR), however, require participants to process language in real time, as it unfolds, and could therefore, hypothetically, be more appropriate for capturing learners’ implicit knowledge (Godfroid, 2020; Suzuki & DeKeyser, 2015; Vafaee et al., 2017). In the implicit-statistical learning literature, Christiansen (2019) similarly argued for the use of processing-based measures (e.g., reaction time tasks) over reflection-based tests (e.g., judgment tasks) to measure the effects of implicit-statistical learning. He did not, however, attribute differences in construct validity to the two types of tests (i.e., both are assumed to measure largely implicit knowledge, but at different levels of sensitivity or completeness).
Using confirmatory factor analysis, Suzuki (2017) and Vafaee et al. (2017) confirmed that timed, accuracy-based tests and reaction-time tests represent different latent variables, which they interpreted as automatized explicit knowledge and implicit knowledge, respectively. The researchers did not include measures of (nonautomatized) explicit knowledge, however, which leaves the results open to alternative explanations. Specifically, for automatized explicit knowledge to be a practically meaningful construct, it needs to be distinguishable from implicit knowledge and (nonautomatized) explicit knowledge simultaneously, within the same statistical analysis. Doing so requires a more comprehensive approach to measurement, with tests of linguistic knowledge sampled from across the whole explicit/automatized explicit/implicit knowledge spectrum. Hence, current evidence for the construct validity of reaction-time tasks as measures of implicit knowledge is still preliminary.
More generally, all the previous validation studies have included only a subset of commonly used explicit/implicit knowledge tests in SLA, which limits the generalizability of findings. Differences in test batteries may explain the conflicting findings for tests such as the timed written GJT (see Godfroid et al., 2015). This is because the results of confirmatory factor analysis are based on variance-covariance patterns for the tests included in the analysis, and hence different test combinations may give rise to different statistical solutions. To obtain a more comprehensive picture, Godfroid et al. (2018) synthesized 12 years of test validation research since Ellis (2005) by including all previously used measures in one study—the word monitoring test (WMT), SPR, EI, OP, timed/untimed GJTs in the aural and written modes, and the MKT. The results suggested that both a three-factor model (EI and timed written GJT as “automatized explicit knowledge”; Suzuki & DeKeyser, 2015, 2017) and a two-factor model (EI and timed written GJT as “implicit knowledge”; Ellis, 2005) provided a good fit for the data and that the two models did not differ significantly. These results support the viability of a three-way distinction between explicit, automatized explicit, and implicit knowledge. As with all factor analytic research, however, the nature of the latent constructs was left to the researchers’ interpretation. Other sources of validity evidence, such as the different patterns of aptitude-knowledge associations examined here, could support the proposed interpretation and bolster the case for the distinction between implicit and automatized explicit knowledge.
CONTRIBUTIONS OF IMPLICIT-STATISTICAL LEARNING APTITUDE TO IMPLICIT KNOWLEDGE
Three studies to date have examined aptitude-knowledge associations in advanced L2 speakers with a focus on measurement validity. We review each study in detail because of its relevance to the current research. Granena (2013) compared Spanish L1 speakers’ and Chinese-Spanish bilinguals’ performance on measures of explicit and implicit knowledge, using both agreement and nonagreement structures in Spanish. The participants had acquired Spanish either from birth, early in life, or postpuberty. Granena wanted to know whether the participants’ starting age impacted the cognitive processes they drew on for language learning. She found that early and late bilinguals’ performance on agreement structures correlated with their implicit-statistical learning aptitude, as measured by a serial reaction time task (early learners) or LLAMA D (late learners). These results suggested that bilinguals who do not acquire the language from birth may still draw on implicit-statistical learning mechanisms, albeit to a lesser extent than native speakers do; hence the bilinguals’ greater sensitivity to individual differences in implicit-statistical learning aptitude compared to native speakers.
Suzuki and DeKeyser (2015) compared the construct validity of EI and the WMT as measures of implicit knowledge. L1 Chinese-L2 Japanese participants performed an EI test with a built-in monitoring task. They were asked to listen to and repeat sentences, as is commonly done in an EI test, but, in addition, they were asked to monitor the spoken sentences for a given target word (i.e., built-in word monitoring). The researchers found that performance on the two test components correlated with different criterion variables; specifically, EI correlated with performance on an MKT (a measure of explicit knowledge), whereas the WMT correlated with performance on the serial reaction time task (a measure of implicit-statistical learning aptitude), albeit only in a subgroup of participants who had lived in Japan for at least 2.5 years. Based on these results, the authors concluded that the WMT is a measure of implicit linguistic knowledge, whereas the EI test (traditionally considered a measure of implicit knowledge as well) is best considered a measure of automatized explicit knowledge.
In a follow-up study, Suzuki and DeKeyser (2017) examined the relationships among implicit knowledge, automatized explicit knowledge, implicit-statistical learning aptitude, explicit learning aptitude, and short-term memory. Unlike Granena (2013) and Suzuki and DeKeyser (2015), the researchers found no significant association between serial reaction time (a measure of implicit-statistical learning aptitude) and either implicit or automatized explicit knowledge. Rather, they found that advanced Japanese L2 students’ performance on LLAMA F (a measure of explicit learning aptitude) predicted their automatized explicit knowledge. The authors also tested the explanatory value of adding a knowledge interface (i.e., a directional path) between automatized explicit and implicit knowledge in the structural equation model (SEM). This path was indeed significant, meaning that automatized explicit knowledge predicted implicit knowledge, but the interface model as a whole was not significantly different from a noninterface model that did not include such a path. The researchers interpreted their results as evidence that automatized explicit knowledge directly impacts the acquisition of implicit knowledge (through the interface), and that explicit learning aptitude indirectly facilitates the development of implicit knowledge. Thus, in their study no direct predictors of implicit knowledge were found.
Taken together, Granena (2013) and Suzuki and DeKeyser (2015) found a positive correlation between implicit knowledge test scores (i.e., sensitivity on a WMT) and implicit-statistical learning aptitude, in line with the view that the WMT, a reaction-time measure, may index implicit knowledge. In Suzuki and DeKeyser’s (2017) SEM, however, the same implicit-statistical learning aptitude test had no association with the implicit knowledge construct, which was composed of three reaction-time measures, including a WMT (incidentally, none of the three reaction-time measures loaded onto the implicit knowledge factor significantly, which may have signaled a problem with these measures or with the assumption that they were measuring implicit knowledge). Critically, the three studies used only a very limited set of implicit-statistical learning aptitude measures (serial reaction time and, in Granena’s study, LLAMA D), which examine implicit-statistical motor learning and phonemic coding ability, respectively. Given that the implicit-statistical learning construct is modality specific (i.e., implicit-statistical learning can occur in visual, aural, and motor modes), the limited range of implicit-statistical learning aptitude tests in these studies restricts the generalizability of the results to the tests with which they were obtained. Another issue concerns the low reliability of aptitude and knowledge measures obtained from reaction time data (Draheim et al., 2019; Rouder & Haaf, 2019), which may obscure any aptitude-knowledge relationships. In recognition of these gaps, we included a battery of four implicit-statistical learning aptitude tests (VSL, ASL, ASRT, and TOL) in order to examine the predictive validity of implicit-statistical learning aptitude for implicit, automatized explicit, and explicit L2 knowledge.
RESEARCH QUESTIONS
In this study, we triangulate performance on a battery of nine linguistic knowledge tests with data from four measures of implicit-statistical learning aptitude, with the aim of validating a new and extended set of measures of implicit, automatized explicit, and explicit knowledge. The following research questions guided the study:
1. Convergent validity of implicit-statistical learning aptitude: To what extent do different measures of implicit-statistical learning aptitude interrelate?
2. Predictive validity of implicit-statistical learning aptitude: To what extent do measures of implicit-statistical learning aptitude predict three distinct dimensions of linguistic knowledge, referred to as explicit knowledge, automatized explicit knowledge, and implicit knowledge?
METHOD
PARTICIPANTS
Participants were 131 nonnative English speakers (female = 69, male = 51, not reported = 11) who were pursuing academic degrees at a large Midwestern university in the United States. The final sample was obtained after excluding 26 participants who completed only one of the four aptitude tests. Half of the participants were native speakers of Chinese (n = 66). The remaining participants’ L1s included Korean, Spanish, Arabic, Russian, Urdu, Malay, Turkish, and French, among others. The participants’ average length of residence in an English-speaking country was 41 months (SD = 27.21, range 2–200 months). The participants were highly proficient English speakers with an average TOEFL score of 96.00 (SD = 8.80). Their mean age was 24 years (SD = 4.64) and their average age of arrival in the United States was 20 years (SD = 5.68). They received $50 as compensation for their time.
TARGET STRUCTURES
The target structures included six grammatical features: (1) third-person singular -s, (2) mass/count nouns, (3) comparatives, (4) embedded questions, (5) be passive, and (6) verb complement. We selected these three syntactic (4–6) and three morphological (1–3) structures to measure a range of English grammar knowledge (see Table 1 for examples). These structures emerge at different stages of L2 acquisition (e.g., Ellis, 2009) and thus were deemed appropriate to represent English morphosyntax.
Note: Critical region in boldface and underlined.
INSTRUMENTS
We administered nine linguistic tests of L2 grammar knowledge: the WMT, SPR, OP, Timed Aural Grammaticality Judgment Test (TAGJT), Timed Written Grammaticality Judgment Test (TWGJT), EI, Untimed Aural Grammaticality Judgment Test (UAGJT), Untimed Written Grammaticality Judgment Test (UWGJT), and MKT. Based on the previous literature, it was hypothesized that these tests represented either (1) the implicit, automatized explicit, and explicit knowledge constructs (i.e., an extension of the Suzuki and DeKeyser [2017] model) or (2) the implicit and explicit knowledge constructs (i.e., the Ellis [2005] model). We also administered four implicit-statistical learning aptitude tests: VSL, ASL, ASRT, and TOL. Table 2 summarizes the characteristics of the nine linguistic and four aptitude tests.
LINGUISTIC TESTS
Word Monitoring Task
The WMT is a dual processing task that combines listening comprehension and word-monitoring task demands. Participants first saw a content word (e.g., reading), designated as the target for word monitoring. They were instructed to press a button as soon as they heard the word in a spoken sentence (e.g., The old woman enjoys reading many different famous novels). Importantly, the monitor word was always preceded by one of the six linguistic structures in either a grammatical (e.g., enjoys) or ungrammatical (e.g., enjoy) form. Grammatical sensitivity—that is, slower reaction times on content words when the prior word is ungrammatical than when it is grammatical—indicated knowledge of the grammatical target structure.
Self-Paced Reading
In the SPR task, participants read a sentence word-by-word in a self-paced fashion. They progressed to the next word in a sentence by pressing a button. As with the WMT, participants read grammatical and ungrammatical sentences. Evidence for linguistic knowledge was based on grammatical sensitivity—that is, slower reaction times to the ungrammatical version than the matched, grammatical version of the same sentence. In particular, we analyzed reaction times for the spillover region (i.e., the word or words immediately following the critical region) for each sentence and created a difference score for the ungrammatical and grammatical sentences.
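To illustrate how such grammatical sensitivity scores can be derived, here is a minimal R sketch (the data frame spr_rt and its columns participant, grammaticality, and rt are hypothetical names for long-format reaction time data, not the study’s actual variables; grammaticality is assumed to take the values "grammatical" and "ungrammatical"):

library(dplyr)
library(tidyr)

# Mean RT per participant and grammaticality condition,
# then the ungrammatical-minus-grammatical difference score
sensitivity <- spr_rt %>%
  group_by(participant, grammaticality) %>%
  summarise(mean_rt = mean(rt, na.rm = TRUE), .groups = "drop") %>%
  pivot_wider(names_from = grammaticality, values_from = mean_rt) %>%
  mutate(sensitivity = ungrammatical - grammatical)  # positive = slowdown on errors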
Oral Production
The OP task was a speaking test in which participants retold a picture-cued short story that contained multiple tokens of the six target structures. After reading the story two times without a time limit, participants had to retell the story in as much detail as possible in two and a half minutes. The percentage of correct usage of each target structure in all obligatory occasions of use (i.e., obligatory contexts) was used as the dependent variable. Obligatory contexts were defined relative to the participants’ own production. Two coders independently coded the OP data; interrater reliability (Pearson r) was .96.
Elicited Imitation
Similar to the OP task, the EI was a speaking test in which participants were asked to listen to a sentence, judge the semantic plausibility of the sentence, and repeat the sentence in correct English. No explicit instructions directed participants to correct the erroneous part of the sentence. Following Erlam’s (2006) scoring system, correct usage in obligatory contexts was used for analysis.
Grammaticality Judgment Tests
In the GJTs, participants either read or listened to a sentence in a timed or an untimed test condition. The participants were instructed to determine the grammaticality of the sentence. The time limit for each sentence in the timed written and the timed aural GJT was set based on the length of the audio stimuli in the aural GJT: we computed the median audio length of sentences with the same number of words and added 50%. This resulted in time limits for the timed GJTs ranging from 4.12 seconds for a seven-word sentence up to 5.7 seconds for a 14-word sentence. Two sets of sentences were created and counterbalanced for grammaticality, and each set was rotated among the four tests, resulting in eight test versions in total. In each of the four GJTs (timed written, untimed written, timed aural, untimed aural), one point was given per accurate judgment.
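As a sketch of this time-limit computation in R (assuming a hypothetical data frame aural_stimuli with one row per sentence and columns n_words and duration_sec; these names are illustrative):

library(dplyr)

# Median audio duration per sentence length, plus 50%
time_limits <- aural_stimuli %>%
  group_by(n_words) %>%
  summarise(limit_sec = median(duration_sec) * 1.5)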
Metalinguistic Knowledge Test
The MKT required participants to read 12 sentences that contained a grammatical error. Their task was to (1) identify the error, (2) correct the error, and (3) explain in as much detail as possible why it was ungrammatical. We only scored the error correction and explanation parts of the test; as such, a total of two points were given per question. The maximum score was 24 and the total score was converted to a percentage. See Appendix S2 in online Supplementary Materials for the scoring rubric.
IMPLICIT-STATISTICAL LEARNING APTITUDE TESTS
ASRT
The ASRT (Howard & Howard, 1997) was used to measure implicit-statistical learning aptitude. In the ASRT, participants viewed four empty circles in the middle of a computer screen, one of which filled in black at a time. The sequence of the filled circles followed a pattern that alternated with random (nonpatterned) trials, creating a second-order relationship (e.g., 2r4r3r1r, where r denotes a random position). Participants were instructed to press the key on a keyboard that mirrored the position of the filled circle as quickly and accurately as possible. To capture learning, we calculated the change in reaction time to pattern trials from block 1 to block 10 and subtracted the change in reaction time to random trials from block 1 to block 10. Positive values indicate a greater improvement in sequence learning over the course of the task.
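A minimal R sketch of this learning score (the data frame asrt and its columns participant, block [1–10], trial_type ["pattern" or "random"], and rt are hypothetical names for long-format trial data):

library(dplyr)
library(tidyr)

asrt_learning <- asrt %>%
  filter(block %in% c(1, 10)) %>%
  group_by(participant, trial_type, block) %>%
  summarise(mean_rt = mean(rt, na.rm = TRUE), .groups = "drop") %>%
  pivot_wider(names_from = c(trial_type, block), values_from = mean_rt) %>%
  # (pattern speedup) minus (random speedup); positive = sequence-specific learning
  mutate(learning = (pattern_1 - pattern_10) - (random_1 - random_10))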
Auditory Statistical Learning
The ASL (Siegelman et al., 2018, experiment 1b) served as another implicit-statistical learning task. In the ASL test, participants heard 16 nonverbal, familiar sounds that were randomly organized into eight triplets (sequences of three sounds). Four triplets had a transitional probability of 1 (i.e., they were fixed) and four triplets had a transitional probability of .33 (i.e., every sound was followed by one of three other sounds, with equal likelihood). Each triplet was repeated 24 times during a continuous familiarization stream. Participants were asked to listen to the input very carefully because they would be tested on it after the training. The test consisted of 42 trials: 34 four-alternative forced-choice questions measuring recognition of triplets and eight pattern completion trials measuring recall. Performance on the test yielded a percentage accuracy score.
Visual Statistical Learning
The VSL (Siegelman et al., 2017b) was used to measure learners’ ability to learn visual patterns implicitly. As the visual counterpart of the ASL, the VSL presented participants with 16 complex visual shapes that were difficult to describe verbally and were randomly organized into eight triplets (sequences of three shapes). The triplets had a transitional probability of 1. Each triplet was repeated 24 times during the familiarization phase. In the testing phase, participants completed 42 trials: 34 four-alternative forced-choice items measuring recognition of triplets and eight pattern completion trials measuring recall. Performance on the test yielded a percentage accuracy score.
Tower of London
The TOL (Kaller et al., 2011) was administered to measure learners’ implicit-statistical learning ability during nonroutine planning tasks. Participants were presented with two spatial configurations, each consisting of three pegs with colored balls on them. These configurations were labeled “Start” and “Goal.” The participants’ task was to move the colored balls on the pegs in the “Start” configuration to match the “Goal” configuration in the given number of moves. There was a block of four 3-move trials, followed by eight 4-move trials, eight 5-move trials, and eight 6-move trials (Morgan-Short et al., 2014). We present the results for overall solution time in what follows, which is the sum of initial thinking time and movement execution time; all three measures yielded similar results. To capture learning, we calculated a proportional change score for each block of trials (i.e., 3-move, 4-move, 5-move, and 6-move separately) for each participant using the following computation: (RT on the first trial − RT on the final trial)/RT on the first trial. Positive values indicate a greater improvement in planning ability from the beginning to the end of each block.
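The proportional change score can be computed along these lines in R (the data frame tol and its columns are hypothetical; trials are assumed to be numbered in order of presentation within each block):

library(dplyr)

tol_change <- tol %>%
  arrange(participant, block, trial) %>%
  group_by(participant, block) %>%   # block = 3-, 4-, 5-, or 6-move trials
  summarise(change = (first(solution_time) - last(solution_time)) / first(solution_time),
            .groups = "drop")        # positive = faster solutions by the end of the block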
PROCEDURE
Participants met with a trained research assistant for three separate sessions. As seen in Table 3, the first session included the WMT, SPR, timed aural GJT, and untimed aural GJT; the second session started with OP followed by EI, the timed written GJT, untimed written GJT, and MKT; in the last session, participants completed all aptitude tests starting with VSL, and ended with the MLAT 5 (which is not discussed in this article). Sessions 1 and 2 started with the more implicit knowledge measures to minimize the possibility of participants becoming aware of the target features in the implicit tasks.
DATA ANALYSIS
Descriptive Statistics and Correlations
Overall, 6% of the data were missing, and they were missing completely at random (Little’s MCAR test: χ2 = 1642.159, df = 1744, p = .960). To explore the associations among measures of implicit-statistical learning aptitude and between implicit-statistical learning aptitude and linguistic knowledge, respectively, we calculated descriptive statistics and Spearman correlations (abbreviated as rs in what follows) for all measures of L2 morphosyntactic knowledge and cognitive aptitude. All such analyses were carried out in R version 1.2.1335 (R Core Team, 2018).
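A minimal sketch of these preliminary analyses in R, assuming a hypothetical wide-format data frame all_measures with one numeric column per measure and one row per participant (mcar_test() from the naniar package is one available implementation of Little’s test, not necessarily the one used here):

library(naniar)

mcar_test(all_measures)  # Little's MCAR test: chi-square statistic, df, p-value

# Spearman correlations, with pairwise deletion for missing data
cor(all_measures, method = "spearman", use = "pairwise.complete.obs")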
Factor Analysis
To address research question 1 (to what extent do different measures of implicit-statistical learning aptitude interrelate [convergent validity]?), we conducted an EFA to explore the associations between the four implicit-statistical learning aptitude measures. The EFA was performed with an oblique rotation (oblimin), which permits factors to correlate with each other. The model was computed using weighted least squares to account for the violation of the multivariate normality assumption for the four tests (Mardia’s skewness coefficient was 36.93, p = .012; Mardia’s kurtosis coefficient was 2.28, p = .023). Finally, we used a factor loading cutoff criterion of .40 to interpret the factor loadings.
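For illustration, an EFA of this kind can be run with the psych package (a sketch, not the study’s actual script; aptitude is a hypothetical data frame holding the four aptitude scores, and oblimin rotation additionally requires the GPArotation package):

library(psych)

KMO(aptitude)                # Kaiser-Meyer-Olkin sampling adequacy
cortest.bartlett(aptitude)   # Bartlett's test of sphericity

# Weighted least squares extraction with oblique (oblimin) rotation;
# three factors, mirroring the solution reported in the Results section
efa_fit <- fa(aptitude, nfactors = 3, rotate = "oblimin", fm = "wls")
print(efa_fit$loadings, cutoff = .40)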
To address research question 2 (to what extent do measures of implicit-statistical learning aptitude predict three distinct dimensions of linguistic knowledge, that is, explicit knowledge, automatized explicit knowledge, and implicit knowledge [predictive validity]?), we built confirmatory factor analysis (CFA) and SEM models using the lavaan package in R. To examine the psychometric dimensions underlying the nine linguistic tests, we constructed two CFA models, a two-factor and a three-factor model. These models were specified based on theory and previous empirical findings from the CFA studies by Ellis (2005) and Suzuki and DeKeyser (2017). To evaluate the CFA models, we used a model test statistic (chi-square test), standardized residuals (< |1.96|), and three model fit indices (Hu & Bentler, 1999): the comparative fit index (CFI ≥ .96), the root mean square error of approximation (RMSEA ≤ .06), and the standardized root mean square residual (SRMR ≤ .09). We then built an SEM. In combination with a measurement model (CFA), SEM estimates the directional effects of the independent variables (measures of implicit-statistical learning aptitude) on the latent dependent variables (the knowledge type constructs). Full-information maximum likelihood was used to handle missing data when evaluating the different models, and robust maximum likelihood was adopted as the estimation method for both the CFA and SEM analyses to account for the violation of the multivariate normality assumption.
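As a sketch, the two-factor measurement model could be specified in lavaan as follows (indicator names follow the test abbreviations used in this article and the loading pattern follows the Table 2 specification; the data frame tests and the exact variable names are hypothetical, not verbatim code from the study):

library(lavaan)

two_factor <- '
  implicit =~ WMT + SPR + OP + EI + TAGJT + TWGJT   # reaction-time and timed tests
  explicit =~ UAGJT + UWGJT + MKT                   # untimed, accuracy-based tests
'
fit2 <- cfa(two_factor, data = tests, estimator = "MLR", missing = "fiml")
fitMeasures(fit2, c("chisq", "pvalue", "cfi", "rmsea", "srmr", "bic"))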
RESULTS
DESCRIPTIVE STATISTICS
Table 4 shows the descriptive statistics for all linguistic and aptitude measures. Participants showed a wide range of abilities in their performance on the linguistic knowledge measures. Reliabilities of the individual differences measures ranged from satisfactory to high and were generally on a par with those reported in previous studies: ASRT intraclass correlation coefficient (ICC) = .96 (this study) versus ASRT ICC = .99 (Buffington & Morgan-Short, 2018); VSL α = .75 (this study) versus VSL α = .88 (Siegelman et al., 2017b); ASL α = .68 (this study) versus ASL α = .73 (Siegelman et al., 2018, experiment 1b); and TOL ICC = .78 (this study) versus TOL split-half reliability = .59 (Buffington & Morgan-Short, 2018).
Abbreviations: ASL, Auditory Statistical Learning; ASRT, Alternating Serial Reaction Time; EI, elicited imitation; MKT, metalinguistic knowledge test; OP, oral production; SPR, self-paced reading; TAGJT, Timed Aural Grammaticality Judgment Test; TWGJT, Timed Written Grammaticality Judgment Test; TOL, Tower of London; UAGJT, Untimed Aural Grammaticality Judgment Test; UWGJT, Untimed Written Grammaticality Judgment Test; VSL, Visual Statistical Learning; WMT, word monitoring test.
Notes: z, standardized score; Δ, difference score; msec, milliseconds.
a Four items with negative item-total correlation were excluded from reliability analysis and from the final dataset.
b The reliability was computed separately for Random and Pattern trials and both were above .957.
c Pearson r intercoder correlation.
d The reliability of EI is an average score of two versions.
e The reliability scores of the four GJTs are an average score of the structure-level reliability of eight versions.
f Total correct on the EI was rescaled by a factor of .10, yielding a total score out of 2.4.
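Reliability indices of the kind reported in Table 4 can be obtained with the psych package (a sketch; vsl_items and asrt_blocks are hypothetical participant-by-item accuracy and participant-by-block mean RT matrices, not the study’s actual objects):

library(psych)

alpha(vsl_items)   # Cronbach's alpha from item-level accuracy
ICC(asrt_blocks)   # intraclass correlation coefficients across blocks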
RESEARCH QUESTION 1: CONVERGENT VALIDITY OF IMPLICIT-STATISTICAL LEARNING APTITUDE: TO WHAT EXTENT DO DIFFERENT MEASURES OF IMPLICIT-STATISTICAL LEARNING APTITUDE INTERRELATE?
Correlational Analysis Among Aptitude Measures
To examine the unidimensionality of implicit-statistical learning aptitude and the interrelationships between the different aptitude measures, we computed a correlation matrix for the four implicit-statistical learning aptitude measures. Table 5 presents the Spearman correlation matrix of the ASRT, VSL, ASL, and TOL. We note a medium correlation between the VSL and ASL tasks (rs = .492, p < .001). At the same time, correlations of the ASRT and TOL with the other tasks are low (−.146 ≤ rs ≤ .054). These results suggest that the ASL and VSL may tap into a common underlying ability, statistical learning, whereas performance on the other measures of implicit-statistical learning aptitude was essentially unrelated. In sum, the correlation analysis provides initial evidence for the lack of convergent validity of measures of implicit-statistical learning aptitude.
Abbreviations: ASL, Auditory Statistical Learning; ASRT, Alternating Serial Reaction Time; TOL, Tower of London; VSL, Visual Statistical Learning.
Note: ***p < .001.
Exploratory Factor Analysis
As the second and final step in answering research question 1, we conducted an EFA with the same four measures. The Kaiser–Meyer–Olkin (KMO) measure suggested that, at the group level, the sampling adequacy for the analysis was close to the minimum KMO of .50 (KMO = .49). At the individual test level, most tests were near the .50 cutoff point (ASRT = .52; VSL = .49; ASL = .49), with the TOL falling a bit short (.43). Despite the low KMO, we retained all measures in the analysis because they were theoretically motivated. Bartlett’s test of sphericity, χ²(6) = 31.367, p < .001, indicated that the correlations between tests were sufficiently large for an EFA. Using an eigenvalue cutoff of 1.0 as a guideline, we extracted three factors that explained a cumulative 72% of the variance (the third factor accounted for a substantial increase in explained variance, namely 22%, and was thus included even though its eigenvalue fell slightly short of 1.0). Table 6 details the factor loadings after rotation, using a factor loading criterion of .40. As can be seen in Table 6, factor 1 represents motor sequence learning (ASRT), factor 2 represents procedural memory (TOL), and the last factor represents statistical learning, with the VSL and ASL loading together.
Abbreviations: ASL, Auditory Statistical Learning; ASRT, Alternating Serial Reaction Time; TOL, Tower of London; VSL, Visual Statistical Learning.
Note: Bold values indicate loadings above 0.40.
RESEARCH QUESTION 2: PREDICTIVE VALIDITY OF IMPLICIT-STATISTICAL LEARNING APTITUDE: TO WHAT EXTENT DO MEASURES OF IMPLICIT-STATISTICAL LEARNING APTITUDE PREDICT EXPLICIT, AUTOMATIZED EXPLICIT, AND IMPLICIT KNOWLEDGE?
Confirmatory Factor Analysis
To address the second research question, we first constructed measurement models as part of the SEM to examine the number of dimensions in the nine linguistic tests. As seen in Table 2, we specified two CFA models based on SLA theory: a two-factor model distinguishing implicit versus explicit knowledge (Ellis, 2005) and a three-factor model distinguishing implicit versus automatized explicit versus explicit knowledge (an extension of Suzuki & DeKeyser, 2017). The models differed critically with regard to whether the reaction-time tasks (WMT, SPR) and the timed, accuracy-based measures (OP, EI, TAGJT, TWGJT) loaded onto the same factor, “implicit knowledge,” in the two-factor solution, or onto different factors, “implicit knowledge” and “automatized explicit knowledge,” in the three-factor solution (see Table 2).
The summary of the fit indices for the measurement models in Table 7 suggests that both models fit the data well, meeting the general guidelines of Hu and Bentler (1999). At the same time, the two-factor model demonstrates a better fit than the three-factor model, with a Bayesian information criterion (BIC) value smaller than that of the three-factor model (a ∆BIC between 2 and 6 denotes a positive difference in favor of the model with the lower BIC; see Kass & Raftery, 1995).
Abbreviations: CFI, comparative fit index; RMSEA, root mean square error of approximation; SRMR, standardized root mean square residual.
Correlational Analysis of Aptitude Measures and Knowledge Measures
Before running the SEM, we first explored the correlations between implicit-statistical learning aptitude and linguistic knowledge measures. Figure 1 contains Spearman correlation coefficients (above the diagonal) of the 13 variables, scatterplots for variable pairs (below the diagonal), and density plots for each variable (on the diagonal). The results suggest that ASRT correlated significantly and positively with the WMT (rs = .335, p = .002) and the TWGJT (rs = .229, p = .024). In contrast, VSL (rs = −.341, p = .001) and, to a lesser extent, ASL (rs = −.184, p = .095) correlated negatively with the WMT. TOL did not correlate significantly with any of the linguistic knowledge measures (−.128 ≤ rs ≤ .069).
Structural Equation Model
As the final step in answering research question 2, we added the structural model to the measurement model to examine aptitude-knowledge relationships. In light of the EFA findings, in which the VSL and ASL clustered into a single factor, we built a latent predictor variable called statistical learning (SL), which combined the ASL and VSL. Consequently, we retained three measures of implicit-statistical learning aptitude (SL, TOL, and ASRT) and treated these as predictor variables of the different knowledge constructs to examine the aptitude-knowledge relationships. In the measurement model, we allowed the different knowledge constructs (i.e., explicit, automatized explicit, and implicit knowledge) to correlate because they represent different subcomponents of language proficiency and can thus be assumed to be related. Figures 2 and 3 show the results of the analyses.
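A sketch of the resulting two-factor structural model in lavaan (same hypothetical indicator names and data frame as in the earlier CFA sketch; the residual covariance between the knowledge factors is written out for clarity, although lavaan frees it by default):

library(lavaan)

sem_model <- '
  # Measurement model
  implicit =~ WMT + SPR + OP + EI + TAGJT + TWGJT
  explicit =~ UAGJT + UWGJT + MKT
  SL       =~ ASL + VSL          # latent statistical learning predictor

  # Structural paths: aptitude predicting knowledge
  implicit ~ SL + ASRT + TOL
  explicit ~ SL + ASRT + TOL

  implicit ~~ explicit           # knowledge constructs allowed to correlate
'
fit_sem <- sem(sem_model, data = tests, estimator = "MLR", missing = "fiml")
summary(fit_sem, standardized = TRUE, fit.measures = TRUE)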
Table 8 details the model fit indices for the two-factor and three-factor SEM models. Two of the four global fit indices, namely the chi-square test and the CFI, fell short of the cutoff points proposed by Hu and Bentler (1999); the SRMR was slightly above the .09 threshold. To diagnose any sources of model misspecification, we inspected the modification indices and standardized residuals. In the two-factor SEM model, two modification indices were larger than 3.84, signaling localized areas of potential ill fit. Both modification indices concerned the WMT, which had a low factor loading onto implicit knowledge. The modifications were not implemented, however, as they lacked theoretical justification (i.e., one recommended the WMT as an explicit measure [MI = 4.89] and the other suggested the WMT as an SL measure [MI = 4.65]). No standardized residual for any of the indicators was greater than |1.96| (largest = 1.73). In the three-factor model, 12 modification indices were larger than 3.84. Based on this information, we modified the model by adding a method effect (error covariance) between EI and OP to account for the fact that EI and OP are both production tasks. The other modification indices lacked a theoretical or methodological underpinning and, hence, were not pursued further. As detailed in Table 8, adding the error covariance changed the global fit of the modified three-factor model mostly positively (chi-square p value: .02 → .03; CFI: .843 → .863; lower bound RMSEA: .028 → .019) but also negatively (SRMR: .094 → .095). No standardized residual for any of the variables was greater than |1.96|; however, the standardized residual for the WMT-ASRT covariance was slightly above the threshold (standardized residual = 1.97), indicating an area of local strain.
Abbreviations: CFI, comparative fit index; RMSEA, root mean square error of approximation; SRMR, standardized root mean square residual.
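The misfit diagnostics described in the preceding paragraph can be inspected with lavaan’s built-in utilities (a sketch; fit_sem is the fitted model object from the earlier SEM sketch):

# Modification indices above the chi-square critical value of 3.84
modindices(fit_sem, minimum.value = 3.84)

# Standardized residuals; values beyond |1.96| flag areas of local strain
residuals(fit_sem, type = "standardized")

# The method effect (error covariance) between the two production tasks
# corresponds to adding the line  EI ~~ OP  to the model syntax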
Taken together, the two-factor model exhibited a better local fit than the three-factor model, which suggested that it represented our data best. Global fit indices were somewhat low, possibly due to sample size limitations, but importantly, the underlying measurement models (CFA) demonstrated a good fit (see Table 7). As such, we proceeded to interpret the parameter estimates of the two-factor model.
Table 9 and Figure 2 detail the parameter estimates for the two-factor SEM model. As seen in Table 9, the regression path from the ASRT to implicit knowledge was significant, r = .258, p = .007. None of the other aptitude measures significantly predicted implicit or explicit knowledge in the model.
Abbreviations: ASRT, alternating serial reaction time; SL, statistical learning; TOL, Tower of London.
DISCUSSION
SUMMARY OF RESULTS
We aimed to contribute to the theorization of implicit-statistical learning aptitude as an individual differences variable that may be of special importance for attaining advanced L2 proficiency (Linck et al., 2013). To measure implicit-statistical learning aptitude more comprehensively, we added two new measures—the ASL and VSL (Siegelman et al., 2017b, 2018)—to the better-known measures of the ASRT and TOL. Overall, only the ASL and VSL showed a medium-strong correlation (r = .49) and loaded onto the same factor, whereas the remaining measures were uncorrelated (RQ1). This underlines that implicit-statistical learning aptitude is a multidimensional, multifaceted construct and that input modality is an important facet of the construct. A multitest approach, measuring aptitude in different input streams and task conditions, is best suited to ensure its predictive validity for language learning.
Given the theoretical importance of implicit-statistical learning aptitude, we also examined its predictive validity for implicit language knowledge, using a battery of nine L2 grammar tests. The final SEM regressed a two-factor measurement model (explicit and implicit knowledge) on three aptitude measures. We found that only the ASRT predicted implicit knowledge, which was a latent variable composed of timed, accuracy-based measures and reaction-time tasks. These results inform ongoing debates about the nature of implicit knowledge in SLA (Ellis, 2005; Suzuki & DeKeyser, 2017) and do not lend support to the view that reaction time measures are inherently superior for measuring L2 speakers’ implicit knowledge (Suzuki & DeKeyser, 2015; Vafaee et al., 2017).
MULTIDIMENSIONAL NATURE OF IMPLICIT-STATISTICAL LEARNING APTITUDE (RQ1)
Research on implicit-statistical learning aptitude can be traced back to different research traditions within cognitive and developmental psychology (Christiansen, 2019). The domain-general mechanisms that enable implicit-statistical learning have been linked to a range of different linguistic behaviors—from speech segmentation and vocabulary acquisition to syntactic processing and literacy development (see Armstrong et al., 2017, and Monaghan & Rebuschat, 2019, for recent theoretical discussions). Given the explanatory power of implicit-statistical learning aptitude in language research, we first examined the convergent validity of different measures used to assess learners’ aptitude.
The results of our EFA did not support the unidimensionality of the different implicit-statistical learning aptitude measures (see Table 6). At a descriptive level, bivariate correlations between the different aptitude measures were close to 0, with the exception of the ASL and VSL, which showed a .49 correlation. Correspondingly, in the EFA, the three-factor solution indicated that the battery of aptitude tests does not represent a unitary construct of implicit-statistical learning aptitude. Three factors were extracted (proportion of variance explained: factor 1 [ASRT] = .25; factor 2 [TOL] = .24; factor 3 [ASL and VSL] = .22), which together accounted for 72% of the total variance.
The medium-strength correlation between the measures of statistical learning replicated Siegelman et al. (2018, experiment 2), who reported a .55 correlation between the ASL and VSL. The ASL and VSL are similar in terms of the nature of the embedded statistical regularity, the length of training, and the way statistical learning is assessed (Siegelman et al., 2017a, 2018). Given that the tests are similar in all respects other than their input modality, these measures jointly offer a relatively pure test of the role of input modality in statistical learning. The results of the EFA showed that a common underlying ability, statistical learning, accounted for approximately 22% of the variance in participants’ ASL and VSL performance, while differences in input modality accounted for some portion of the remaining 78% of the variance. Input modality is therefore likely to be an important source of individual differences in statistical learning (Frost et al., 2015). These modality-based differences in statistical learning aptitude are relevant to adult L2 learners insofar as learners experience a mix of written and spoken input that may shift according to their instructed or naturalistic learning environments. For instance, Kim and Godfroid (2019, experiment 2) reported an advantage for visual over auditory input in the L2 acquisition of implicit knowledge of syntax by college-educated adults. While the results of correlation research are best interpreted cumulatively, across different research studies, the medium-strong ASL-VSL correlation in the present study is consistent with the view (Arciuli, 2017; Frost et al., 2015; Siegelman et al., 2017b) that statistical learning is a domain-general process that is not uniform across modalities.
Seen in this light, it is interesting that the other assessment of statistical learning in the visual modality, the ASRT, showed no correlation with the VSL (see Table 5). Both tests use nonverbal material to assess an individual’s ability to extract transitional probabilities from visual input. The ASRT has an added motor component, which may have contributed to the lack of convergence between the two measures. Additionally, the VSL and ASRT may not have correlated because of when learning was assessed. Learning on the ASRT was tracked online, during the task, as a reaction time improvement (speedup) over training. In the ASL and VSL, however, assessment of learning took place offline, in a separate multiple-choice test that came after the training phase. It has been argued that the conscious reflection involved in offline tasks may confound the largely implicit learning that characterizes statistical learning (Christiansen, 2019). Online measures of implicit-statistical learning such as the ASRT, however, may be able to capture learning with a higher resolution (Siegelman et al., 2017b) and a better signal-to-noise ratio. Although more research is needed to evaluate these claims, our results support the superiority of online measurement. Using structural equation modeling, we confirmed the predictive validity of the ASRT for predicting implicit grammar knowledge in a sample of advanced L2 speakers (see the next section on RQ2 for further discussion). Conversely, neither the VSL nor the ASL had predictive validity for L2 implicit grammar knowledge in this study, potentially because the two measures of statistical learning allowed for participants’ conscious involvement on the posttests. To investigate this result in more depth, researchers could reexamine the predictive validity of the ASL and VSL for specific grammar structures in our test battery, such as embedded questions or third-person -s, which contain a clear, patterned regularity that lends itself well to statistical learning.
Lastly, the ASL, VSL, and ASRT were unrelated to the TOL. The TOL task has its origins in research on planning and executive function (Shallice, Reference Shallice1982) and was used in a modified form, as a measure of cognitive skill learning, in Ouellet et al. (Reference Ouellet, Beauchamp, Owen and Doyon2004). Because the TOL measures the effects of practice, it can be regarded as a measure of skill acquisition (Ouellet et al., Reference Ouellet, Beauchamp, Owen and Doyon2004); it is also assumed to reflect procedural learning (Ouellet et al., Reference Ouellet, Beauchamp, Owen and Doyon2004) and to provide a measure of individual differences in procedural memory ability (e.g., Antoniou et al., Reference Antoniou, Ettlinger and Wong2016; Buffington & Morgan-Short, Reference Buffington, Morgan-Short, Kalish, Rau, Zhu and Rogers2018; Buffington et al., Reference Buffington, Demos and Morgan-Short2021; Ettlinger et al., Reference Ettlinger, Bradlow and Wong2014; Morgan-Short et al., Reference Morgan-Short, Faretta-Stutenberg, Brill-Schuetz, Carpenter and Wong2014). The contributions of procedural memory to implicit-statistical learning are complex (Batterink et al., Reference Batterink, Paller and Reber2019; Williams, Reference Williams2020). Batterink et al. (Reference Batterink, Paller and Reber2019) reported that “a common theme that emerges across implicit learning and statistical learning paradigms is that there is frequently interaction or competition between the declarative and nondeclarative [e.g., procedural] memory systems of the brain…. Even in paradigms that have been specifically designed to isolate ‘implicit learning’ per se, healthy learners completing these tasks may show behavioral evidence of having acquired both declarative and nondeclarative memory” (p. 485, our addition in brackets). This interaction between declarative and nondeclarative memory in implicit learning tasks could explain the lack of convergent validity between the TOL and the other measures of implicit-statistical learning aptitude; that is, measures of implicit-statistical learning may draw on multiple memory systems including, but not limited to, procedural memory. Our results are consistent with Buffington and Morgan-Short (Reference Buffington, Morgan-Short, Kalish, Rau, Zhu and Rogers2018) and Buffington et al. (Reference Buffington, Demos and Morgan-Short2021), who also reported a lack of correlation between the ASRT and TOL in two samples of university-level research participants (r = −.03, n = 27 and r = .03, n = 99).
The TOL does not involve patterned stimuli like the other three measures in this study; it focuses instead on an individual’s improvement (accuracy gains or speed-up) in solving spatial problems as a result of practice. The lack of predictive validity for implicit knowledge in advanced L2 speakers creates a need for further research into the learning processes and memory systems engaged by the TOL. The TOL clearly measures practice effects, but our results, together with those of Buffington and colleagues (Reference Buffington, Demos and Morgan-Short2021), do not support the claim that such practice effects reflect an individual’s procedural memory learning ability. Further research into the construct validity of the TOL will therefore be necessary. To facilitate future validation efforts, it would be helpful to standardize the use of the TOL task in L2 research. Multiple task versions (with and without repeating trials, and scored by accuracy versus reaction time) are currently used in parallel in SLA, which makes it difficult to compare results across studies (compare Antoniou et al., Reference Antoniou, Ettlinger and Wong2016; Buffington & Morgan-Short, Reference Buffington, Morgan-Short, Kalish, Rau, Zhu and Rogers2018; Ettlinger et al., Reference Ettlinger, Bradlow and Wong2014, who used a task version with repeating trials, with Morgan-Short et al., Reference Morgan-Short, Faretta-Stutenberg, Brill-Schuetz, Carpenter and Wong2014; Pili-Moss et al., Reference Pili-Moss, Brill–Schuetz, Faretta-Stutenberg and Morgan-Short2019; Suzuki, Reference Suzuki2017, who used a nonrepeating version of the task). On this point, Kaller and colleagues (Reference Kaller, Debelak, Köstering, Egle, Rahm, Wild, Blettner, Beutel and Unterrainer2016) published the TOL-F, an accuracy-based version of the TOL with improved psychometric properties; it is still new in L2 research but could be of great value for achieving greater standardization in the field.
On balance, our results suggest that the findings for implicit-statistical learning aptitude do not generalize beyond the measure with which they were obtained. Future researchers will therefore need to continue treating different tests of implicit-statistical learning aptitude as noninterchangeable. For maximum generalizability, it will be important to continue using a multitest approach as exemplified in the present study. Including multiple tests of implicit-statistical learning aptitude will ensure proper representation of the substantive domain and may help researchers steer clear of confirmation bias. Over time, it will also enable researchers to refine their understanding of the different dimensions of implicit-statistical learning aptitude (Siegelman et al., Reference Siegelman, Bogaerts, Christiansen and Frost2017a) and come to a more nuanced understanding of these dimensions’ roles, or nonroles, in different L2 learning environments, for learners of different ages and education levels, and with different target structures. Our call for a multitest approach echoes common practice in explicit learning aptitude research, where researchers routinely administer a battery of different tests to language learners to measure their aptitudes (see Kalra et al., Reference Kalra, Gabrieli and Finn2019; Li, Reference Li2015, Reference Li2016).
ONLY TIMED, ACCURACY-BASED TESTS SUPPORTED AS MEASURES OF IMPLICIT KNOWLEDGE (RQ2)
This study was conducted against the background of an ongoing debate about how best to measure L2 learners’ implicit knowledge. Measures of implicit-statistical learning aptitude can inform the construct validity of different tests—timed, accuracy-based tests and reaction time tasks—by revealing associations between aptitude and these putative measures of implicit knowledge (DeKeyser, Reference DeKeyser2012; Granena, Reference Granena2013). The results of this study support the predictive validity of implicit-statistical learning aptitude (ASRT) for performance on timed language tests, affirming the validity of timed, accuracy-based tests as measures of implicit knowledge (Ellis, Reference Ellis2005). Similar support for the validity of reaction-time–based tests was lacking (cf. Suzuki & DeKeyser, Reference Suzuki and DeKeyser2017), underscoring that our understanding of reaction-time measures of linguistic knowledge is still at an early stage.
We find these results intriguing. The two reaction-time tasks in the study, the WMT and SPR, rely on the same mechanism of grammatical sensitivity (i.e., slower responses to ungrammatical than to grammatical sentences) to capture an individual’s linguistic knowledge. It has been assumed, often without much challenge, that grammatical sensitivity on reaction-time tests operates outside participants’ awareness, and hence may represent participants’ linguistic competence, or implicit knowledge (for a critical discussion of this assumption, see Godfroid, Reference Godfroid2020; Marsden et al., Reference Marsden, Thompson and Plonsky2018). Yet in spite of the underlying similarity between the two tasks, performance on the SPR and the WMT correlated weakly, rs = .178, p = .098 (see Figure 1), and the two tasks loaded poorly onto the implicit knowledge factor in the CFA/SEM analysis (SPR, Std. Est. = 0.225; WMT, Std. Est. = 0.054). This indicates that current models of L2 linguistic knowledge do not account well for participants’ performance on reaction-time tasks.
The construct validity of reaction-time measures of linguistic knowledge cannot be separated from instrument reliability. Compared to the accuracy-based tasks in the study, learners’ performance on the WMT and SPR (the two reaction-time tasks) was somewhat less reliable (see Table 4 for a comprehensive overview of the validity and reliability of the nine linguistic measures). This has been a fairly consistent observation for reaction-time measures, and in particular for reaction-time difference measures used in individual differences research (e.g., Draheim et al., Reference Draheim, Mashburn, Martin and Engle2019; Hedge et al., Reference Hedge, Powell and Sumner2018; Rouder & Haaf, Reference Rouder and Haaf2019), such as the grammatical sensitivity scores calculated for the SPR and WMT in this study. Draheim et al. (Reference Draheim, Mashburn, Martin and Engle2019) pointed out that researchers who work with reaction-time difference measures often see one task “dominate” a factor, with other measures loading poorly onto the same factor. This is exactly what happened in the three-factor SEM model, where the implicit knowledge factor accounted perfectly for participants’ SPR performance but did not explain much variance in WMT scores. The three-factor model was abandoned for a simpler, two-factor SEM model, but that model did not account well for either reaction-time measure (see Figure 2 and Appendix S3 in the online Supplementary Materials). These results suggest that reaction-time tests of linguistic knowledge are not a homogeneous whole (either inherently or because of a lack of internal consistency), in spite of their shared methodological features. Given the current state of affairs, therefore, claims about construct validity ought to be made at the level of individual tests, for instance the WMT or SPR separately, rather than for reaction-time measures as a whole.
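To illustrate why difference scores can be unreliable, the sketch below simulates item-level reaction times, computes a per-participant grammatical sensitivity score (ungrammatical minus grammatical RT), and estimates its Spearman-Brown–corrected split-half reliability. All sample sizes, RT distributions, and effect magnitudes are hypothetical assumptions chosen for illustration; this is not the study’s scoring code.

```python
# A minimal sketch, assuming simulated data: when trial noise is large
# relative to true individual differences, RT difference scores show
# low split-half reliability.
import numpy as np

rng = np.random.default_rng(3)
n_participants, n_items = 88, 40

# Hypothetical true individual differences and baseline speeds (ms):
true_sensitivity = rng.normal(30, 25, size=(n_participants, 1))
baseline = rng.normal(600, 100, size=(n_participants, 1))

# Item-level RTs with independent trial noise in each condition:
gram = baseline + rng.normal(0, 120, size=(n_participants, n_items))
ungram = baseline + true_sensitivity + rng.normal(0, 120, size=(n_participants, n_items))

# Per-participant grammatical sensitivity (difference) score:
sensitivity = ungram.mean(axis=1) - gram.mean(axis=1)
print(f"Mean sensitivity score: {sensitivity.mean():.1f} ms")

# Split-half reliability (odd vs. even item pairs), Spearman-Brown corrected:
diff = ungram - gram
odd, even = diff[:, ::2].mean(axis=1), diff[:, 1::2].mean(axis=1)
r = np.corrcoef(odd, even)[0, 1]
print(f"Split-half reliability of the difference score: {2 * r / (1 + r):.2f}")
```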
To illustrate, we performed a post-hoc correlation analysis of the ASRT with WMT and SPR separately. We found that the ASRT correlated significantly and positively with the WMT (Spearman rank, rs = .335, p = .002), mirroring the global result for implicit knowledge (i.e., the latent variable, which was also predicted by the ASRT). SPR did not correlate with the ASRT (rs = −.027, p = .804) or with other measures of implicit-statistical learning aptitude. These results suggest that at the individual-test level, the WMT has some characteristics of a measure of implicit knowledge, consistent with earlier findings from Granena (Reference Granena2013) and Suzuki and DeKeyser (Reference Suzuki and DeKeyser2015). No such evidence for SPR was obtained in this study.
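As a pointer for readers who wish to run comparable post-hoc checks, the sketch below shows the Spearman rank correlation step with scipy, using simulated score arrays in place of our data; the variable names and effect sizes are hypothetical.

```python
# A minimal sketch of the post-hoc correlations; only the spearmanr call
# mirrors the analysis reported above, and all scores are simulated.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(11)
asrt = rng.normal(size=88)                 # hypothetical ASRT learning scores
wmt = 0.35 * asrt + rng.normal(size=88)    # hypothetical WMT sensitivity scores
spr = rng.normal(size=88)                  # hypothetical SPR sensitivity scores

for name, scores in (("WMT", wmt), ("SPR", spr)):
    rho, p = spearmanr(asrt, scores)
    print(f"ASRT x {name}: rs = {rho:.3f}, p = {p:.3f}")
```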
Last but not least, our results revealed a significant association between implicit-statistical learning aptitude (the ASRT) and a latent factor that comprised four timed, accuracy-based tests (TWGJT, TAGJT, EI, OP). This finding supported the validity of these measures as implicit knowledge tests (Ellis, Reference Ellis2005). Successful performance on the timed, accuracy-based measures requires fast and accurate processing of targeted grammatical knowledge. The ASRT, by contrast, is an entirely nonlinguistic (nonverbal) task that requires fast and accurate motor responses from participants. To obtain a high aptitude score on the ASRT, participants need to speed up over time as they induce the repeating patterns in the motor sequence. One possible account of the ASRT-implicit knowledge relationship, therefore, is that both measures rely on participants’ procedural memory (see also Buffington et al., Reference Buffington, Demos and Morgan-Short2021). On this account, the ASRT derives its validity as a predictor of implicit knowledge from tapping into the same neural substrate as implicit knowledge of language, namely procedural memory. Like procedural memory representations, implicit knowledge takes time to develop. This may explain why, in previous studies as in the present one, the SRT and ASRT predicted performance in proficient or near-native L2 learners (Granena, Reference Granena2013; Linck et al., Reference Linck, Hughes, Campbell, Silbert, Tare, Jackson, Smith, Bunting and Doughty2013; Suzuki & DeKeyser, Reference Suzuki and DeKeyser2015; but see Suzuki & DeKeyser, Reference Suzuki and DeKeyser2017; Tagarelli et al., Reference Tagarelli, Ruiz, Vega and Rebuschat2016) or predicted collocational knowledge in L1 speakers but not L2 speakers (Yi, Reference Yi2018). For researchers who lack the resources to include multiple measures of implicit-statistical learning, the SRT or ASRT may thus be the best single-test option for gaining insight into the nature of learner processes or linguistic outcomes (see also Kaufman et al., Reference Kaufman, DeYoung, Gray, Jiménez, Brown and Mackintosh2010, who referred to the SRT as “the best measure of implicit learning currently available,” p. 325).
CONCLUSION
We examined the contributions of implicit-statistical learning aptitude to implicit L2 grammar knowledge. Our results are part of an ongoing, interdisciplinary research effort designed to uncover the role of domain-general mechanisms in first and second language acquisition. Implicit-statistical learning aptitude was found to differ along multiple dimensions, suggesting a need for caution when generalizing results from a specific test (e.g., the ASRT) to the larger theoretical constructs of implicit learning, statistical learning, and procedural memory: results may be specific to the test with which they were obtained, and these theoretical constructs may not be unitary in nature.
We also adduced support for the validity of timed, accuracy-based knowledge tests (i.e., OP, EI, timed auditory/written GJTs) as measures of implicit knowledge, supporting their use in the language classroom, language assessment, and lab-based language research to assess implicit grammar knowledge. Reaction time measures (i.e., SPR, word monitoring) currently do not enjoy the same level of validity evidence, in spite of their widespread use in lab-based research.
Despite its contributions, this study had some limitations that must be considered when interpreting the results. First, our participants were highly heterogeneous in their L1s, language learning contexts, and length of residence in an English-speaking country. Nearly half of our participants were L1 speakers of Chinese, who may have had an uneven profile of explicit and implicit knowledge. Differences in L1 background could also invite transfer effects (both positive and negative) across the tasks and structures. Second, this study would have benefited from a larger sample size, both for the EFA and the SEM. Lastly, it will be crucial to establish good test-retest reliability for the different measures of implicit-statistical learning aptitude in future research (see Kalra et al., Reference Kalra, Gabrieli and Finn2019; Siegelman & Frost, Reference Siegelman and Frost2015) to show that these aptitude measures can serve as stable individual differences measures that preserve rank order between individuals over time.
Nonetheless, the results of this study help reconcile different theoretical positions regarding the measurement of L2 implicit knowledge by affirming the validity of timed, accuracy-based tests. They also point to the validity and reliability of reaction-time measures as an important area for future research. We warmly invite other researchers to advance this research agenda and hope that the test battery developed for this project will contribute to this goal.
Supplementary Materials
To view supplementary material for this article, please visit http://dx.doi.org/10.1017/S0272263121000085.