Introduction
Previous research investigating second language (L2) learners’ oral production within task-based language teaching (TBLT) has provided empirical support for the tenets of Robinson’s (2001a, 2011) cognition hypothesis, namely that increasing cognitive task demands along resource-directing variables (e.g., causal reasoning) enhances learners’ attention to linguistic form and may therefore result in more complex language production and increased lexical and grammatical accuracy, often at the expense of speaking fluency (e.g., Gilabert, 2007; Ishikawa, 2008). Such complexity manipulations are presumed to have a positive impact on L2 learners’ linguistic performance. An alternative hypothesis, namely Skehan’s (2009, 2015) limited attentional capacity model (LACM), suggests that the complexity and accuracy dimensions of oral production might compete for learners’ limited attentional resources, and may not both be attended to, resulting in complexity–accuracy trade-offs (e.g., Michel, Révész, Shi & Li, 2019; Sample & Michel, 2015). However, neither of these hypotheses has been sufficiently investigated in relation to L2 pronunciation. Pronunciation is, in fact, underrepresented in TBLT research (Gurzynski-Weiss, Long & Solon, 2017), which has primarily focused on the conceptualization and formulation stages of speech production, neglecting the phonological and phonetic aspects of pronunciation, which can also be influenced by task complexity.
Positive effects of increased task complexity have been found for speech comprehensibility but not for accentedness (e.g., Crowther, Trofimovich, Saito & Isaacs, 2018; Gordon, 2021), and for L2 pronunciation accuracy for a subset of L2 vowels (Mora-Plaza, 2023; Solon, Long & Gurzynski-Weiss, 2017). However, strong empirical evidence of the benefits of manipulating task complexity for enhancing attention to phonetic form is still lacking, whereas such benefits are well attested for lexical (Gilabert et al., 2009), grammatical (Révész, 2009) and pragmatic (Márquez & Barón, 2021) form. According to Kormos’ (1999, 2000) attention and monitoring model of speech processing, conceptualizing the message during online tasks may necessitate particular attention, leaving few attentional resources for lexical, semantic, and phonological encoding. In the case of L2 communicative tasks, learners may be forced to focus on lexical and grammatical aspects during speech production, making it difficult for them to pay attention to pronunciation due to the limited attentional resources available to them during self-monitoring (Kormos, 1999). In fact, lexical and grammatical self-repairs have been found to outnumber phonological repairs in purely meaning-oriented tasks (Kormos, 2000). Most repairs in meaning-oriented tasks are lexical because lexical items carry most of the relevant information in the message, and lexical errors may result in serious misunderstandings.
The present study aims to extend this line of research by investigating task complexity effects on L2 pronunciation accuracy in pronunciation-unfocused tasks. Pronunciation is an important component of language competence, affecting comprehensibility (i.e., listeners’ ease of understanding) and facilitating effective communication. In addition, investigating task effects on L2 pronunciation accuracy will provide insights into the role of speaking tasks and task design in fostering learners’ pronunciation skills. Oral productions elicited from first language (L1)–Spanish advanced learners of English performing simple and complex versions of a problem-solving monologic speaking task were analyzed acoustically to obtain voice onset time (VOT) measures of laryngeal timing accuracy for L2 voiceless oral stops (/p/, /t/, /k/), and measures of contrastiveness and nativelikeness for difficult L2 vowels (/iː/, /ɪ/, /æ/, /ʌ/). In addition, native English listeners’ (NL) judgments of comprehensibility and accentedness were obtained as global measures of L2 pronunciation accuracy.
TBLT: task design and manipulation
TBLT is an analytic approach to language acquisition in which learners are presented with holistic samples of language, which they are expected to analyze in order to infer the underlying rules by themselves. In such a process, directing learners’ attention toward accuracy while maintaining the communicative value of tasks is central to language development (Long, 2015). The use of a wide variety of pedagogical procedures to draw learners’ attention to linguistic form (see Sudharshana, 2021, for a review) would enhance learners’ ability to refine and restructure their interlanguage.
In TBLT, tasks are conceived as real-world communicative activities requiring learners’ use of language (Van den Branden, 2006), that is, as a meaning-driven work plan that learners have to accomplish by relying on their own linguistic and non-linguistic resources (Ellis, 2009). Tasks can be categorized as unfocused, aiming to offer learners opportunities for general communicative language use, or focused, intending to provide opportunities for communication using specific linguistic features (Ellis, 2009). Focused tasks have often been found to effectively direct learners’ attention to the target linguistic features. Manipulations of task design variables include task type (e.g., narrative/instruction-giving/decision-making; Gilabert et al., 2009), interlocutor proficiency (e.g., low/high; Kim & McDonough, 2008), task mode (e.g., online/face-to-face; Baralt, Gurzynski-Weiss & Kim, 2016), and task complexity (e.g., simple/complex; Révész, 2009). Empirical research has found that such task manipulations can influence the linguistic complexity, accuracy, and/or fluency (CAF) of learners’ oral performance and the development of L2 linguistic accuracy. In particular, TBLT is considered an effective methodology for developing lexico-grammatical (Baralt et al., 2016) and pragmatic (Márquez & Barón, 2021) targets.
Nevertheless, very little attention has been paid to how unfocused communicative tasks might affect L2 pronunciation and to what extent task design and manipulation (i.e., task complexity) can effectively direct learners’ attention to linguistic targets beyond grammar, lexis, and pragmatics, such as pronunciation (Gurzynski-Weiss et al., 2017).
Task complexity and CAF in oral performance
With the aim of grading and sequencing tasks in a principled way in a task-based syllabus, L2 researchers have proposed criteria for evaluating the complexity of a task, grounded in theoretical frameworks, and have conducted empirical studies to investigate whether the effects of task complexity on L2 production are as predicted by those theories. First, Skehan’s (2009, 2015) LACM, founded on theories of working memory and speech production, conceptualizes attention as a single, limited-capacity pool of resources (Kahneman, 1973). Given that human attentional resources are limited, Skehan argues that attention can only be allocated to certain aspects of performance to the detriment of others. Therefore, when task demands increase, learners first allocate attentional resources to the content of the task (i.e., fluency), and what remains is assigned to linguistic form (i.e., complexity and accuracy). If the content demands are extremely high, complexity and accuracy may compete for attention, and one may have a negative impact on the other (e.g., Michel et al., 2019; Sample & Michel, 2015). Skehan’s (2009, 2015) model is in accordance with Kormos’ (1999, 2000) conceptualization of the role of attention in self-monitoring, in that both suggest that L2 production stages (i.e., conceptualization, formulation) may compete for cognitive resources, generating a potential trade-off between complexity and accuracy measures of L2 oral performance. Additionally, Kormos postulated that attentional limitations could limit the number and type of errors (e.g., lexical, grammatical, phonetic) noticed by the speaker and available for self-monitoring.
Skehan suggested three factors contributing to the difficulty of the task, namely code complexity, cognitive complexity, and communicative stress, in addition to other learner factors. Nevertheless, his model was unable to explain the phenomenon of dual-task performance and divided attention, nor was it concerned with how tasks should be sequenced to promote L2 learning outside the foreign language classroom (Robinson, 2011).
An alternative strand of TBLT research attempting to manipulate learners’ attention is the work within the cognition hypothesis (Robinson, 2011), grounded in information-processing theories, interactionist research, and psychological models such as Wickens’ (1989) model of dual-task performance. Robinson’s (2011) cognition hypothesis claims that learners can simultaneously access multiple, non-competing resource pools, and predicts that increasing the cognitive demands of a task is likely to direct attentional and memory resources to linguistic features and therefore lead to greater L2 grammatical and lexical accuracy and complexity, as long as learners draw from different pools of attentional resources. In order to identify specific task factors that should be manipulated to make tasks more or less cognitively demanding, the triadic componential framework (Robinson, 2001a; Robinson & Gilabert, 2007) distinguishes resource-directing from resource-dispersing dimensions. The former refers to those in which the demands on language use can be met by manipulating the manner in which the information is presented (e.g., ± few elements, ± reasoning). In contrast, the latter refers to those that mirror the processing conditions under which real-time language is often used (e.g., ± planning time, ± prior knowledge). Increasing task complexity along resource-directing dimensions may potentially direct cognitive resources to linguistic form, thus leading to greater accuracy and complexity in oral production, often at the expense of fluency (Gilabert, 2007; Ishikawa, 2008; Robinson, 2001b).
In contrast, increasing task complexity along resource-dispersing dimensions could pose greater demands on attention and working memory, thus diverting attention away from the language code, which could be detrimental to L2 production.
Finally, increased task complexity often results in significantly higher ratings of task difficulty, mental effort, and anxiety while keeping task interest and motivation unaffected (Robinson, 2001b). Research has shown that task complexity manipulation may affect CAF measures differentially in speaking tasks. For example, Jackson and Suethanapornkul’s (2013) systematic review found nonsignificant task complexity effects for syntactic complexity (d = −0.02), small positive effects for accuracy (d = 0.28), and a negligible but positive effect for lexical complexity (d = 0.03), suggesting that increased task complexity led to greater lexical variety/diversity/density, at the expense of speaking fluency (d = −0.16), consistent with the cognition hypothesis. However, the relation between task complexity and L2 pronunciation remains largely unexplored. TBLT research has previously assessed L2 pronunciation as part of speaking fluency or lexical accuracy (Kim & McDonough, 2008) or in terms of pronunciation errors (Kuiken & Vedder, 2011), but few studies have investigated it in relation to global dimensions of pronunciation (Gordon, 2021) or through acoustic analyses (Mora-Plaza, 2023; Solon et al., 2017). To our knowledge, no studies to date have investigated to what extent the predictions of the cognition hypothesis hold for L2 pronunciation in pronunciation-unfocused tasks.
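For readers less familiar with the effect-size metric reported above, the following is a minimal sketch of how a standardized mean difference (Cohen's d with a pooled standard deviation) can be computed for a complex-versus-simple comparison. It is an illustration only: the function name and the scores are hypothetical, and the review cited above may have used a different estimator or weighting scheme.

```python
import math
from statistics import mean, stdev


def cohens_d(complex_scores, simple_scores):
    """Standardized mean difference (complex minus simple) with pooled SD.

    Positive values mean higher scores in the complex condition.
    """
    n1, n2 = len(complex_scores), len(simple_scores)
    s1, s2 = stdev(complex_scores), stdev(simple_scores)
    # Pooled standard deviation across the two conditions
    pooled = math.sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))
    return (mean(complex_scores) - mean(simple_scores)) / pooled


# Hypothetical accuracy scores (proportion of error-free units), not real data
complex_scores = [0.82, 0.79, 0.88, 0.91, 0.85]
simple_scores = [0.80, 0.76, 0.84, 0.90, 0.83]
print(f"d = {cohens_d(complex_scores, simple_scores):.2f}")
```

Under this sign convention, a negative d for fluency would indicate lower fluency in the complex condition, which is how the values above are read.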
Task complexity and L2 pronunciation
One of the current discussions within the realm of L2 pronunciation instruction is whether task-based methodologies can promote attention to L1–L2 phonological differences and create opportunities for learners to acquire L2 sound contrasts and phonological features, and update the phonological form of their lexical representations. Reactive form-focused instructional techniques (e.g., negative feedback) and task design and manipulation (e.g., modality, repetition, complexity) have been found to lead to more accurate L2 pronunciation during communicative task performance (Gurzynski-Weiss et al., 2017). For example, Solon et al.’s (2017) study revealed that L2 Spanish learners produced one out of five Spanish vowel monophthongs (/e/) with a more target-like quality (as assessed through acoustic analyses of formant frequencies) in the complex than the simple version of the task.
On the one hand, recent evidence suggests that, when tasks are designed to promote a focus on phonetic form (i.e., pronunciation-focused tasks), task complexity positively impacts L2 pronunciation accuracy and subsequently leads to gains in L2 phonological development (e.g., Mora-Plaza, 2023). In the same vein, Mora-Plaza et al. (2018) and Mora-Plaza (2023) reported gains in the production of difficult English vowel contrasts for L1 Catalan learners, as measured through Euclidean and Mahalanobis distances, respectively, between L2 confusable vowels. Lastly, Gordon (2021) found that beginner-level English as a foreign language (EFL) learners assigned to a complex decision-making task condition outperformed those assigned to a simple decision-making task condition in comprehensibility (but not in accentedness) after treatment. Together, these studies provide evidence of the potential of task complexity to draw learners’ attention to phonological form and improve pronunciation through a communicative form-focused intervention.
On the other hand, in pronunciation-unfocused tasks, it remains uncertain whether increasing task demands might have detrimental effects on L2 pronunciation. For example, Kuiken and Vedder (2011) investigated the influence of task complexity along ± reasoning demands on L2 performance as a function of mode (i.e., written/oral) and L2 proficiency (i.e., high/low). Although this study was not specifically designed to assess the impact of task complexity on L2 segmental and suprasegmental speech features in the oral version of the task, descriptively, increased task complexity was found to increase lexical and grammatical accuracy but to decrease pronunciation accuracy, especially in the case of low-proficiency L2 learners. These findings point to a potential trade-off between lexico-grammatical and pronunciation accuracy, where more attentional resources might be allocated to the control of lexical and grammatical form than to phonetic form (Derwing, Munro & Wiebe, 1998), in line with Skehan’s (2009, 2015) hypothesis. With a cohort comparable to that of the present study, Mora, Mora-Plaza & Bermejo Miranda’s (2024) study revealed that learners produced significantly fewer lexico-grammatical errors in the complex version of a monologic oral task than in the simple version (Gilabert, 2007; Robinson, 2001b), but the opposite pattern was found for pronunciation, with pronunciation errors being more frequent (though not significantly so) in the complex than in the simple task. Increased attention to lexical and grammatical aspects during task performance may thus make it difficult for learners to allocate attentional resources to pronunciation. For instance, Crowther et al. (2018) assessed the extent to which comprehensibility and accentedness ratings were related to segmental and suprasegmental aspects of L2 speech in three tasks differing in cognitive complexity. One of the findings was that learners’ speech was rated as significantly more strongly accented (but not less comprehensible) in the complex than in the simple task, although the effect sizes were relatively small. These findings would lend support to Kormos’ (1999, 2000) attention and monitoring model of speech processing, which postulates that the demands of the task might determine the amount of attentional resources available during self-monitoring. Consequently, learners might need to pay more attention to lexical and grammatical aspects of speech than to segmental or suprasegmental aspects in order to successfully convey the message. Kormos (1999, 2000) claimed that grammatical and lexical slips of the tongue (i.e., a measure of lexico-grammatical accuracy) are more likely to be detected at a different time than phonological errors while accounting for interindividual variation in L2 proficiency.
Current study
To date, L2 acquisition and TBLT research has emphasized the well-established relation between task complexity and speech production and development. While most research studying the effects of increasing task demands has focused on learners’ grammatical, lexical, and pragmatic performance and development, much less attention has been given to L2 pronunciation and prosody. Therefore, it is necessary to understand whether L2 pronunciation can be attended to when the demands of a task place great strain on learners’ production processes, and learners need to invoke all linguistic resources available.
The primary aim of the current study was to test the predictions of the cognition hypothesis (Robinson, 2001a, 2011) on L2 pronunciation by manipulating task complexity (simple vs. complex) in a pronunciation-unfocused task. The choice of the cognition hypothesis and triadic componential framework (Robinson, 2001a, 2011) as the theoretical model for the conceptualization of this study was motivated by (1) the model’s comprehensive account of the dimensions of task complexity that may influence L2 performance (i.e., the role of reasoning demands as a resource-directing variable, and planning time and prior knowledge as resource-dispersing variables); (2) the model’s predictions regarding the effects of task complexity on L2 oral development (as in Révész, 2009); and (3) the comparability with previous pronunciation-focused (e.g., Gordon, 2021; Mora-Plaza, 2023) and pronunciation-unfocused (e.g., Kuiken & Vedder, 2011) studies, which methodologically manipulated task complexity along Robinson’s triadic componential framework and theoretically explained their findings in light of the cognition hypothesis.
First, we assessed learners’ L2 segmental accuracy (VOT in oral stops and degree of contrastiveness and nativelikeness in vowels) in a simple and a complex version of a problem-solving task. Then we obtained NLs’ ratings of comprehensibility and accentedness for excerpts of learners’ speech samples from both task versions. Finally, in order to establish whether task effects were consistent across acoustic and global measures of pronunciation accuracy, and whether acoustically more accurate productions of oral stops (VOT) and vowels (Mahalanobis distances) predicted NLs’ ratings of comprehensibility and accentedness, we assessed the relationship between acoustic and global measures in the simple and complex tasks. Accordingly, we formulated the following research questions and hypotheses:
RQ1: How does task complexity affect learners’ production of voice onset time (VOT) in word-initial stressed voiceless plosives (/p, t, k/)?
RQ2: Does task complexity have an effect on learners’ vowel production (/iː/-/ɪ/, /æ/-/ʌ/)?
RQ3: How does task complexity affect listeners’ ratings of learners’ comprehensibility and accentedness?
RQ4: To what extent are acoustic and global measures of learners’ pronunciation related in the simple and the complex task?
In line with Kuiken and Vedder’s (2011) and Mora et al.’s (2024) studies, complexifying the task along resource-directing dimensions (± reasoning) in pronunciation-unfocused tasks is predicted to draw learners’ attention away from phonological form, negatively affecting pronunciation accuracy (RQ1 and RQ2). This would lend support to Kormos’ (2000) and Skehan’s (2009, 2015) predictions on potential trade-offs between areas of L2 oral performance, and would contradict Robinson’s (2011) cognition hypothesis. In line with the findings by Crowther et al. (2018), accentedness but not comprehensibility ratings (RQ3) may be affected by task complexity. Acoustic and global measures are hypothesized to be associated (RQ4) in both simple and complex tasks (Crowther et al., 2018); in particular, accentedness ratings are expected to be moderately related to VOT productions (Riney & Takagi, 1999) and vowel accuracy (Munro, 1993).
Method
Participants
Eighty-two undergraduate advanced EFL learners (see Table 1 for demographics) participated in the study for course credit (female = 70, male = 12). They were Catalan-Spanish bilinguals who had learned English as an L2 mainly through formal instruction at school since the age of five. In this bilingual context, learners varied in Catalan–Spanish dominance (12 Catalan-dominant, 23 Spanish-dominant, 42 balanced bilinguals), but this was not expected to affect their L2 pronunciation performance significantly because Spanish and Catalan speakers do not differ in their use of short-lag stops and share the vowel categories /i/ and /a/, the only high-front and low-central vowels in their vocalic system. Participants were randomly assigned to two groups differing in task order: simple (S)>complex (C) (N = 45) or C>S (N = 37).
Notes to Table 1:
a 1 = never, 2 = yearly, 3 = monthly, 4 = weekly, 5 = daily.
b Reading, listening, speaking, writing: 9-point Likert scale from 1 = very poor to 9 = native-like.
c Pronunciation only: 9-point Likert scale from 1 = very poor to 9 = native-like.
d Measured by a Yes/No vocabulary size test (Meara & Miralpeix, 2015).
e Measured by an elicited imitation task (Wu, Tio & Ortega, 2021).
Thirteen NLs (six males, seven females, mean age = 32.9, SD = 7.7) were recruited to evaluate the L2 learners’ speech samples for comprehensibility and accentedness. They were experienced EFL teachers speaking either British (46%) or American (54%) English varieties. They reported being very familiar with Spanish/Catalan-accented English on a 9-point Likert scale (1 = “not familiar at all”; 9 = “very familiar”; M = 8.5, SD = 0.9).
An additional group of eight native speakers (NS) of Southern British English (three males, five females) was recruited to perform the same speaking tasks as the learners in order to obtain baseline speech data. They were all EFL teachers (mean age = 41.1, SD = 12.0) who had had a predominantly monolingual upbringing and had lived in Spain for a mean of 16.3 years (SD = 12.6).
L2 oral narrative: the dinner table task
Learners performed a simple and a complex monologic version of the dinner table task (Ur, 1981) in a recording booth on different days. In this task, learners had to decide on and justify the seating arrangement of several characters at different tables. To control for the potential confound of task sequence and complexity, task order was counterbalanced: roughly half of the participants performed the tasks in the S>C order and the rest in the C>S order. Both the simple and the complex versions of the task involved four stages: 1) a listening pretask, 2) a speaking task part 1, 3) a speaking task part 2, and 4) a posttask questionnaire.
The pretask consisted of a listening activity, in which participants were given an answer sheet containing a list of target words that defined each one of the characters (Appendix A). They were asked to read the words out loud and ask for the meaning of any they did not understand. Then, they heard a recorded description of the dinner party and were asked to match each character with the words that related to them. Based on Willis’ (1996) task-based learning framework, the purpose of this priming stage was to familiarize learners with the task procedure, to introduce the characters’ personality traits, professions, and hobbies, and to provide learners with the necessary linguistic resources (i.e., semantic and phonetic form of words) to be able to successfully complete the main communicative task. In this way, we attempted to reduce cognitive demands at the level of resource-dispersing variables (Robinson, 2011).
In the first part of the speaking task, participants were given a picture of tables with six characters sitting at them (two characters at three tables in the simple version, three characters at two tables in the complex version), and they were asked to carefully consider the seating arrangement and justify why it would not work based on the attendees’ personality traits, professions, and hobbies (Appendix B). They were given 1.5 minutes of planning time and were encouraged to provide as many reasons as they could think of by exploiting all personality features while considering an appropriate seating arrangement to ensure a comprehensible and thoughtful response. The same procedures were applied to the second part of the task, except that the tables were empty and participants were given six cards (one for each character) and were asked to decide on a new seating arrangement that would lead to a pleasant party. The task was closed and guided, as it was key for assessment purposes that participants produced specific items containing the target vowels and consonants. However, participants were completely unaware that the task was designed to analyze their pronunciation, and their attention was not drawn to the target phonetic forms at any time; hence, it was a purely pronunciation-unfocused task. The dinner table task, including its instructions and printable materials, is deposited in the open science repository SLA Speech Tools (Mora-Plaza, Saito, Suzukida, Dewaele & Tierney, 2022: http://sla-speech-tools.com/).
Finally, a task-performance questionnaire was administered immediately after learners had completed the task (Appendix C). They were asked to rate how well they had performed in the task, how difficult they had perceived the task to be, how much mental effort they had put into it, and how anxious they had felt during task performance on a 9-point scale (1 = very poorly, not difficult, no mental effort, not anxious; 9 = very well, extremely difficult, extreme mental effort, very anxious).
Target words
The target words comprised names (e.g., Tilly Killey, John Butler) and adjectives (e.g., impulsive, deceitful) describing the characters’ personality, beliefs, occupation, and interests, and were identical in the simple and complex versions of the task, except for some characters’ names and surnames, which contained the same target vowels but had to differ in order to identify different characters and their pictures. Both versions contained 27 words with the target consonants (10 /p/, 9 /t/, 8 /k/) and 36 words with the target vowels (/iː/, /ɪ/, /æ/, and /ʌ/; 9 each) (Appendix D). Obtaining a minimum number of productions of the target sounds from the L2 learners was a prerequisite for obtaining reliable acoustic measures.
VOT was chosen as an index of segmental pronunciation accuracy for consonants because it has been found to be sensitive to experience-related factors in L2 speech production for Catalan/Spanish learners of English (Gorba & Cebrian, 2021), and to linguistic environment (Olson, 2020) and local contextual effects in code-switching tasks (Olson, 2013). While English has long-lag voiceless stops (40–80 ms; e.g., Docherty, 1990), Spanish voiceless stops are short-lag (7–20 ms; e.g., Castañeda, 1986). Due to a lack of awareness of cross-language differences between L1 and L2 voiceless plosives (Flege, 1995) and reliance on different phonetic cues (e.g., presence or absence of closure voicing; Mora, Rochdi & Kivistö-de Souza, 2014), Spanish learners of English tend to produce English stops with “intermediate” VOT values falling short of English VOT.
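As a rough illustration of the measure, VOT can be derived from two annotated time points per token: the stop burst release and the onset of voicing. The sketch below assumes hand-labeled timestamps in seconds and uses the ranges cited above only as illustrative cut-offs; the function names, thresholds, and token values are hypothetical and do not reproduce the study's measurement protocol.

```python
# Illustrative cut-offs based on the ranges cited in the text; treating them
# as category boundaries is an assumption, not the study's criterion.
SHORT_LAG_MAX_MS = 20.0  # upper end of the Spanish short-lag range (7-20 ms)
LONG_LAG_MIN_MS = 40.0   # lower end of the English long-lag range (40-80 ms)


def vot_ms(burst_release_s, voicing_onset_s):
    """VOT in milliseconds: voicing onset minus stop burst release."""
    return (voicing_onset_s - burst_release_s) * 1000.0


def classify_vot(vot):
    """Label a positive-lag VOT value relative to the two illustrative ranges."""
    if vot <= SHORT_LAG_MAX_MS:
        return "short-lag"
    if vot >= LONG_LAG_MIN_MS:
        return "long-lag"
    return "intermediate"


# Hypothetical annotations: (burst release, voicing onset) in seconds
tokens = {"p": (0.512, 0.547), "t": (1.204, 1.221), "k": (2.030, 2.092)}
for stop, (burst, voicing) in tokens.items():
    value = vot_ms(burst, voicing)
    print(f"/{stop}/: {value:.1f} ms ({classify_vot(value)})")
```

A token labeled "intermediate" here corresponds to the pattern described above, in which learners' VOT exceeds L1 short-lag values but falls short of English long-lag values.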
The target vowels (/iː/, /ɪ/, /æ/, /ʌ/) form high and low phonologically contrastive vowel pairs in English (/iː/-/ɪ/ and /æ/-/ʌ/, respectively) that are difficult to distinguish qualitatively in perception and production for L1–Catalan/Spanish learners of English. This is because the vowels in each pair are perceptually mapped onto a single native vowel category that is acoustically located between the English vowels (Spanish /i/ for the high vowel contrast and Spanish /a/ for the low vowel contrast). Cross-language perceptual assimilation tasks (Cebrian, 2019) show that English /iː/ and /ɪ/ are identified as Spanish /i/, whereas English /æ/ and /ʌ/ are identified as Spanish /a/ (Rallo Fabra & Romero, 2012). In addition, Spanish learners of English produce very small spectral distances in the production of these L2 vowel contrasts (Darcy, Mora, & Daidone, 2016), failing to distinguish contrastive vowels effectively in production. Although the high vowels are less consistently identified as Spanish /i/ than the low vowels are as Spanish /a/, both are confusable vowel contrasts posing great difficulty in perception and production at the phonetic (prelexical) and lexical levels for advanced Spanish learners of English (Mora & Mora-Plaza, 2019).
Although Spanish learners of English have been shown to rely on temporal cues (i.e., duration) in the perception and production of /æ/-/ʌ/ and /iː/-/ɪ/ (Cebrian, Reference Cebrian2006; Mora & Fullana, Reference Mora and Fullana2007; Rallo-Fabra & Romero, Reference Rallo Fabra and Romero2012), we opted to focus on spectral rather than temporal cues, as recent research has shown that duration ratios for /iː/-/ɪ/ in production are not larger than those of NS (Cebrian, Gorba & Gavaldà, Reference Cebrian, Gorba and Gavaldà2021), and because temporal cues are likely to be more readily affected by speakers’ individual differences in speaking style (e.g., speech rate) and durational variability associated with the position of the target words in the utterance and their prosodic prominence. Therefore, potential task complexity effects on vowel production accuracy were examined only with respect to the spectral aspects of the target vowel contrasts.
Manipulation of task complexity
Task complexity was manipulated along ± reasoning demands (Robinson, Reference Robinson and Robinson2011) by varying the number of characters seated at each table (two vs. three), and the combination of personality traits (coherent vs. incoherent). The complex task was therefore more demanding than its simple counterpart because seating three people with incoherent traits at the same table requires more cognitively demanding decisions. The two versions of the task shared several elements and the target lexical items but differed in the characters’ names and the distribution of table and personality characteristics. The manipulation of task complexity in the present study differs from that of previous studies in that participants were given the target words they had to use in both the simple and the complex versions of the task.
The outcome of the task-performance questionnaires revealed that, overall, learners perceived themselves to have performed worse, felt more anxious, and employed more effort in the complex than in the simple task. Learners perceived the complex task to be significantly more difficult (M = 5.11, SD = 1.82) than the simple task (M = 4.63, SD = 1.88; F[81] = 4.19, p = .044) and to require significantly more mental effort (M = 5.93, SD = 1.76) than the simple task (M = 5.41, SD = 1.73; F[81] = 9.40, p = .003), which is in line with Robinson’s hypothesis and empirical findings (Robinson, Reference Robinson2001b).
Procedures
Participants first filled in an online background questionnaire. In the lab, they performed an elicited imitation task (Wu et al., Reference Wu, Tio and Ortega2021), a Yes/No vocabulary size test (Meara & Miralpeix, Reference Meara and Miralpeix2015), and the L2 oral production task, immediately followed by the posttask questionnaire. The simple and complex versions of the speaking task were performed on two different days and counterbalanced in order to mitigate potential task interference and carryover effects, and to minimize fatigue effects on participants, ensuring each version received undivided attention and engagement. Learners’ oral productions were recorded in a soundproof booth on Marantz PMD661 solid-state digital recorders with an external Shure SM58 voice microphone at a sampling frequency of 44.1 kHz.
Data analyses and speech production measures
The speaking task generated approximately 5 minutes of speech per task and participant. The average length of oral narratives was similar in both versions of the task for learners (Simple: M [sec] = 309, SD = 104, M [min] = 5.16; Complex: M [sec] = 308, SD = 108, M [min] = 5.13) and NS (Simple: M [sec] = 320, SD = 102, M [min] = 5.33; Complex: M [sec] = 334, SD = 132, M [min] = 5.56).
Recorded amplitude-by-time waveforms were automatically segmented into speech or pause (> 250 milliseconds) intervals using the Annotate to TextGrid (silences) command in Praat (Boersma & Weenink, Reference Boersma and Weenink2015), manually adjusted for segmentation inaccuracies, and orthographically transcribed. Filled and silent pauses, analysis of speech (AS) units, speech dysfluencies (repetitions), and lexical, grammatical, and pronunciation errors were manually annotated (Appendix E). The transcribed text and the corresponding audio files were submitted to the WebMAUS Basic automatic segmentation system (Schiel, Reference Schiel, Ohala, Hasegawa, Ohal, Granville and Bailey1999) to obtain labeled word and sound intervals (Footnote 1). Words containing the target oral stops and vowels were identified for acoustic measurement.
The VOT in the prevocalic voiceless oral stops (/p/, /t/, /k/) of word-initial stressed syllables in the target words was annotated manually in Praat and measured in milliseconds from the onset of the release burst of the stop consonant to the first positive peak of periodic energy of the following vowel. To exclude potential measurement errors, VOT values lying more than 2.5 SD from each subject’s mean were screened out (1.4%).
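The screening step described above can be illustrated with a minimal sketch. It assumes the 2.5 SD criterion is applied once to each subject's full set of VOT values, using the sample standard deviation; the function name, the data layout, and the VOT values are hypothetical and are not taken from the study.

```python
from statistics import mean, stdev


def screen_vot(vot_by_subject, n_sd=2.5):
    """Keep only VOT values within n_sd sample SDs of each subject's own mean."""
    kept = {}
    for subject, values in vot_by_subject.items():
        if len(values) < 2:
            # An SD cannot be estimated from a single token; keep it unchanged.
            kept[subject] = list(values)
            continue
        m, sd = mean(values), stdev(values)
        kept[subject] = [v for v in values if abs(v - m) <= n_sd * sd]
    return kept


# Hypothetical VOT values (ms) for one learner; 120 ms mimics a measurement error.
vot_data = {"S01": [38, 41, 44, 40, 39, 42, 43, 37, 40, 41, 120]}
screened = screen_vot(vot_data)
removed = len(vot_data["S01"]) - len(screened["S01"])
```

Because the criterion is relative to each subject's own mean, a learner with generally short VOT and a learner with generally long VOT are screened on their own scales rather than against a group value.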
Vowel quality was measured by extracting frequency measurements (F1, F2) from a 10-ms window by manually placing the cursor at the midpoint of the steady-state portion of the target stressed vowels. Frequency values were then converted to Bark (B), a psychoacoustic scale that expresses frequency differences in vowel quality in terms of their impact on human perception and helps minimize interspeaker variability in vocal tract size. Bark-converted frequencies were then used to estimate the degree of vowel height (B1) and frontness (B2) and to determine distributions of vowel tokens for the English vowel categories /iː/, /ɪ/, /æ/ and /ʌ/ on a two-dimensional B1–B2 space. Changes in vowel production accuracy resulting from task complexity were estimated through Mahalanobis distances between learners’ vowel productions and the corresponding vowel spaces of the control NS. Mahalanobis distances compute the distance in standard deviations between a point and the centroid of the distribution and take into consideration not only the centroid location but also the spread and orientation of the reference distribution, thus reflecting token variability (Kartushina, Hervais-Adelman, Frauenfelder & Golestani, Reference Kartushina, Hervais-Adelman, Frauenfelder and Golestani2015; Melnik-Leroy, Turnbull & Peperkamp, Reference Melnik-Leroy, Turnbull and Peperkamp2022). We computed Mahalanobis distance scores (DS) between vowels for the high (/iː/-/ɪ/) and the low (/æ/-/ʌ/) vowel contrasts in the simple and the complex task as a measure of contrastiveness (i.e., how distinct vowel quality was within the contrast); hence, a larger distance meant less of an overlap between the two vowels. We also computed, for both tasks, Mahalanobis distances between learners’ and NS’ productions of these vowels (/iː/, /ɪ/, /æ/, /ʌ/) as a measure of nativelikeness (i.e., how much learners’ vowel qualities approximate those of NS); hence, a smaller distance meant a more target-like production.
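As an illustration of the two steps just described, the sketch below converts formant values to Bark and computes the Mahalanobis distance of a single token from a reference distribution in the B1–B2 plane. The text does not state which Hz-to-Bark formula was used, so Traunmüller's (1990) approximation is assumed here; the function names and all formant values are hypothetical.

```python
import math


def hz_to_bark(f_hz):
    """Hz to Bark using Traunmueller's (1990) approximation (an assumption here)."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53


def mahalanobis_2d(point, reference):
    """Distance (in SD units) from a (B1, B2) point to the centroid of reference tokens."""
    n = len(reference)
    m1 = sum(p[0] for p in reference) / n
    m2 = sum(p[1] for p in reference) / n
    # Sample covariance matrix of the reference cloud (spread and orientation).
    s11 = sum((p[0] - m1) ** 2 for p in reference) / (n - 1)
    s22 = sum((p[1] - m2) ** 2 for p in reference) / (n - 1)
    s12 = sum((p[0] - m1) * (p[1] - m2) for p in reference) / (n - 1)
    det = s11 * s22 - s12 ** 2
    if det <= 0:
        raise ValueError("reference tokens are degenerate (singular covariance)")
    dx, dy = point[0] - m1, point[1] - m2
    # d^T S^-1 d written out for the 2 x 2 case.
    squared = (s22 * dx * dx - 2 * s12 * dx * dy + s11 * dy * dy) / det
    return math.sqrt(squared)


# Hypothetical NS tokens of one vowel category as (F1, F2) in Hz.
ns_tokens_hz = [(400, 2000), (420, 1950), (380, 2100), (410, 1900), (390, 2050)]
ns_tokens_bark = [(hz_to_bark(f1), hz_to_bark(f2)) for f1, f2 in ns_tokens_hz]

# Two hypothetical learner tokens: one close to the NS cloud, one far from it.
d_near = mahalanobis_2d((hz_to_bark(400), hz_to_bark(2000)), ns_tokens_bark)
d_far = mahalanobis_2d((hz_to_bark(300), hz_to_bark(2300)), ns_tokens_bark)
```

Under these assumptions, a smaller distance to the NS cloud corresponds to a more target-like token (nativelikeness), and the same function applied to tokens of one vowel against the cloud of its contrasting vowel would give a contrastiveness score.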
Although increased contrastiveness between vowels is, by hypothesis, assumed to index improved accuracy in production in phonetic training studies using minimal-pair testing stimuli (e.g., Melnik-Leroy et al., Reference Melnik-Leroy, Turnbull and Peperkamp2022), the spontaneous nature of the oral task we used to elicit L2 speech did not allow us to have control over the stimuli the learners produced. Thus, a measure of nativelikeness separately computed for each target vowel was deemed more appropriate than a measure of contrastiveness to gauge task complexity effects on vowel production accuracy within subjects.
Measures of comprehensibility and accentedness were obtained from 13 NL who rated 164 speech excerpts approximately 45-sec long (M = 46.2, SD = 2.6) from the second part of the learners’ simple and complex versions of the speaking task on 9-point scales (1 = very difficult to understand / not accented at all, 9 = very easy to understand / very strongly accented). We extracted the excerpts from the second part of the participants’ performance owing to heightened cognitive demands. In this part, they had to propose a new seating arrangement that would guarantee a pleasant party after having provided reasons why the given arrangement would not work. The speech samples had been previously normalized for peak and mean amplitude and bandstop-filtered at 50Hz. Each rater judged each speech sample twice, first for comprehensibility and then for accentedness. These dimensions and the rating procedure were explained to raters in a training session. The rating task contained three practice trials from participants not included in the study. The 164 speech samples from the 82 learners were distributed randomly in four rating sessions taking place on different days. Within every session, lasting approximately 1 hour, all the speech samples were fully randomized and presented in blocks of 15 separated by short breaks in order to minimize the influence of familiarity. Interrater reliability (Cronbach’s alpha intraclass correlation coefficients) of the NL’ ratings was high for comprehensibility (α = .90) and accentedness (α = .94), so single mean scores per speech sample were computed by averaging across all ratings for each rated measure. No notable differences were detected between British and American raters in their evaluations of the speech samples.
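For readers less familiar with the reliability index reported above, the following sketch computes Cronbach's alpha with raters treated as items and speech samples as cases, and then averages across raters to obtain one score per sample. The ratings are hypothetical and the function name is illustrative; this is not the authors' script.

```python
from statistics import mean, variance


def cronbach_alpha(ratings):
    """ratings: one row per speech sample, one column (score) per rater."""
    k = len(ratings[0])  # number of raters
    rater_variances = [variance([row[j] for row in ratings]) for j in range(k)]
    total_variance = variance([sum(row) for row in ratings])
    return k / (k - 1) * (1 - sum(rater_variances) / total_variance)


# Hypothetical 9-point comprehensibility ratings: 5 samples x 3 raters.
ratings = [
    [7, 8, 7],
    [3, 4, 2],
    [5, 5, 6],
    [8, 9, 9],
    [2, 3, 3],
]
alpha = cronbach_alpha(ratings)

# With high agreement, a single mean score per speech sample is a reasonable summary.
mean_scores = [mean(row) for row in ratings]
```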
Statistical Analyses
Mixed-effects structures were defined for each one of the models. Fixed and random-effects structures for all analyses in this study were selected based on the best fitting model (i.e., comparing Akaike information criterion [AIC] estimators across models), and random slopes were only included if they improved the model’s fit (i.e., AIC decreased), provided that the model could converge. Finally, Bonferroni adjustments were used for pairwise contrasts, and parameter estimates are reported in Appendix F. The assumptions of no multicollinearity, normal distribution of residuals, and homoscedasticity were all met.
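The model selection rule described here can be sketched as follows, using the standard definition AIC = 2k − 2 ln L, where k is the number of estimated parameters and L the maximized likelihood. The candidate models, log-likelihoods, and parameter counts are hypothetical, and statistical packages may count parameters differently.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: lower values indicate a better trade-off."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical candidates: (maximized log-likelihood, number of parameters).
candidates = {
    "random intercept only": (-1520.4, 6),
    "random intercept + by-subject slope for task": (-1519.9, 8),
}

scores = {name: aic(ll, k) for name, (ll, k) in candidates.items()}

# The random slope is retained only if it lowers the AIC.
best = min(scores, key=scores.get)
```

In this hypothetical case the small gain in likelihood does not offset the two extra parameters, so the simpler random-intercept model would be kept.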
Results
Although we were expecting, in light of previous research (Jackson & Suethanapornkul, Reference Jackson and Suethanapornkul2013), the simple and the complex versions of the dinner table task to affect speech production linguistically in terms of lexical and grammatical accuracy and complexity, as well as in speaking fluency, we found the task manipulation effects to be very small in magnitude (see Table 2 below). The results indicated large interlearner variability in all the measures and very small effects (if at all observable) for lexical and grammatical complexity and accuracy, as well as for measures of speed and repair fluency. For breakdown fluency, the duration of pauses between AS units was substantially longer in the complex than in the simple version of the task.
A series of mixed-effects models were run in SPSS 27 on all the complexity, accuracy, and fluency measures, which included a random intercept for subject and task sequence (S>C, C>S) as a fixed factor in addition to task (simple, complex) to control for potential effects of learners performing the simple or the complex task first. The analysis did not reveal any significant effects of task, except for the duration of pauses at AS unit boundaries, which turned out to be significantly longer in the complex than the simple task (F[1, 161] = 12.98, p < .001), probably indicating greater conceptualization and formulation difficulties in the complex than the simple task. The effect of task sequence did not reach significance in any of these analyses. Thus, our task complexity manipulation affected breakdown fluency significantly, but not other dimensions of oral production (i.e., complexity, accuracy). The need to elicit specific forms for acoustic measurement, which forced us to constrain the task in terms of lexical choice, probably limited the variability in learners’ use of lexical and grammatical resources. However, as reported above, learners perceived the complex task as significantly more demanding, requiring greater effort, and posing higher levels of difficulty and anxiety compared to the simple task. Thus, we interpreted the significant effects of task complexity in breakdown fluency and learners’ post-performance self-reports to suggest that the two versions of the task could vary in how much attention to phonetic form they allowed, which could potentially lead to differences in pronunciation accuracy.
Task complexity and consonant production
Prior to assessing the effects of task complexity on learners’ VOT productions (RQ1), we looked into the VOT differences between learners’ and English NS’ productions of /p/, /t/, and /k/. After a square root transformation was applied to VOT values, linear mixed-effects models with speaker group (NS, learner), consonant (/p/, /t/, /k/) and their interaction as predictors, and by-subject random intercepts, revealed significant main effects of speaker group (F[10985] = 33.11, p < .001) and consonant (F[10985] = 726.27, p < .001), and a significant speaker group × consonant interaction (F[10985] = 55.52, p < .001). As expected, these results showed that NS produced significantly more aspirated consonants (M = 63.25 ms, SD = 21.61) than L2 learners (M = 42.21 ms, SD = 19.46) did (i.e., 49.85% longer VOT) and that /p/ was the least aspirated consonant, followed by /k/ and /t/. The VOT in the production of English /t/ tends to be more target-like in Spanish learners of English than /p/ and /k/ because learning to produce a different L2-specific place of articulation (alveolar in English vs. dental in Spanish) enhances overall articulatory accuracy in the degree of stricture and laryngeal timing (e.g., Mora, Reference Mora, Pérez-Vidal, Juan-Garau and Bel2008). The interaction arose because, although VOT differences were present in all target consonants, NS produced greater VOT values for /k/ > /t/ > /p/ and learners for /t/ > /k/ > /p/ (Table 3).
Note: M: mean, SD: standard deviation, CI: confidence interval.
The effects of task complexity on learners’ VOT productions were assessed by fitting the learners’ VOT to a generalized linear mixed-effects model (gamma regression) with task (simple, complex), consonant (/p/, /t/, /k/) and their interaction as fixed effects. Task sequence (S>C, C>S) was also included as a fixed effect to control for potential task order effects. The random-effects structure included a random intercept for subject (see Appendix F for parameter estimates). The model yielded a significant main effect of task (F[9956] = 26.08, p < .001) because learners’ VOT productions were significantly more aspirated (i.e., 3.46% more accurate) in the simple (M = 43.99 ms, SD = 23.34, 95% CI = 43.33–44.65) than the complex (M = 42.52 ms, SD = 22.97, 95% CI = 41.89–43.15) task, as well as significant main effects of consonant (F[9956] = 2907.01, p < .001) and task sequence (F[9956] = 7.02, p = .008). The task × consonant interaction did not reach significance (F[9956] = .86, p = .423). Bonferroni-adjusted pairwise contrasts indicated that the main effect of task held for all three target consonants: /k/ (t[9956] = 4.02, p < .001), /t/ (t[9956] = 2.57, p = .010) and /p/ (t[9956] = 2.29, p = .022) (Figure 1). The main effect of task was not significant (F[1021] = 2.30, p = .129) when the same model structure was applied to NS’ VOT productions.
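The percentages reported for the VOT comparisons can be reproduced from the group means, on the assumption that each percentage expresses the difference relative to the smaller mean. The helper name below is illustrative only.

```python
def percent_longer(longer_ms, shorter_ms):
    """Relative difference of the longer mean over the shorter mean, in percent."""
    return (longer_ms - shorter_ms) / shorter_ms * 100


# NS vs. learners (means reported for the speaker group comparison).
ns_vs_learners = round(percent_longer(63.25, 42.21), 2)

# Learners' simple vs. complex task (means reported for the task comparison).
simple_vs_complex = round(percent_longer(43.99, 42.52), 2)
```

Read this way, the task effect (about 1.5 ms) is small relative to the learner-NS gap (about 21 ms), which helps put the size of the task complexity effect in perspective.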
Task complexity and vowel production
Task complexity effects on vowel distribution
Preliminary analyses of vowel quality unexpectedly revealed changes in NS’ vowel quality in the same direction as those observed in learners, which we attributed to the unbalanced distribution of the /æ/ and /ʌ/ vowel tokens across words and the use of different proper names in the simple and the complex task (Footnote 2). Therefore, we decided to exclude all words corresponding to character names that were different in the simple and the complex version of the task (/iː/: Keane, Keith; /ɪ/: Killey, Pickett; /æ/: Ann, Kang, Sam, Tang; /ʌ/: Butler, Cutler). Given the high frequency of these words, this screening procedure resulted in considerable data loss, as a further group of 17 learners had to be excluded for not meeting the criterion of having at least three tokens of each one of the four target vowels. Consequently, the final vowel data set consisted of a total of 5,456 vowel tokens for learners (N = 60) distributed relatively evenly by vowel contrast (/iː/: 1494; /ɪ/: 1668; /æ/: 1160; /ʌ/: 1134) and 814 vowel tokens for NS (N = 8) (/iː/: 225; /ɪ/: 274; /æ/: 171; /ʌ/: 144).
These vowel data (see Table 4) showed that learners produced a lower /iː/ (higher B1) and a higher /ɪ/ (lower B1) than NS did both in the simple and complex task, whereas the learners’ /æ/ and /ʌ/ differed only minimally from NS’ /æ/ and /ʌ/ in the simple task, but in the complex task /ʌ/ was less target-like (higher B1, i.e., a lower, more /a/-like articulation). As regards fronting (B2), learners realized the lax vowels /ɪ/ and /ʌ/ with higher B2 both in the simple and the complex task than NS did, indicating a more /iː/-like production of /ɪ/, and a more /æ/-like production of /ʌ/. Such learner-NS differences indicate less target-like vowel quality in the production of the English lax vowels /ɪ/ and /ʌ/ than in the production of the English tense vowels /iː/ and /æ/, which is consistent with English /iː/ and /æ/ being a closer match to the corresponding high /i/ and low /a/ Spanish vowel categories in perception and production. This is reflected in Figure 2, where the red ovals representing the distribution of learners’ /iː/ in the simple (S) and complex (C) tasks overlap to a greater extent with NS’ distribution of /iː/ tokens (black dotted line) than the green ovals representing learners’ /ɪ/ tokens overlap with the distribution of NS’ /ɪ/ tokens. A similar picture can be observed for /æ/ and /ʌ/. It is also clear from Figure 2 that (a) NS’ productions of /iː/ and /ɪ/ and of /æ/ and /ʌ/ result in truly contrastive nonoverlapping distributions (black dotted lines), whereas for learners the distributions of contrastive high and low vowels overlap considerably (colored ovals), and (b) there is very little difference between learners’ vowel productions in the simple (solid-colored lines) and the complex task (dashed colored lines). We next present the analyses of Mahalanobis distances between contrastive vowels (contrastiveness) for learners and NS, and between learners and NS (nativelikeness) for each of the four target vowels.
Note: Data from 60 learners and 8 NS (17 learners were excluded because they had fewer than three realizations of the target vowels) after screening words that were different in the simple and the complex task (Ann, Butler, Cutler, Kang, Keane, Keith, Killey, Pickett, Sam, Tang).
The spontaneous nature of the speaking task generated large variability as regards the number of vowel tokens each participant produced and their quality (as assessed through first and second formant frequency measurements), which led to large dispersion clouds both in learners and NS. Still, the distribution of NS’ vowel productions shows much less overlap between /iː/ and /ɪ/ and between /æ/ and /ʌ/, than the distribution of learners’ vowel productions (Figure 2), indicating a larger degree of contrastiveness in NS than in learners, as expected. In addition, as shown in Figure 3, task complexity had very little effect overall on vowel quality and the distributions of learners’ vowels do not seem to consistently present a larger overlap with NS’ distributions for either the simple (S) or the complex (C) task, indicating little effect of task complexity on the degree of nativelikeness of the target vowels.
Task complexity effects on vowel contrastiveness
For contrastiveness, the log-transformed Mahalanobis DS between the vowel spaces of the vowel pairs /iː/-/ɪ/ and /æ/-/ʌ/ were submitted to mixed-effects models (separately for learners and NS) with task, contrast, and their interaction as fixed factors, and a random intercept for subject. We also included sequence (S>C, C>S) to control for potential task order effects (see parameter estimates in Appendix F). Overall, as expected (see Figure 4), the magnitude of the distinction learners made between contrastive vowels (between 4 and 10 SD) was much smaller than the distinction NS made (20–25 SD). For learners, tests of fixed effects revealed a main effect of contrast (F[1, 5430] = 380.7, p < .001) and a significant task × contrast interaction (F[1, 5430] = 7.5, p = .006), but the main effect of task did not reach significance (F[1, 5430] = 2.4, p = .121). This interaction arose because the log-transformed DS between the low vowels /æ/ and /ʌ/ were significantly larger in the complex (8.83) than the simple task (7.17; t[5430] = 2.82, p = .005), whereas the DS between the high vowels /iː/ and /ɪ/ did not differ significantly across tasks (3.75 vs. 3.76). The main effect of contrast, caused by log-transformed DS being much larger in low (7.87) than in high (3.76) vowels, was mainly driven by the DS between /æ/ and /ʌ/ in the complex task. It is uncertain why the more complex task led to larger contrastiveness for the low vowels but not for the high vowels: if the more complex task generated higher attention to phonetic form, we would expect the effect to be observable also in the high vowels. Running the same mixed-effects model on the NS’ data can help us elucidate this, as we would not expect the quality of their vowels and their degree of contrastiveness to change significantly as a function of task complexity.
Unexpectedly, for NS main effects of task (F[1, 783] = 4.33, p = .038) and contrast (F[1, 783] = 7.43, p = .007) were found to be significant, whereas the task × contrast interaction did not reach significance (F[1, 783] = .005, p = .994). However, unlike learners, for NS, DS was larger overall for the high (26.05) than the low vowels (19.13). For NS, the main effect of contrast was driven by the joint contribution of DS being larger in high than low vowels both in the simple (25.21 vs 22.34; t[783] = 2.02, p = .044) and the complex (27.16 vs. 18.08; t[783] = 1.84, p = .066) tasks, whereas the main effect of task was driven by the joint contribution of DS being larger in the complex than the simple task (though nonsignificantly) in high vowels (27.15 vs 25.21; t[783] = 1.62, p = .106) and DS being larger in the simple than the complex task (though nonsignificantly) in low vowels (22.34 vs. 18.08; t[783] = 1.38, p = .169). The main effect of sequence did not reach significance for either learners (F[1, 5430] = .024, p = .876) or NS (F[1, 783] = .005, p = .994). As argued above, given the unbalanced contribution of vowel tokens across participants, we deemed Mahalanobis DS between contrastive vowel pairs to represent potential task complexity effects less reliably than Mahalanobis DS between learners’ and NS’ realizations of vowels embedded in the same set of target words even if, given the spontaneous nature of the speaking task, different speakers contributed a different proportion of target words to the final data set.
Task complexity effects on vowel nativelikeness
Mahalanobis DS between learners’ productions and NS’ vowel spaces (see Figure 5) do not show consistent task complexity effects on how native-like the vowel productions were. It was only for /æ/ that learners produced slightly larger DS with respect to NS, indicating a less target-like realization of /æ/, in the complex than the simple task. Mahalanobis DS (log-transformed) between learners’ vowel productions and the vowel spaces of the native vowels /iː/, /ɪ/, /æ/, and /ʌ/ were submitted to a mixed-effects model with task, vowel, and their interaction, as well as task sequence as fixed factors and a random intercept for subject. Tests of fixed effects revealed a main effect of vowel (F[3, 5480] = 71.35, p < .001) because in both tasks /iː/ (S: 1.53, C: 1.54) was realized with smaller median DS (more accurately) than /ɪ/ (S: 2.70, C: 2.70), and /æ/ (S: 2.28, C: 2.66) was realized in a more target-like manner than /ʌ/ (S: 3.18, C: 3.22). However, neither the effect of task (F[1, 5325] = .493, p = .483) nor the task × vowel interaction (F[1, 5325] = 1.98, p = .102) nor the effect of sequence (F[1, 5480] = 1.61, p = .205) reached significance. According to Bonferroni-adjusted pairwise contrast tests, the less accurate realization of /æ/ we observed in the complex (DS: 2.66) than the simple task (DS: 2.28) did not reach significance either (t[5480] = -1.84, SE = .032, p = .066) (see Appendix F).
Task complexity and global pronunciation ratings
Task complexity effects on comprehensibility and accentedness were assessed through linear mixed-effects models with task and task sequence as predictors of NL’ ratings, and by-subject and by-rater random intercepts. As shown in Table 5, learners’ speech was rated as less comprehensible (F[2129] = 3.72, p = .054), albeit nonsignificantly, and significantly more accented (F[2129] = 5.16, p = .023) in the complex than the simple task. Despite the relatively small task differences, increasing the demands of the task seemed to detrimentally affect their pronunciation globally. Task sequence did not appear to have a significant main effect on comprehensibility (F[2129] = 1.22, p = .269) or accentedness (F[2129] = 3.62, p = .057) (see Appendix F).
Associations between acoustic measures and global ratings
The relationship between acoustic measures (i.e., VOT, vowel quality) and pronunciation ratings (i.e., comprehensibility, accentedness) was assessed through Spearman rank-order correlation coefficients (for subjects with valid data for all measures, N = 77) (Footnote 4). VOT was moderately related to comprehensibility (r[154] = .37, p < .001) and accentedness (r[154] = -.51, p < .001) ratings, indicating that learners with longer VOT were perceived to be more comprehensible and less strongly accented. In terms of vowel quality, comprehensibility was weakly associated with Mahalanobis distances between contrastive vowels (r[154] = .30, p < .001) and with respect to NS (r[154] = -.25, p = .002), suggesting that learners producing a larger contrast between the target vowels and producing them more accurately were perceived to be more comprehensible, whereas accentedness was weakly associated with Mahalanobis distances between contrastive vowels only (r[154] = -.30, p < .001). These associations were stronger in the complex than the simple task (see Table 6), suggesting that increased task demands strengthened these relationships, especially between VOT and accentedness.
Note: * p < .05 (2-tailed); ** p < .001 (2-tailed).
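The rank-order correlations reported in this section can be illustrated with a minimal sketch that ranks both variables (averaging tied ranks) and then computes a Pearson correlation on the ranks. The function names and the VOT and accentedness values are hypothetical; they only show why longer VOT paired with lower accentedness ratings yields a negative coefficient.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-learner mean VOT (ms) and mean accentedness rating (1-9).
vot_ms = [30, 35, 42, 50, 61]
accentedness = [8, 7, 7, 5, 3]
rho = spearman(vot_ms, accentedness)
```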
Discussion
The current study did not find significant effects of task complexity on speech production in terms of lexis, grammar, and fluency (except for the duration of pauses at AS unit boundaries), against the predictions of the cognition hypothesis (Robinson, Reference Robinson and Robinson2001a, Reference Robinson and Robinson2011) and the outcomes of previous studies (e.g., Révész, Reference Révész2009). The effects were in the expected direction, but very small, which may be attributed to the nature of the speaking task, specifically designed to elicit target L2 sounds for acoustic measurement. This design provided learners with plenty of linguistic resources to perform the task, which most likely washed out task complexity effects that could have otherwise emerged. In addition, the rather advanced proficiency of L2 learners might have minimized the internal competition for attentional resources in the areas of complexity, accuracy, and fluency (Skehan, Reference Skehan and Bygate2015) during L2 oral performance. Investigating L2 speaking fluency, accuracy, and complexity from a dynamic approach (e.g., De Jong, Reference De Jong2023) could also provide further evidence of trade-off effects (e.g., fluency being impeded by difficulty in complex word retrieval) operating during oral production on a very small timescale.
In the present study, detrimental task complexity effects on learners’ pronunciation (RQ1: consonant production, RQ2: vowel production, RQ3: global pronunciation ratings) were predicted, as complexifying tasks along resource-directing dimensions in pronunciation-unfocused tasks could draw learners’ attention away from phonological form because high task demands could pose serious limitations on learners’ ability to efficiently monitor their speech (Kormos, Reference Kormos1999, Reference Kormos2000). We thus expected Robinson’s (Reference Robinson and Robinson2001a, Reference Robinson and Robinson2011) cognition hypothesis not to hold for pronunciation accuracy (Kuiken & Vedder, Reference Kuiken, Vedder and Robinson2011; Mora et al., Reference Mora, Mora-Plaza and Bermejo Miranda2024). Such expectations were partly met, as learners’ VOT values were more target-like in the simple than the complex task (RQ1), suggesting that increased task complexity during authentic communication interfered with learners’ ability to focus on segmental features of speech (Derwing et al., Reference Derwing, Munro and Wiebe1998). In other words, without an explicit focus on phonetic form, task complexity may not help enhance L2 pronunciation accuracy (e.g., Crowther et al., Reference Crowther, Trofimovich, Saito and Isaacs2018). These findings align with Skehan’s (Reference Skehan2009, Reference Skehan and Bygate2015) and Kormos’ (Reference Kormos1999, Reference Kormos2000) theory of attentional resource competition during L2 speech performance. Due to limited attentional capacity, learners’ attentional resources might have been divided between lexical, grammatical, and pronunciation accuracy, coming into competition during task performance. Although this study cannot provide solid evidence for a lexico-grammatical-pronunciation accuracy trade-off, previous studies have demonstrated trade-offs within one dimension (e.g., syntactic complexity).
For example, Wang and Skehan (Reference Wang, Skehan and Skehan2014) and Skehan and Shum (Reference Skehan, Shum and Skehan2014) showed that subordination and phrasal complexity did not overlap when assessing the role of task structure. Because of the interdependence between linguistic areas, further studies should continue to investigate whether increasing task complexity might have a differential impact on different types of accuracy (i.e., lexical, grammatical, pragmatic, and pronunciation) while learners perform a communicative oral task.
For vowel production, no task complexity effects were found (RQ2). However, given the nature of the speech samples on which vowel quality was measured, and the fact that each subject contributed a differing number of vowel tokens and words in the simple and complex tasks, no clear interpretation can be drawn from the current findings. Indeed, the task complexity effects on /æ/ and /ʌ/ observed in both learners and NS could in fact be a consequence of the unbalanced distribution of vowel tokens across words generated by the spontaneity of the task. In addition, the malleability of vowel articulation in terms of the contrastiveness and nativelikeness measures we used, especially in advanced learners, may be too limited to be able to capture any differences resulting from task complexity manipulation in pronunciation-unfocused tasks. VOT, on the other hand, may be a more malleable and salient feature and more readily affected by attentional resources being available to focus on pronunciation.
Task complexity effects on comprehensibility and accentedness (RQ3) were small, although they pointed in the direction of these global dimensions being negatively affected by task complexity (contrary to what other studies had found, e.g., Crowther et al., Reference Crowther, Trofimovich, Saito and Isaacs2018). Interestingly, our analysis revealed differences in learners’ comprehensibility and accentedness attributable to task complexity despite there not being significant effects in terms of lexis, grammar, and fluency. The fact that increased task demands did not significantly affect learners’ comprehensibility can be partly related to the nature of comprehensibility as a multidimensional construct, including segmentals and suprasegmentals, fluency, lexical and grammatical accuracy and richness (e.g., Trofimovich & Isaacs, Reference Trofimovich, Isaacs, Isaacs and Trofimovich2016).
Regarding RQ4, acoustic and global measures of pronunciation accuracy were expected to be related in both tasks, especially in the complex task, where task complexity may pose greater demands on L2 pronunciation. Results revealed a weak-to-moderate relation between segmental accuracy, comprehensibility, and accentedness. Learners who produced longer VOT in initial oral stops were perceived to be more comprehensible and less foreign-accented (i.e., more native-like), in agreement with Riney and Takagi’s (1999) findings. Regarding vowel quality, larger Mahalanobis distances between contrastive vowels appeared to be related to higher comprehensibility and lower accentedness ratings, suggesting that the more distinctly learners produced the target vowel contrasts /iː/-/ɪ/ and /æ/-/ʌ/, the more comprehensible and less accented their speech was judged to be. This provides some support for the relationship between acoustic distances of contrastiveness in vowel production and NL ratings of comprehensibility and accentedness that previous research has found for measures of formant frequencies (Chan et al., 2016; Munro, 1993; Porretta, Kyröläinen & Tucker, 2015). Such associations were found to be stronger in the complex than in the simple task (see Crowther et al., 2018), especially for accentedness and VOT.
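The contrastiveness measure discussed above rests on the Mahalanobis distance, which scales the distance of a token from a category mean by that category's variability. The following is a minimal sketch of the textbook definition, assuming two-dimensional (F1, F2) measurements; the function names and the formant values are invented for illustration and do not reproduce the study's own acoustic pipeline or any normalization it applied.

```python
from statistics import mean


def mean_and_covariance_2d(points):
    """Return the mean and sample covariance terms of a list of (F1, F2) pairs."""
    n = len(points)
    if n < 2:
        raise ValueError("at least two tokens are needed to estimate covariance")
    m1 = mean(p[0] for p in points)
    m2 = mean(p[1] for p in points)
    s11 = sum((p[0] - m1) ** 2 for p in points) / (n - 1)
    s22 = sum((p[1] - m2) ** 2 for p in points) / (n - 1)
    s12 = sum((p[0] - m1) * (p[1] - m2) for p in points) / (n - 1)
    return (m1, m2), (s11, s12, s22)


def mahalanobis_2d(token, category_tokens):
    """Mahalanobis distance of one (F1, F2) token from a vowel category's tokens."""
    (m1, m2), (s11, s12, s22) = mean_and_covariance_2d(category_tokens)
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("covariance matrix is singular")
    d1 = token[0] - m1
    d2 = token[1] - m2
    # Quadratic form d^T * inverse(covariance) * d, written out for the 2x2 case.
    squared = (s22 * d1 * d1 - 2 * s12 * d1 * d2 + s11 * d2 * d2) / det
    return squared ** 0.5


if __name__ == "__main__":
    # Invented (F1, F2) values in Hz, for illustration only.
    category = [(650, 1700), (700, 1650), (680, 1750), (720, 1600)]
    token = (600, 1300)
    print(round(mahalanobis_2d(token, category), 2))
```

A larger value indicates that the token lies further from the contrasting category relative to that category's spread, which is the sense in which larger distances are read as more distinct vowel productions.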
In sum, the results of the current study revealed that increased task complexity interfered with learners’ ability to focus on laryngeal timing (VOT) as a segmental feature of speech, leading to less target-like /p, t, k/ productions in the complex task. However, no significant effects of task complexity were found for vowel production. The findings also indicated small effects of task complexity on comprehensibility and accentedness, suggesting a potential negative effect on these global dimensions. Furthermore, a link was observed between segmental accuracy and the global dimensions of L2 speech, with longer VOT associated with higher comprehensibility and lower accentedness. The associations between acoustic and listener-based measures were stronger in the complex task.
Several methodological limitations suggest future research directions. For example, we were unable to control for the distribution of vowel tokens and words across tasks due to differences in word frequency across tasks and participants. The study also presented variability in the type of phonetic contexts in which target vowels were produced and measured, due to the nature of the oral production task. Whether and how this might have affected the acoustic measures and the comparison between simple and complex tasks is an empirical question warranting future research. It is possible that a more controlled communicative task (e.g., a task imposing a limitation on the use or repetition of vowel tokens) could have yielded more consistent vowel production and a more reliable interpretation of the results. Future studies should try to use more controlled oral communicative tasks involving greater control over word frequency and use of cognates, and more control over the nature of the tokens and the phonetic contexts (e.g., a task requiring the use of adjectives with similar consonantal contexts). More broadly, more experimental research is needed to gain a better understanding of how the manipulation of other task features (e.g., repetition, modality) may affect L2 pronunciation in unfocused tasks. A follow-up to the current study would be to analyze potential trade-offs between lexico-grammatical accuracy and pronunciation during monologic/dialogic unfocused oral tasks (see Mora et al., 2024). Finally, in order to ensure the generalizability of the present findings, the study should be replicated with other groups of L2 learners of different ages, L1 backgrounds, proficiency levels, and experience. Controlling for individual differences in aptitude and speaking anxiety may also contribute to our understanding of the relation between task complexity and pronunciation.
Conclusion
The present study set out to explore the relationship between task complexity, pronunciation accuracy (vowel and consonant productions), and listener-based assessments of L2 speech (comprehensibility and accentedness) in pronunciation-unfocused tasks. The results suggest that complex tasks hindered the accurate production of English oral stops (i.e., VOT was less accurate in the complex than in the simple task) as well as comprehensibility and accentedness (i.e., speech was rated as less comprehensible and more accented by NL in the complex task), but no negative effects of task complexity were observed for vowel production.
In terms of pedagogical implications, future spontaneous tasks should be thoughtfully designed and manipulated to raise learners’ awareness of difficult L2 pronunciation targets while communicating. Increasing cognitive complexity in tasks whose pronunciation targets are essential for task completion (e.g., Mora-Plaza, 2023) is one way to promote a focus on phonetic form during interaction. Task-based pronunciation teaching offers a promising avenue for enhancing L2 pronunciation learning in communicative EFL classrooms.
Data availability statement
The experiment in this article earned Open Materials badges for transparent practices. The materials and data are available at https://www.sla-speech-tools.com/
Acknowledgments
The authors would like to thank Athenea Botey and Gonzalo Bermejo (University of Barcelona) for data transcription and annotation, Dr. Danielle Daidone (University of North Carolina Wilmington) for sharing and helping with Praat scripts to automatize acoustic analyses, Dr Joan Borràs-Comes (Laboratori de Fonètica “Eugenio Martínez Celdrán” at the University of Barcelona) for help with data visualization, and Dr Roger Gilabert (University of Barcelona) for expert advice on task complexity. The authors are also thankful to the project members Miren Adrian, Josh Frank, Natalia Fullana, Valeria Galimberti, and Gisela Sosa for their assistance in the study design, data collection, and analysis, and the three anonymous reviewers and handling editor for their insightful comments and helpful suggestions on earlier versions of the manuscript.
Funding statement
This study was supported by grants PID2019-107814GB-I00 and PID2022-138129NB-I00 from the Spanish Ministry of Science, Innovation and Universities, and by grant 2023SGR00303 from the Catalan Agency for Management of University and Research Grants (AGAUR).
Competing interest
The authors declare that they have no known competing financial, professional, or contractual interests or personal relationships that could appear to influence the work reported in this paper.
Appendices
Appendix A. Pre-task: Instructions and materials
Listen carefully to the description of the following characters and link them to the words that describe their personalities and professions by putting their numbers next to the words.
Appendix B. The dinner table task: Instructions and materials
In this speaking task, we ask you to organize a successful dinner party for six people.
Please read the information about each person that is coming to the party carefully.
Your goal is:
1. To justify why the following seating arrangement will not guarantee a successful dinner party.
2. To provide and justify a new seating arrangement that can guarantee a successful dinner party: (1) take the individual cards, (2) place them on the chairs of the new table.
Refer to characters with names and surnames.
Foster smooth and pleasant conversations between the guests.
*The printable materials can be found online: http://sla-speech-tools.com
Appendix C. Task performance questionnaire
Appendix D. List of target words by task, vowel, and consonant
Appendix E. Coding procedure to obtain CAF measures
The annotation process employed a TextGrid structure in Praat (Boersma & Weenink, 2015) with multiple tiers to capture various speech phenomena. The first tier contained the orthographic transcription, while the second tier distinguished between speech (“s”) and pauses (“p”). The third tier categorized different types of pauses using four labels: p = pauses, pf = filled pauses, pi = internal pauses, pfi = internal filled pauses. The fourth tier marked AS units, and dysfluencies were annotated with the codes -R- (repetition, restart, rephrasing, reformulation) and -S- (self-repair). Accuracy annotations included -L- (lexical error), -G- (grammatical error), and -P- (pronunciation error, e.g., phonemic substitutions). A visual representation is provided below.
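Independently of the figure, the label inventory described in this appendix lends itself to automatic tallying once annotations are exported from Praat. The sketch below is illustrative only: it assumes that pause labels are available as a plain list of interval labels and that dysfluency and accuracy codes appear inline in an annotated string. Both input formats and the function names are hypothetical and are not the scripts used in the study.

```python
import re
from collections import Counter

# Pause labels used on the third tier, as listed in the coding procedure.
PAUSE_LABELS = {"p", "pf", "pi", "pfi"}

# Dysfluency (-R-, -S-) and accuracy (-L-, -G-, -P-) codes.
CODE_PATTERN = re.compile(r"-([RSLGP])-")


def count_pauses(pause_tier_labels):
    """Count each pause type in a list of interval labels (assumed export format)."""
    return Counter(label for label in pause_tier_labels if label in PAUSE_LABELS)


def count_codes(annotated_text):
    """Count dysfluency and accuracy codes embedded in a string (assumed format)."""
    return Counter(CODE_PATTERN.findall(annotated_text))


if __name__ == "__main__":
    # Invented example input; empty labels stand for unlabeled intervals.
    pauses = count_pauses(["p", "pf", "p", "pi", "", "pfi", "p"])
    codes = count_codes("he is -R- he was a -L- cooker and -P- very -G- kind")
    print(dict(pauses))
    print(dict(codes))
```

Counts of this kind could then be normalized, for example per AS unit or per minute of speech, depending on how each CAF measure is defined.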
Appendix F. Parameter estimates of fixed effects models
Learners vs. NS’ oral stops (voice onset time)
Reference levels: speaker group = NS; consonant = /k/.
Learners’ oral stops (voice onset time)
Reference levels: task = complex; consonant = /k/; sequence = C > S.
Vowels (Mahalanobis distances)
Reference levels: task = complex; contrast = /æ/-/ʌ/; sequence = C > S.
Reference levels: task = complex; vowel = /ʌ/; sequence = C > S.
Comprehensibility
Reference levels: task = complex; sequence = C > S.
Accentedness
Reference levels: task = complex; sequence = C > S.