1. Introduction
Learning pronunciation is a crucial yet difficult aspect of second language (L2) acquisition. This is particularly the case in foreign language settings, where learners often have limited opportunities to practice their language, especially their oral skills. In large language classrooms, typical of Taiwanese and, more generally, Asian universities (Butler, 2011), learners may not be able to speak enough and, consequently, may receive little feedback on their pronunciation. It is therefore not rare to observe situations in which foreign language learners struggle to be understood because of their pronunciation, which may lead to increased anxiety (Baran-Łucarz, 2014). One way to enhance L2 learners’ pronunciation practice in a low-anxiety environment without making institutional changes is to rely on the ever-increasing availability of computer-assisted pronunciation training (CAPT) tools integrated with automatic speech recognition (ASR) (Bashori, van Hout, Strik & Cucchiarini, 2022; Neri, Cucchiarini, Strik & Boves, 2002). Although these systems do not replace human interaction, they offer valuable extra learning time and, in ideal situations, feedback on the quality of the learners’ pronunciation (Levis, 2007).
In the last few years, the ubiquity of ASR systems, such as voice assistants and dictation programs, has led to experimental studies assessing their potential in L2 pronunciation learning (Chen, Inceoglu & Lim, 2020; Inceoglu, Lim & Chen, 2020; Liakin, Cardoso & Liakina, 2015; McCrocklin, 2016, 2019; Mroz, 2018, 2020). What remains largely unknown, however, is how well ASR systems recognize accented speech. This is particularly important because the feedback that language learners receive (i.e. ASR-based written transcription of their oral production) may not match their level of intelligibility to native (L1) speakers. Accordingly, the current study investigated how the intelligibility of Taiwanese L2 English speech differed when assessed by native English speakers and ASR technology, and whether native speakers and ASR technology encountered the same type of intelligibility issues when assessing L2 English speech.
2. Literature review
2.1 L2 speech intelligibility
Accented speech is a normal and unavoidable aspect of L2 speech production (Flege, Munro & MacKay, 1995). Although many L2 learners and language teachers still view speaking without an accent as a desirable goal, pronunciation experts emphasize that instruction should be concerned not with nativelikeness but rather with intelligibility and comprehensibility (Derwing & Munro, 2015; Levis, 2005). In other words, the primary focus should be on L2 learners being understandable regardless of how nativelike they sound, rather than on attempts to remove all traces of their native language in their speech. In L2 speech research, intelligibility is a construct that is differentiated from comprehensibility. Comprehensibility is defined as the perceived degree of effort required by a listener to understand what a speaker says, while intelligibility refers to the extent to which a listener understands what a speaker says (Munro & Derwing, 1995). Both are dimensions concerned with comprehension and communication, but they vary in their operationalizations. While comprehensibility is assessed through listener-based scalar ratings (i.e. very easy to very difficult to understand), intelligibility is measured, for example, by asking listeners to transcribe a speaker’s production, answer comprehension questions, or respond to true/false statements, thereby providing evidence of understanding (Derwing & Munro, 2015; Kang, Thomson & Moran, 2018). The difference between comprehensibility and intelligibility is important, because someone can be highly intelligible yet be perceived as difficult to understand.
Over the past decades, studies have demonstrated that accented speech does not necessarily lead to reduced intelligibility, and that although intelligibility and comprehensibility are closely related, the two constructs are not fully correlated (Derwing & Munro, 1997; Kennedy & Trofimovich, 2008). Research shows that intelligibility can be affected by several factors. In Munro and Derwing’s (1995) study, L1 English speakers transcribed word for word the utterances of Chinese L2 speakers of English. Results revealed that 28% of the listeners’ intelligibility scores were found to be correlated with phonemic errors (i.e. “deletion or insertion of a segment, or the substitution of a segment that was clearly interpretable as an English phoneme different from the correct one”; p. 80), whereas intelligibility was not correlated with phonetic errors (i.e. “production of a segment in such a way that the intended category could be recognized but the segment sounded noticeably non-native”; p. 80). Moreover, transcription errors were found to consist of substitutions (29%), omissions of content (24%) and function words (21%), regularizations (i.e. corrections of nontarget-like forms) (14%), and novel words (i.e. insertion of words with no phonological resemblance to words in the utterance) (12%). Kennedy and Trofimovich (2008) also showed that listeners’ experience with non-native speech can influence how intelligible they find speech, and that less intelligible speech is associated with sentences with less semantic context (i.e. semantically anomalous sentences) regardless of the listeners’ experiences with L2 speech.
Importantly, research indicates that explicit pronunciation instruction can lead to improvements in intelligibility (and comprehensibility) (Derwing, Munro & Wiebe, 1997), with stronger effects when learners’ pronunciation awareness is raised (Kennedy & Trofimovich, 2010). However, research also reveals that L2 learners’ self-assessment of their speech might not be very accurate (Li, 2018; Tsunemoto, Trofimovich, Blanchet, Bertrand & Kennedy, 2022). In particular, L2 learners at the low end of the comprehensibility (and accentedness) scale were found to mostly overestimate their performance, with particularly strong effects for native Chinese learners of English (Trofimovich, Isaacs, Kennedy, Saito & Crowther, 2016). Considering the misalignment between learners’ self-assessment and native listeners’ judgments of their pronunciation, and the potential resulting communication problems, it would be very useful to raise learners’ awareness of this mismatch. With limited time for explicit pronunciation instruction in English as a foreign language (EFL) settings, CAPT – and, in particular, autonomous ASR practice as a way to raise pronunciation self-awareness – might contribute to pronunciation improvement.
2.2 Automatic speech recognition in L2 speech research
ASR, defined by Levis and Suvorov (2020: 149) as “an independent, machine-based process of decoding and transcribing oral speech (…) usually in the form of a text,” has tremendous potential for L2 learning and teaching. For this reason, over the last decade, an increasing number of studies have explored how ASR can contribute to pronunciation development and L2 speech research. As noted, from a general L2 acquisition perspective, ASR is a promising tool that helps increase speaking practice opportunities, which is especially needed in foreign language learning settings where speaking opportunities are often limited. ASR can also foster autonomous pronunciation learning (McCrocklin, 2016) and lead to increased motivation, self-confidence, and willingness to communicate (Mroz, 2018). Studies have also reported positive attitudes from learners using ASR, as this technology allows for the production of more output in low-anxiety environments (Ahn & Lee, 2016; Bashori et al., 2022; Chen et al., 2020; Chiu, Liou & Yeh, 2007; Guskaroska, 2019; Wang & Young, 2015). For instance, the EFL learners in Ahn and Lee (2016) who used an ASR learning system for self-regulated speaking practice in and out of the classroom for two weeks reported overall positive attitudes toward the application, especially regarding the provision of immediate feedback and the interactive nature of the simulated role-play.
ASR technology has become ubiquitous and can be found in computer-system interfaces across areas as diverse as health care, consumer services, and telecommunication (Levis & Suvorov, 2020). In education, ASR has been built into different types of programs, each with its own technology, advantages, and limitations. These include commercial computer-assisted language learning systems, such as Tell Me More and Rosetta Stone, and an increasing number of web-based CAPT systems focusing on both segmental and suprasegmental features. Furthermore, ASR technology is at the core of dictation (also called voice-to-text) programs and free personal assistants, such as Google Assistant, Apple’s Siri, Amazon’s Alexa, and Microsoft’s Cortana. In terms of L2 learning and teaching, the portability and ubiquity of smartphones make ASR accessible to an ever-increasing number of language learners, and although ASR dictation programs do not provide explicit feedback on pronunciation, the written transcriptions they provide help learners monitor their speech and identify mispronunciations.
Recent studies have started to examine how ASR tools can enhance the learning of segmentals in instructional settings (Chen et al., 2020; García, Nickolai & Jones, 2020; Guskaroska, 2019; Inceoglu et al., 2020; Liakin et al., 2015; McCrocklin, 2019). Generalizations are currently difficult to make due to the small number of studies and the variety of L1s/L2s investigated. However, a first observation is that ASR practice seems to be beneficial for the development of some L2 sound contrasts. For instance, Liakin et al. (2015) found that learners who completed pronunciation activities using ASR improved their production of the French /y-u/ contrast significantly more than learners who participated in the same pronunciation practice but with immediate oral feedback from their teacher instead of ASR. Similarly, García et al.’s (2020) 15-week-long classroom study revealed that a group of L2 Spanish learners who received ASR-based pronunciation training improved more on certain phonemes than the instructor-led pronunciation training group; the latter group, however, experienced more gains in comprehensibility.
Some of these studies have also shed light on the frustration that language learners felt when using ASR. Half the participants in McCrocklin (2016) reported being frustrated by low rates of recognition or lack of convenience. Similarly, responses from a post-test questionnaire in Chen et al. (2020) revealed that only 22% of the learners intended to continue using ASR to work on their pronunciation. In their responses, learners reported that ASR did not work well because of the current state of the technology (22.5%), their pronunciation (6%), or a combination of both (71.5%). The study also showed that, on a 10-point Likert scale (with 10 being “excellent”), learners gave an average of 4.9 for how useful ASR was to practice English pronunciation and an average of 5.5 for how well ASR recognized their speech. The perceived usefulness of ASR might depend on a learner’s L1, L2, and proficiency level. A recent qualitative study on L2 French looked at how an ASR-based dictation program (i.e. Google Assistant) could help raise learners’ awareness of how intelligible their speech is (Mroz, 2018). The study revealed that most learners regarded ASR as a useful pronunciation diagnostic tool and found that ASR output was a credible representation of their intelligibility.
As noted above, an increasing number of studies have explored the benefits of ASR-based dictation programs for the learning and teaching of L2 pronunciation. Further research is needed, however, on how well ASR-based dictation programs recognize non-native speech. This is particularly relevant as the feedback – that is, the ASR-based written transcription of their oral production – that language learners receive might not match their intelligibility to native speakers. Twenty-two years ago, Derwing, Munro and Carbonaro (2000) compared ASR output with native listeners’ intelligibility transcriptions of L1 and L2 English speakers (L1 Spanish and L1 Cantonese) who used the dictation program NaturallySpeaking from Dragon Systems. Their results showed that the ASR program recognized L2 speech less accurately than L1 speech, recognizing approximately 24–26% fewer words produced by L2 speakers than the native listeners did. Their study also highlighted a discrepancy between the recognition errors encountered by L1 listeners and ASR. The last two decades have witnessed significant improvements in the way ASR systems recognize speech (Litman, Strik & Lim, 2018), but very little new research has been undertaken on the recognition of L2 speech by ASR technology compared to native listeners. A recent exception is McCrocklin and Edalatishams (2020), who partially replicated Derwing et al.’s (2000) study using Google’s ASR. Although the ASR recognition score for L1 speech was higher than for L2 speech, the difference between the language groups was small (e.g. 90.99% for L1 Mandarin Chinese and 96.20% for L1 English). In addition, the ASR recognition scores were similar to those of the L1 listeners, with a maximum mean difference of 2.04% for the Chinese group – a much smaller difference than what Derwing et al. (2000) had observed (i.e. a 22.54% difference).
Accordingly, the goal of the current study was to compare how 12 native listeners of English and Google Assistant ASR technology recognize isolated words and sentences produced by Taiwanese low-intermediate learners of English. The research questions that guided this study were:
1. Does intelligibility of L2 English speech (isolated words and sentences) differ when assessed by native speakers versus ASR technology?
2. Do native speakers and ASR technology encounter the same type of intelligibility issues when assessing L2 English speech?
3. Methodology
3.1 Participants
3.1.1 L1 listeners
Twelve listeners (six female) were recruited to participate in intelligibility judgment tasks. Their ages ranged from 28 to 48 years, with an average age of 34.5 years. Eight participants were from North America, three from Australia, and one from the UK but living in Australia. They all reported good knowledge of a second language, but none spoke Chinese or had previously lived in a Chinese-speaking country. On a scale from 1 (“not familiar at all”) to 5 (“extremely familiar”), they reported very low familiarity with the Chinese language (M = 1.6, SD = 0.79) and moderate familiarity with Chinese-accented English (M = 3.4, SD = 0.92). However, on a scale from 1 (“not at all”) to 5 (“very frequent”), they reported very frequent contact with accented English (M = 4.3, SD = 0.75).
3.1.2 L2 speakers
The data used in this study consisted of speech produced by four Taiwanese EFL learners (one female: “Speaker 1”; three male: “Speakers 2, 3, and 4”) studying at a technological university in Taipei. They were part of a larger study examining the effects of ASR practice on pronunciation learning and were randomly chosen for the current study from a subgroup of participants whose recordings were of high quality (i.e. the microphones captured the speakers’ voices very well, and there was no breakdown with ASR turning off in the middle of a sentence). Three of the participants were 19 years old and one was 20 years old; they reported having studied English for 8 to 12 years (average = 9.5 years). They were all enrolled in an English language course that met twice a week (total of 2.5 hours) for 18 weeks. The course was taught by a Taiwanese native speaker and targeted reading and listening skills, with a focus on the development of grammar and vocabulary. Limited time was dedicated to pronunciation, but parts of the homework assignments included pronunciation practice with ASR technology, and the recordings from this practice served as the speech stimuli for the current study. On a scale from 1 (“very poor”) to 10 (“nativelike”), participants self-reported average English proficiency levels of 3.75 in reading, 1.75 in writing, 3 in listening, and 2.75 in speaking. None of them had lived in an English-speaking country, and they estimated their exposure to English outside class at around 1.5 hours per week (e.g. listening to songs and watching movies or TV series); none of the learners reported speaking English outside class. They also estimated their degree of accentedness as 6.25 (1 = “no accent at all”, 9 = “very strong accent”) and their degree of comprehensibility as 3.75 (1 = “people cannot understand my pronunciation very easily”, 9 = “people understand my pronunciation very easily”).
3.2 Speech stimuli
The speech samples used in the current study were elicited from monosyllabic word-reading and sentence-reading tasks. These tasks were assigned as homework practice and served as autonomous training (see Chen et al., 2020). The word list consisted of 24 sets of minimal pairs targeting vowels known to be problematic for Taiwanese learners of English (i.e. /ɪ-i/, /æ-ɛ/) and thus directing close attention to these contrasts. These minimal pairs were presented in groups of four across the six training sessions and were not repeated, leading to a total of 48 monosyllabic words across tasks. The sentence list consisted of 24 short sentences targeting difficult consonants (i.e. /θ-s/, /tʃ-dʒ/, /b-v/, /r-l/), such as in the sentence “It’s time to collect your vote.” These sentences ranged in length from 3 to 9 words (average = 6.5) and were presented in groups of four, again with no repetition across sessions (see IRIS repository for the list of stimuli).
3.3 Procedures
3.3.1 Recording by L2 speakers
The four L2 speakers recorded the stimuli on their Android phones using Google Assistant. The recordings were part of weekly homework assignments over a period of three weeks. For each assignment, students were asked to practice reading the sentences and minimal pairs described above and a short text (not analyzed in this study). They had four days to submit their recordings. Because this exercise aimed to promote learner autonomy and increase speaking practice, the tasks were recorded outside of class and submitted to the instructor. Students were shown how to install a screen recorder that captures both the screen (i.e. ASR feedback) and voice. These video recordings helped to ensure that the students completed the activity and made it possible to monitor the quality of the feedback they received. The learners were not given a limit regarding how many minutes they should spend on the activity or how many times they should repeat the sentences and words. Attempts to produce a stimulus ranged from a single attempt (with either correct or incorrect ASR output) to up to three minutes of practice on the same word. Figure 1 shows a screenshot of a participant’s video-recorded ASR practice. In this example, it took seven attempts for the learner to get “It’s time to collect your vote” recognized, his pronunciation of “vote” being first understood as “boat,” then “vault,” and then “road.”
3.3.2 Preparation of speech samples for rating
The L2 speakers’ six ASR video-recorded practices were converted to WAV files. Each sentence (4 sentences × 6 sessions) and each word from the four minimal pairs (8 words × 6 sessions) were extracted from the audio files and saved as a separate file. In most cases, the learner’s first attempt to produce the stimulus was selected, because it was deemed to reflect the learner’s current pronunciation – that is, how they would produce a word without further prompting or feedback. In a few cases, a production other than the first was chosen because of some irregularity in either the recording or the learner’s production (e.g. coughing, door slam, ASR not reacting). The files were then normalized for peak intensity to reduce differences in perceived loudness, and 350–400 milliseconds of silence were added at the beginning and end of each speech sample.
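The software used for this preprocessing step is not specified above. Purely as an illustration, the following sketch shows how peak normalization and silence padding could be scripted in Python with the pydub library (an assumption, not the authors’ actual tool); the file names are hypothetical.

from pydub import AudioSegment
from pydub.effects import normalize

def prepare_clip(in_path: str, out_path: str, pad_ms: int = 375) -> None:
    """Peak-normalize an extracted speech sample and pad both ends with silence."""
    clip = AudioSegment.from_file(in_path)                # extracted WAV clip
    clip = normalize(clip)                                # peak intensity normalization
    silence = AudioSegment.silent(duration=pad_ms,        # ~350-400 ms of silence
                                  frame_rate=clip.frame_rate)
    (silence + clip + silence).export(out_path, format="wav")

# hypothetical usage:
# prepare_clip("speaker1_sentence01_raw.wav", "speaker1_sentence01.wav")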
3.3.3 Intelligibility rating by ASR
The coding of how the ASR program interpreted the L2 speakers’ speech was done at the same time as the speech sample preparation for the L1 listeners’ rating task. This ensured that the exact same stimulus, either word or sentence, was used for the intelligibility analysis by ASR and native listeners. The coding was done by examining the videos of the L2 speakers’ practices (see Figure 1) and reporting the ASR output. Because ASR was used in real time, the process was dynamic, and the final output was sometimes not generated immediately. In other words, the intermediate result of ASR processing was displayed shortly before being replaced by the final ASR decision. For the monosyllabic word task, the analysis focused on the first production of the word, ignoring learners’ repetitions and thus potential impacts on recognition. For the sentence task, the fact that the ASR algorithm relies on some linguistic context is not considered a problem for the current study because L1 listeners also rely on context (Kennedy & Trofimovich, 2008); thus, the analysis focused on the final transcription provided by ASR.
3.3.4 Intelligibility rating by L1 listeners
The rating test, composed of two tasks, was administered remotely using Qualtrics. In the first task, listeners were presented with 24 sentences and were asked to transcribe them (in standard orthography) as accurately as possible. They were told that they could listen to each speech sample only once and were encouraged to guess when they were unsure. Directly after transcribing each sentence, listeners made a judgment of comprehensibility on a 9-point Likert scale (1 = “extremely difficult to understand” to 9 = “extremely easy to understand”) (Derwing & Munro, 1997). The sentences produced by the four L2 speakers were distributed among the 12 listeners so that each listener was presented with (a) each sentence only once and (b) each L2 speaker an equal number of times. In other words, there were 12 unique presentation lists, with each L2 speaker’s production assessed by three L1 listeners each time (e.g. sentence 1 produced by Speaker 1 was rated by Listeners 1, 2, 3; sentence 2 produced by Speaker 1 was assessed by Listeners 4, 5, 6, etc.). Speaker 4 failed to produce the sentence “he is thinking” with good enough audio quality (i.e. brief background noise); the analysis for this participant was, therefore, based on 23 sentences, totaling 151 words instead of 154 words. In the second task, L1 listeners transcribed a total of 184 words (i.e. Speakers 1 and 2 produced all 48 words; Speakers 3 and 4 failed to produce 1 and 7 words, respectively) and were instructed to write down only English words and to guess when unsure. Again, the stimuli were presented only once. The two tasks were preceded by two practice samples, the order of stimuli presentation was randomized, and the pace of the experiment was controlled by the L1 listeners. The whole experiment took an average of 30 minutes.
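The counterbalancing just described amounts to rotating the sentence-to-speaker assignment across listener groups. The sketch below is our own illustration of one such rotation scheme (hypothetical code, not the actual materials): every listener hears each sentence once and each speaker six times, and every recording is rated by exactly three listeners.

from collections import defaultdict

N_SENTENCES, N_SPEAKERS, N_LISTENERS = 24, 4, 12

# Listeners are grouped into four groups of three; each group follows a
# rotated sentence-to-speaker assignment, so every recording gets three raters.
lists = defaultdict(list)                        # listener -> [(sentence, speaker)]
for listener in range(1, N_LISTENERS + 1):
    group = (listener - 1) // 3                  # 0, 1, 2 or 3
    for sentence in range(1, N_SENTENCES + 1):
        speaker = (sentence - 1 - group) % N_SPEAKERS + 1
        lists[listener].append((sentence, speaker))

# Sanity checks: every listener hears each speaker six times, and every
# (sentence, speaker) recording is heard by exactly three listeners.
assert all(
    sum(1 for _, sp in items if sp == target) == 6
    for items in lists.values() for target in range(1, N_SPEAKERS + 1)
)
coverage = defaultdict(int)
for items in lists.values():
    for recording in items:
        coverage[recording] += 1
assert set(coverage.values()) == {3}

With this rotation, listeners 1–3 rate sentence 1 by Speaker 1 and listeners 4–6 rate sentence 2 by Speaker 1, matching the example given above.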
3.4 Analysis
Intelligibility was operationalized as the accuracy of the L1 listeners’ transcriptions and ASR output when compared to the original list provided to the L2 speakers. Each speech sample was coded by two of the authors for exact word match, with word omissions and substitutions considered as errors, except in the case of substitution with homophonous words (e.g. feat and seen instead of feet and scene), ungrammatical transcriptions (e.g. your and its instead of you’re and it’s), and omissions of contractions (e.g. it is and they are instead of it’s and they’re). For the word transcription task, one point was awarded if the word was accurately recognized and zero otherwise. These scores were averaged across L1 listeners to obtain a general score for each single word, which was then totaled across the 48 words to yield a total percentage word intelligibility score for each of the four L2 speakers. For the sentence task, the L1 listeners’ scores and ASR intelligibility scores were calculated as the number of correctly transcribed words over the total number of words. For instance, if the sentence “It’s time to collect your vote” was transcribed as “It’s time to collect your boat,” the score would be 83% (5 out of 6). Similar to the word task, percentage sentence intelligibility scores were calculated for each speaker by averaging the intelligibility scores across L1 listeners and sentences. Error types were coded by the three authors, with any disagreement discussed until mutual agreement was reached. Comprehensibility ratings for the sentence task were calculated by averaging the L1 listeners’ rating scores across speech samples for each of the four L2 speakers. Cohen’s kappa results were interpreted as follows: 0.01–0.20, none to slight agreement; 0.21–0.40, fair agreement; 0.41–0.60, moderate agreement; 0.61–0.80, substantial agreement; and 0.81–1.00, almost perfect agreement.
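To make the scoring concrete, the following sketch (our own simplification in Python; the homophone and contraction allowances described above were applied by hand in the study) computes a sentence score as the proportion of target words recovered, using the worked example from the text, and shows the binary-then-averaged scoring used for the word task.

from difflib import SequenceMatcher

def sentence_score(target: str, transcription: str) -> float:
    """Proportion of the target words correctly transcribed; omissions and
    substitutions count as errors (homophones are not handled here)."""
    ref, hyp = target.lower().split(), transcription.lower().split()
    matched = sum(m.size for m in SequenceMatcher(None, ref, hyp).get_matching_blocks())
    return matched / len(ref)

# worked example from the text: "boat" for "vote" -> 5/6, i.e. roughly 83%
print(sentence_score("it's time to collect your vote",
                     "it's time to collect your boat"))

# word task: each item is scored 1/0 per listener and then averaged across
# the 12 listeners, e.g. a word recognized by 9 of 12 listeners scores 0.75
word_scores = [1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1]
print(sum(word_scores) / len(word_scores))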
4. Results
4.1 Interrater reliability
Fleiss’s kappa was used to determine the level of interrater reliability among the 12 L1 listeners. The results revealed a moderate agreement among the listeners (κ = 0.46, z = 50.24, p < 0.001), with moderate agreement for Speaker 2 (κ = 0.44, z = 24.45, p < 0.001), Speaker 3 (κ = 0.46, z = 25.46, p < 0.001), and Speaker 4 (κ = 0.54, z = 27.85, p < 0.001), and fair agreement for Speaker 1 (κ = 0.36, z = 20.23, p < 0.001) (Landis & Koch, 1977). To further examine the interrater reliability among the listeners and ASR, a series of unweighted Cohen’s kappa tests was run for pairwise analyses. The results ranged from 0.164 (slight agreement) to 0.674 (substantial agreement). L1 listeners showed moderate (49 instances) to fair (16 instances) agreement among themselves, with one case of substantial agreement between two listeners. Conversely, three listeners showed slight agreement with ASR, while the rest showed only fair agreement.
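For readers wishing to reproduce this type of reliability analysis, the sketch below shows how Fleiss’s kappa and pairwise unweighted Cohen’s kappa can be computed in Python with statsmodels and scikit-learn. The data here are random placeholders, not the study’s actual coding sheets.

import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
# one row per word item, one column per L1 listener;
# 1 = word transcribed correctly, 0 = not (placeholder data)
listener_codes = rng.integers(0, 2, size=(184, 12))
asr_codes = rng.integers(0, 2, size=184)          # placeholder ASR codes

# Fleiss's kappa across the 12 listeners
counts, _ = aggregate_raters(listener_codes)      # items x categories table
print("Fleiss's kappa:", fleiss_kappa(counts))

# pairwise unweighted Cohen's kappa, e.g. listener 1 vs. ASR
print("Cohen's kappa:", cohen_kappa_score(listener_codes[:, 0], asr_codes))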
4.2 Intelligibility of isolated words
Overall, ASR recognized 40.81% of the L2 speakers’ word productions, while the L1 listeners’ average intelligibility score was 38.62% (Table 1). The results for each individual speaker revealed interesting differences, with Speakers 1 and 2 receiving higher scores from ASR than from L1 listeners, and with the production of Speakers 3 and 4 assessed higher by L1 listeners than by ASR. Interestingly, Speaker 1 received the highest ASR recognition score (47.92%) among the four speakers, as well as the lowest listeners’ intelligibility score (36.46%). The reverse pattern was observed for Speakers 3 and 4, who received the two lowest ASR recognition scores (36.17% and 33.33%, respectively) and the highest listeners’ intelligibility scores (40.78% and 40.24%, respectively). Because of the non-parametric nature of the ASR data, Kendall’s tau-b correlations were run to determine the relationship between the ASR and L1 listeners’ scores for all the words and for each speaker (see the illustrative sketch following the note to Table 1). There was a statistically significant weak positive correlation between the two scores for Speaker 1, τb(48) = 0.403, p < 0.001, and Speaker 3, τb(47) = 0.381, p = 0.003. However, the correlation was not significant for Speaker 2, τb(48) = 0.144, p = 0.256, or Speaker 4, τb(41) = 0.138, p = 0.318.
Note (Table 1). a The analyses for Speakers 3 and 4 were based on 47 and 41 words, respectively. b Maximum rating of 9. c Standard deviations represent the dispersion of scores among L1 listeners (not calculable for ASR). d Standard deviations represent the dispersion of the scores across sentences (for ASR and L1 listeners).
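The sketch referred to above illustrates, with invented values, how a binary per-word ASR score can be correlated with the proportion of the 12 listeners who recognized each word, using Kendall’s tau-b (scipy’s default variant).

from scipy.stats import kendalltau

# per-word scores for one speaker (invented values):
# the ASR score is binary, the listener score is the proportion of the
# 12 listeners whose transcription matched the target word
asr_scores      = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
listener_scores = [1.00, 0.25, 0.67, 1.00, 0.00, 0.33, 0.58, 0.08, 0.92, 0.50]

tau, p = kendalltau(asr_scores, listener_scores)   # tau-b is scipy's default
print(f"tau-b = {tau:.3f}, p = {p:.3f}")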
To shed light on the (mis)alignment between ASR and L1 listeners, the 184 words produced by the four speakers in the word task were categorized according to (1) whether or not they were accurately recognized by ASR, and (2) whether ASR agreed with all, half or more, less than half, or none of the listeners. Figure 2 illustrates a continuum of (mis)alignment; the left column shows that ASR fully misaligned with all 12 listeners on 46 words: 13 of these words were recognized by ASR but by none of the listeners, and 33 words were recognized by all listeners but failed to be recognized by ASR. An explanation for these misalignments is not readily apparent, as they were not caused by consistent patterns with specific words (e.g. one word never recognized by ASR) or by one of the four speakers. Conversely, the right column reports 11 cases of full alignment between all the listeners and ASR, a low number, but not surprising considering the moderate agreement amongst listeners. In addition, the percentage of words recognized by ASR within each column increases steadily across the columns (i.e. 28%, 34%, 56%, 64%), indicating that the more listeners there are who recognize a word, the better it tends to be recognized by ASR. In other words, ASR is more likely to recognize words that are correctly identified by many listeners.
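To make this categorization concrete, the sketch below (with invented data, not the study’s codes) bins each word by ASR correctness and by how many of the 12 listeners ASR agreed with, following the four agreement bands described above.

from collections import Counter

def band(n_agree: int, n_listeners: int = 12) -> str:
    """Label how many listeners ASR agreed with on a given word."""
    if n_agree == 0:
        return "none"
    if n_agree < n_listeners / 2:
        return "less than half"
    if n_agree < n_listeners:
        return "half or more"
    return "all"

# invented data: for each word, ASR correctness (1/0) and the 12 listeners' 1/0 codes
words = [
    (1, [1] * 12),          # ASR and all listeners correct -> full alignment
    (0, [1] * 12),          # recognized by all listeners but not ASR -> full misalignment
    (1, [0] * 12),          # recognized by ASR only -> full misalignment
    (1, [1] * 7 + [0] * 5), # partial alignment
]

counts = Counter(
    (asr, band(sum(1 for code in listeners if code == asr)))
    for asr, listeners in words
)
print(counts)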
4.3 Intelligibility of sentences
The overall ASR recognition score for the sentences was 75.52%, with relatively large differences across sentences. By comparison, the L1 listeners’ overall intelligibility score was higher, at 83.88%, also with sizable variation across sentences. Contrary to the observation in the word task, all ASR recognition scores for sentences were lower than the L1 listeners’ scores, with a maximum ASR score of 79.08% (Speaker 1) and a minimum L1 listeners’ score of 79.80% (Speaker 3). A series of Pearson correlations was conducted to explore the relationship between the ASR and L1 listeners’ scores. The results showed that the scores were moderately positively correlated for Speaker 1, r(24) = 0.692, p < 0.001, and for Speaker 2, r(24) = 0.555, p = 0.005, but not for Speaker 3, r(24) = 0.102, p = 0.63, or Speaker 4, r(23) = 0.015, p = 0.945.
4.4 Type of intelligibility issues
The second goal of this study was to further explore the types of intelligibility issues encountered by L1 listeners and ASR technology. To do so, the intelligibility data (i.e. ASR output and listeners’ transcriptions) for both the monosyllabic word task and the sentence task were analyzed. For the word task, errors were coded according to one of the categories reported in Table 2. The analysis revealed that the most common errors were due to an incorrect vowel, both with ASR (47.7% of the total errors) and with L1 listeners (53.2%), confirming that the Taiwanese L2 speakers struggled with pronunciation of the target vowels. The ASR data also showed that the second most common type of error involved instances in which both the vowel and one consonant (either initial or final) were incorrect (20.6%), while the third most common errors were instances with a correct vowel but an incorrect consonant (17.8%). These two categories were also the second and third most common error types for L1 listeners, albeit in slightly smaller proportions (17.4% for both). Finally, approximately 14% (for ASR) and 11.6% (for L1 listeners) of error types were instances of more complex errors, such as the addition of a segment, missing consonants, or a combination of multiple errors.
Note. ASR = automatic speech recognition.
The errors in the sentence task were tallied and organized into categories adapted from Derwing et al. (2000) (Table 3). Because our analysis focused on the intelligibility transcriptions compared to the original list of sentences (as opposed to the phonemic transcriptions of the speakers’ speech), their category “normalizing errors” (i.e. in which an incorrectly produced word was repaired) was not included. Overall, the type and distribution of errors were similar across ASR and L1 listeners, with three categories totaling 75.1% (for ASR) and 79.6% (for L1 listeners) of the errors. Specifically, most of the errors for both ASR and L1 listeners concerned two or three segment substitutions (32.0% and 33.1%, respectively), followed by one segment substitution (26.1% and 31.5%, respectively), and word substitution (17.0% and 12.3%, respectively). In addition, the percentages of extra words, syllables, and segments were equally small (2.5% to 3.9%) for the two groups. Finally, word omission was more frequent for ASR (10.5%) than for L1 listeners (6.3%), while the percentage of segment omission was higher for listeners (6.6%) than for ASR (1.3%).
5. Discussion
The first goal of the current study was to explore whether 12 L1 English listeners and Google’s ASR program recognized productions of L2 English (i.e. individual words and short sentences) in the same way. Results revealed that the overall word recognition scores were quite similar between the L1 listeners (38.62%) and ASR (40.81%), which could initially be interpreted as positive for the use of ASR as a pronunciation practice tool. However, individual scores for each of the L2 speakers highlighted stark contrasts: the two speakers (1 and 2) who received higher intelligibility scores from ASR received the lowest intelligibility scores from L1 listeners. In fact, when ranking speakers based on their intelligibility, we observed almost opposite patterns, from higher to lower scores, with Speakers 1, 2, 3, and 4 for ASR and Speakers 3, 4, 2, and 1 for L1 listeners. The reason is unclear, and watching the video practices of the speakers did not reveal any specific strategy (e.g. exaggerated, slow enunciation) used by Speakers 1 and 2 that would have explained why they were more successful with ASR. A second important finding of the word task analysis is that the ASR and L1 listeners’ scores correlated for only two of the L2 speakers: Speaker 1, with the highest percentage difference between the ASR and L1 listeners’ scores (11.5%), and Speaker 3, with the smallest difference (4.6%). This suggests that despite relatively similar scores, the intelligibility ratings by ASR of Speakers 2 and 4 did not mirror L1 listeners’ ratings. Importantly, it also indicates that the extent to which Google’s ASR can approximate L1 listener performance at recognizing EFL learners’ speech may depend on individual speakers and listeners.
This study showed that, overall, there were more cases where the recognition by ASR was misaligned with that of human raters; that is, 60% of the time, ASR was not aligned with any of the listeners or was aligned with less than half of them. Interestingly, the proportion of cases where ASR and the listeners aligned in recognizing words (vs. misrecognizing words) grew when the number of listeners who recognized the words was larger. The fact that the listeners recruited for this study all reported familiarity with accented English is a factor to consider for future studies, as results might have been different had the listeners been less familiar with accented speech. Yet, considering that Google’s ASR technology relies on deep learning architectures (Levis & Suvorov, 2020) and extensive model training on very large speech corpora, comparing ASR technology to listeners accustomed to accented speech may be considered more valid. Likewise, an item-by-item analysis of the (mis)alignment between ASR and L1 listeners was too complex for the current article, considering how speech perception is related to the listeners (who moderately agreed among themselves), the individual speakers, and the lexical items. Yet, considering the current preliminary findings and the paucity of research looking at L2 speech perception by human listeners versus ASR, further investigation of the sources of (mis)alignments seems warranted.
Turning to the findings for the sentence task, the current study showed that L2 speakers’ utterances were better recognized (i.e. more intelligible) by the L1 listeners (83.88%) than by the ASR program (75.52%). This pattern was observed across the four speakers, but with differences between the two scores ranging from 3.57% (Speaker 3) to 15.30% (Speaker 2). Importantly, significant positive correlations between ASR and L1 listeners were found for only two of the participants: Speaker 1 (as in the word task) and Speaker 2. These results are not entirely compatible with those of McCrocklin and Edalatishams (2020), who reported that, for their 10 Mandarin Chinese speakers, Google’s recognition scores (i.e. 90.99%) were correlated with listener recognition scores (i.e. 88.95%). This difference, along with their ASR recognition scores being 15% higher than in the current study, could be attributed to differences in proficiency levels between our lower-intermediate EFL participants and their upper-intermediate learners in the United States. Another possibility is that the sentences we used targeted difficult phonemes and were, therefore, more challenging to produce and to recognize. Interestingly, the ASR scores observed in the current study were very close to those reported by Derwing et al. (2000) (i.e. 72.45%) with advanced L2 speakers in Canada. The fact that these two groups of L2 speakers obtained similar recognition scores when using ASR – despite a large difference in their proficiency and comprehensibility scores – may be directly due to technological advancements in the recognition of L2 speech by ASR over the past two decades, indicating the greater potential that current ASR technology offers for lower-proficiency learners.
Additionally, the intelligibility scores in the sentence task (ASR: 75.52%, L1 listeners: 83.88%) were approximately double those obtained in the isolated word task (ASR: 40.81%, L1 listeners: 38.62%). This difference can be explained by the importance of linguistic context in speech perception. Kennedy and Trofimovich (2008), for instance, pointed out that decontextualization can lower intelligibility. An additional explanation is that the word list targeted difficult vowels that were not emphasized in the sentences, where the challenge lay mainly in difficult consonant contrasts (e.g. [r-l], [s-ʃ]) present in only a few words per sentence. The sentences, therefore, contained many unproblematic words that were more easily understood by the listeners and ASR and that contributed, with the help of linguistic context, to higher intelligibility rates. From a research and pedagogical perspective, using isolated words allowed for the examination of challenging vowel contrasts; learners’ attention was fully directed toward the production of the target vowels, and the ASR written output informed learners of their vowel intelligibility. This type of activity appears to be particularly useful for the learning of segmentals, especially since learners can repeat single words as many times as they need (Liakin et al., 2015; McCrocklin, 2019). The activity did not involve the practice of spontaneous speech, which, though ecologically more valid, would not allow for direct comparisons of speech samples across participants. The use of sentences nonetheless enabled the practice and analysis of longer speech samples and provided an indication of intelligibility at the multi-word level.
One implication of this study relates to learners’ self-assessment and pronunciation awareness. As previously mentioned, it is challenging for L2 learners to accurately self-assess their pronunciation (Li, 2018; Strachan, Kennedy & Trofimovich, 2019; Trofimovich et al., 2016). Yet, as noted by Tsunemoto et al. (2022), “self-assessment is considered central to learners’ autonomy” (p. 135) and “accurate self-assessment in pronunciation is particularly important because teachers often lack time, resources, and training to provide pronunciation-specific instruction” (p. 136). The use of ASR in the course where the current data were collected aimed to serve this exact purpose: to provide EFL learners the pronunciation practice that is lacking in their courses and, thereby, promote self-regulation and autonomy. Thus, learners’ trust in the ability of ASR to recognize their speech in the same way a native speaker would is an important factor in encouraging autonomous practice. In the current study, learners were not directly asked whether they thought their intelligibility differed when assessed by native speakers and ASR technology – and their beliefs about ASR are beyond the scope of this paper; however, when asked to assess how well ASR recognized their speech from 1 (“very poor”) to 10 (“excellent”), their scores varied from 4 (Speaker 1) to 6 (Speaker 4) and 7 (Speakers 2 and 3). For cases like Speaker 2, who appeared to trust ASR and whose speech was better recognized by L1 listeners (86.04%) than by the ASR program (70.74%), ASR practice might lead to a flawed negative self-assessment of their pronunciation and potential frustration with the technology (also see Chen et al., 2020). Indeed, when asked in a post-study questionnaire whether they intended to continue using ASR in English, Speaker 2 responded, “No, because sometimes ASR failed to recognize my speech.” Conversely, Speaker 1 assessed ASR’s ability to recognize her speech quite low; yet her intelligibility was even lower based on the L1 listeners’ ratings. The relationship between the intelligibility of L2 speech as assessed by ASR and L1 listeners, and learners’ self-assessment of their intelligibility (and whether it changes after ASR practice), is an interesting area of exploration for future studies.
The second part of this study focused on the types of intelligibility or recognition issues encountered by L1 listeners and the ASR program. This is an important research focus because, for ASR to be useful as a pedagogical tool, it should approximate human listeners’ perceptions. With regard to the production of words in isolation, the findings revealed that the ASR program and L1 listeners identified similar proportions of error types. In particular, the most frequent error across the two groups was a vowel substitution, totaling 47.7% (ASR) and 53.2% (L1 listeners) of the errors, followed by a combination of errors of one vowel and one consonant (ASR: 20.6%, L1 listeners: 17.4%). This is not surprising considering that the word list consisted of minimal pairs targeting difficult vowel contrasts, as used in a larger project on the training of English vowels by EFL learners. The results, therefore, confirmed that these L2 speakers struggled with the production of target English vowels, especially the lax/tense distinction (Chang & Weng, 2013; Hu, Tao, Li & Liu, 2019), and that this affected their intelligibility to both ASR and L1 listeners.
With regard to the intelligibility and recognition of whole sentences, the analysis revealed striking similarities between the ASR program and L1 listeners. Interestingly, in line with the results of the word list, more than half of the recognition issues concerned segment(s) substitutions. Meanwhile, the ASR and L1 listeners similarly identified relatively few instances of extra segments, syllables, and words. In summary, the use of ASR software seems to have potential for L2 pronunciation teaching and learning, while the fact that ASR was shown to mirror L1 listeners’ responses for only two of the four speakers requires further investigation.
6. Conclusion
In their paper, Derwing et al. (2000: 600) stressed that in order to be a useful tool for feedback, ASR should “misunderstand the learner’s input whenever a human listener would misunderstand and in the same way that a human would.” Although the current study provided only partial evidence that ASR can identify L2 speech in the same way L1 listeners do (depending on individual speakers and the type of oral speech), some of the findings are promising. Yet, the results are limited to the evaluation of the production of four Taiwanese lower-intermediate EFL learners by 12 L1 English listeners who reported familiarity with English accented speech. It is therefore essential that future studies expand this scope of research to other languages (L1 and L2) and to different proficiency levels. It would also be interesting to see whether ASR recognition is more similar to intelligibility judgments by listeners less familiar with accented speech. Indeed, a more comprehensive understanding of how closely ASR matches the errors made by human listeners would enable us to objectively assess the extent to which ASR can be a useful tool for pronunciation learning. Demonstrating that accented speech is identified in similar ways by ASR and L1 listeners has the potential to increase learners’ faith in ASR technology, motivation, and autonomous learning.
Ethical statement and competing interests
The study was performed following institutional ethical guidelines. The confidentiality and anonymity of the participants was maintained throughout the study and the consequent analysis. The authors declare that there are no conflicts of interest.
About the authors
Solène Inceoglu is a senior lecturer in the School of Literature, Languages and Linguistics at the Australian National University. She received her PhD in Second Language Studies from Michigan State University. Her research focuses on second language acquisition, second language speech perception/production, pronunciation instruction, and psycholinguistics.
Wen-Hsin Chen is an assistant professor in the Language Center at National Central University, Taoyuan, Taiwan. Her research interests include second language acquisition, language classroom interaction, language processing, and Chinese language and culture.
Hyojung Lim is an associate professor in the Department of English Language and Industry at Kwangwoon University, Seoul. Her research interests revolve around second language vocabulary acquisition, computer-assisted language learning, and language testing.
Author ORCIDs
Solène Inceoglu, https://orcid.org/0000-0002-9571-4684
Wen-Hsin Chen, https://orcid.org/0000-0003-2296-118X
Hyojung Lim, https://orcid.org/0000-0001-7998-6500