
Automated assessment of second language comprehensibility: Review, training, validation, and generalization studies

Published online by Cambridge University Press:  28 March 2022

Kazuya Saito*
Affiliation:
University College London, London, UK
Konstantinos Macmillan
Affiliation:
Birkbeck, University of London, London, UK
Magdalena Kachlicka
Affiliation:
Birkbeck, University of London, London, UK
Takuya Kunihara
Affiliation:
University of Tokyo, Tokyo, Japan
Nobuaki Minematsu
Affiliation:
University of Tokyo, Tokyo, Japan
*Corresponding author. Email: [email protected]

Abstract

Whereas many scholars have emphasized the relative importance of comprehensibility as an ecologically valid goal for L2 speech training, testing, and development, eliciting listeners’ judgments is time-consuming. Following calls for research on more efficient L2 speech rating methods in applied linguistics, and growing attention toward using machine learning on spontaneous unscripted speech in speech engineering, the current study examined the possibility of establishing quick and reliable automated comprehensibility assessments. Orchestrating a set of phonological (maximum posterior probabilities and gaps between L1 and L2 speech), prosodic (pitch and intensity variation), and temporal measures (articulation rate, pause frequency), the regression model significantly predicted how naïve listeners intuitively judged low, mid, high, and nativelike comprehensibility among 100 L1 and L2 speakers’ picture descriptions. The strength of the correlation (r = .823 for machine vs. human ratings) was comparable to naïve listeners’ interrater agreement (r = .760 for humans vs. humans). The findings were successfully replicated when the model was applied to a new dataset of 45 L1 and L2 speakers (r = .827) and tested under a more freely constructed interview task condition (r = .809).

Type
Methods Forum
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Open Practices
Open materials
Copyright
© The Author(s), 2022. Published by Cambridge University Press

Introduction

Adult second language (L2) speech is generally foreign-accented due to a range of factors, such as the influence of first language (L1) phonetic systems (Flege & Bohn, Reference Flege, Bohn and Wayland2021), perceptual-cognitive aptitude (Saito, Reference Saito and Akiyama2017), and identity (Sung, Reference Sung2016). Thus, many scholars have emphasized the importance of setting realistic goals for adult L2 learners, prioritizing understanding over nativelikeness (Munro & Derwing, Reference Munro and Derwing1995). There is ample evidence that many L2 speakers are perceived as sufficiently comprehensible regardless of foreign accentedness (Isaacs & Trofimovich, Reference Isaacs and Trofimovich2012); and L2 speakers can continue to enhance aspects of speech affecting comprehensibility as long as they use the target language, receive feedback, and strive to improve with a view to successful communication (Saito & Akiyama, Reference Saito and Akiyama2017). Although teachers play an important role in providing immediate feedback, helping students understand their own comprehensibility, and promoting autonomous L2 speech learning, such a resource (provision of feedback) is limited in foreign language classrooms (Muñoz, Reference Muñoz2014). In the field of speech engineering, a growing amount of research attention has been directed toward the development of more robust automatic recognition of not only controlled but also spontaneous L2 speech, and toward the application of this technology to L2 speech training and evaluation (Fu et al., Reference Fu, Chiba, Nose and Ito2020). Interfacing perspectives from education and speech sciences, we took a first step toward training, validating, and generalizing the automatic assessment of comprehensibility in the context of 190 L1 and L2 speakers of English.

Second Language Comprehensibility

Given that technology allows people to interact worldwide regardless of physical constraints using videoconferencing tools and social networking, attaining adequate L2 speech proficiency is considered a key skill in academic, business, and social settings. On the one hand, some L2 users strive to attain nativelike proficiency (Scales et al., Reference Scales, Wennerstrom, Richard and Wu2006) and traditional teaching syllabi highlight native speakers as an ideal instructional model (Foote et al., Reference Foote, Holtby and Derwing2011). On the other hand, research has convincingly shown that few postpubertal L2 learners can attain nativelike phonological accuracy and fluency as their L2 system builds on and thus inevitably interacts with their already-developed first language system (Flege & Bohn, Reference Flege, Bohn and Wayland2021). When it comes to English, attaining nativelike phonological proficiency is arguably unnecessary as most interactions take place between L2 users (Pennycook, Reference Pennycook2017). Thus, a number of scholars have pointed out the importance of setting more realistic goals for postpubertal L2 speech learning, such as the enhancement of comprehensibility rather than the attainment of nativelike proficiency (Munro & Derwing, Reference Munro and Derwing1995).

From a methodological point of view, simulating behaviors in real-life conversation, L2 comprehensibility is operationalized as listeners’ intuitive judgments of spontaneous L2 speech on a 9-point scale (1 = difficult to understand, 9 = easy to understand). Comprehensibility is thought to represent the amount of effort a listener needs to understand the speaker’s message, irrespective of the degree of foreign accentedness; it thus captures the process, rather than the product, of listeners’ understanding (for further discussion of “intelligibility,” rather than “comprehensibility,” as a barometer of actual understanding, and of the related terminological and methodological issues, see Levis, Reference Levis2018). According to Derwing and Munro’s seminal work, accented L2 speech can be perceived as comprehensible because certain phonological errors do not hinder listeners’ understanding (e.g., Munro & Derwing, Reference Munro and Derwing1995). Since Munro and Derwing (Reference Munro and Derwing1995), a number of follow-up studies have shown that listeners’ understanding is negatively influenced by certain (but not all) phonological errors, such as the mispronunciation of segmentals with high functional load (Suzukida & Saito, Reference Suzukida and Saito2019), melodic inaccuracies (Kang et al., Reference Kang, Rubin and Pickering2010), and dysfluencies (Suzuki & Kormos, Reference Suzuki and Kormos2020; for a meta-analysis of the phonological correlates of L2 comprehensibility, see Saito, Reference Saito2021). Additionally, certain listeners likely assign higher and thus more lenient comprehensibility ratings when they have greater familiarity with foreign accents (Kennedy & Trofimovich, Reference Kennedy and Trofimovich2008), linguistic training (Saito et al., Reference Saito, Trofimovich and Isaacs2017), pedagogical experience (Isaacs & Thomson, Reference Isaacs and Thomson2020), and greater awareness of the importance of L2 comprehensibility (rather than accentedness) (Saito et al., Reference Saito, Tran, Sun, Magne and Ilkan2019).

From a theoretical perspective, it is important to note that comprehensibility can serve as an index of adult L2 speech development. As stated in the interaction account of L2 acquisition (Mackey, Reference Mackey2012), language learning takes place precisely when L2 speakers actively participate in conversational interactions and end up in communication breakdowns due to linguistic errors. L2 speakers work together with interlocutors when comprehensibility is not sufficient by relying on a range of negotiation-for-meaning behaviors, such as clarification requests, and comprehension and confirmation checks. This whole sequence is hypothesized to help L2 speakers become more comprehensible, functional, and proficient users of the target language (for empirical evidence, see Saito & Akiyama, Reference Saito and Akiyama2017).Footnote 1 According to longitudinal (Derwing & Munro, Reference Derwing and Munro2013) and cross-sectional investigations (Saito, Reference Saito2015a), L2 learners tend to show quick improvements in comprehensibility and accentedness within the first few years of immersion. Whereas attaining nativelike L2 speech proficiency may be limited to certain individuals with earlier ages of acquisition (Saito, Reference Saito2015b), linguistically similar L1 backgrounds (e.g., Bongaerts et al., Reference Bongaerts, van Summeren, Planken and Schils1997), and/or special language aptitude (e.g., Hu et al., Reference Hu, Ackermann, Martin, Erb, Winkler and Reiterer2013 for phonemic coding; Kachlicka et al., Reference Kachlicka, Saito and Tierney2019 for perceptual acuity), many L2 learners can continue to enhance the comprehensibility aspects of their speech over time as long as they use the target language on a regular basis. Similar learning patterns have been observed in various foreign language classroom settings (e.g., Nagle, Reference Nagle2018).

Due to its significant practical and theoretical relevance, many scholars have promoted comprehensibility as an ecologically valid, more realistic target for adult L2 speech assessment, not only in high-stakes testing settings but also for practitioners (Isaacs et al., Reference Isaacs, Trofimovich and Foote2017). However, eliciting listeners’ comprehensibility judgments is a time-consuming task (e.g., about 1 hour for rating 70 samples in Derwing & Munro, Reference Derwing and Munro2013). In response, some researchers use shorter stimuli to avoid listener fatigue (e.g., 4.5–10.5 seconds in Derwing & Munro, Reference Derwing and Munro1997), while others recommend collecting L2 rating data via online platforms to reduce the burden on researchers (e.g., Nagle & Rehman, Reference Nagle and Rehman2021 for Amazon Mechanical Turk). Given the time-consuming nature of the assessment procedure, a growing amount of attention has also been given to the idea of automated L2 comprehensibility scoring (O’Brien et al., Reference O’Brien, Derwing, Cucchiarini, Hardison, Mixdorff, Thomson, Strik, Levis, Munro, Foote and Levis2018). The current study took an exploratory approach toward examining this topic.

Automatic Assessment of Second Language Speech

To date, research has provided a range of pedagogic techniques that teachers can use (Pennington, Reference Pennington2021) and online learning tools that learners can access (Loewen et al., Reference Loewen, Crowther, Isbell, Kim, Maloney, Miller and Rawal2019). To make the most of such opportunities, provision of online feedback plays a critical role for many reasons. First and foremost, learners need to understand the current state of their L2 proficiency, compare it with their target, and make an effort to fill in the gaps accordingly (Lyster & Saito, Reference Lyster and Saito2010). Without such adequate self- and teacher-assessments, learners may either overestimate or underestimate their own L2 proficiency and struggle to decide the best course of action, slowing down acquisitional processes (Trofimovich et al., Reference Trofimovich, Isaacs, Kennedy, Saito and Crowther2016).

Although feedback is integral to successful L2 speech learning, the provision of feedback is a resource-intensive task, especially in foreign language classrooms. Not only are students’ L2 practice opportunities limited to several hours of weekly classroom instruction (which is typically devoid of conversation activities; Nishino & Watanabe, Reference Nishino and Watanabe2008) but teachers also lack ample time to record, listen to, and provide feedback on each student’s speech, given that they often teach many students at one time (Muñoz, Reference Muñoz2014). Thus, many researchers and practitioners alike are interested in the introduction of automated assessment as a pedagogical tool for optimal L2 speech education (O’Brien et al., Reference O’Brien, Derwing, Cucchiarini, Hardison, Mixdorff, Thomson, Strik, Levis, Munro, Foote and Levis2018).

As summarized in Table 1, a range of studies have been conducted to examine the extent to which automated scoring can simulate human listeners’ assessments of L2 oral proficiency. In terms of method, scholars have primarily drawn on correlational analyses. First, L2 speech stimuli were collected using a range of speaking tasks (controlled, spontaneous). Then, the stimuli were evaluated by trained and untrained human listeners for various aspects of L2 oral proficiency rubrics (e.g., pronunciation accuracy, overall speaking proficiency, perceived fluency and comprehensibility). Finally, the stimuli were submitted to a range of automated speech measures (phonological and fluency analyses). To examine the potential of automated L2 speech assessment, researchers explored the strength of the associations between human L2 oral proficiency ratings and automated speech scores. With respect to automated speech measures, there is much methodological variation in the primary studies. Although each study adopted different measures, they can be roughly categorized into three subgroups with a view to methodological synthesis—that is, fluency, melodic, and phonological measures. Relevant citations in the text that follows derive from Table 1 (i.e., a summary of automated L2 speech assessment studies).

Table 1. Summary of 11 key studies on automated L2 speech assessment

a Features Experiment 2.

b Serves as a pilot study for the current study (see the “Method” section).

Fluency Measures

This subcategory refers to a set of outcome measures that tap into the temporal characteristics of speech. As stated in Tavakoli and Skehan’s (Reference Tavakoli, Skehan and Ellis2005) model, such fluency features can be categorized as speed (e.g., speech and articulation rate), breakdowns (e.g., filled and unfilled pause frequency), and repairs (e.g., repetition and self-correction frequency). Whereas some studies covered all three of these aspects of fluency (e.g., Cucchiarini et al., Reference Cucchiarini, Strik and Boves2000 for speed, breakdown, and repair), other studies focused on one dimension (e.g., Neumeyer et al., Reference Neumeyer, Franco, Digalakis and Weintraub2000 for speed). Within the same subgroup (e.g., breakdown), the measures were operationalized differently (e.g., Cucchiarini et al., Reference Cucchiarini, Strik and Boves2000 for the frequency of silent pauses [defined as > 200 ms] per sample vs. Kang & Johnson, Reference Kang and Johnson2018 for the frequency of silent pauses [defined as > 100 ms]).

Melodic Measures

This subcategory refers to a set of outcome measures related to the varied use of pitch and intensity to mark stress and intonation. Some scholars have adopted different types of pitch measures, such as tone choices and relative pitch (Kang & Johnson, Reference Kang and Johnson2018). Others have also adopted both pitch and amplitude measures, such as pitch and intensity variation (e.g., van Santen et al., Reference van Santen, Prud’hommeaux and Black2009). It is important to acknowledge that we do not yet know how to conduct the automated analyses of word and sentence stress accuracy. In the field of L2 speech, such analyses require linguistically trained coders to examine how each instance of word and sentence stress has been marked (with higher pitch, longer duration, and/or greater intensity) in a contextually appropriate manner (e.g., Isaacs & Trofimovich, Reference Isaacs and Trofimovich2012 for the manual analyses of word stress and intonation accuracy).

Phonological Measures

Thanks to advancements in automatic speech recognition (ASR) technology, one notable change in this line of work concerns the development, sophistication, and application of machine algorithms to calculate phonetic accuracy scores (i.e., qualitative measures; Zechner & Evanini, Reference Zechner and Evanini2020). In ASR, speech classes are generally defined in a more sophisticated way than linguistically driven phonemes. Speech samples are first segmented into small frames (e.g., approximately 20 ms) and converted into spectra (i.e., energy at different frequencies) using a Fast Fourier Transform. Because the acoustic realization of a phoneme /x/ depends on its phonemic environment, /x/ is further divided into the full set of context-dependent classes /a/–/x/+/b/, that is, the acoustic realizations of /x/ produced after /a/ and before /b/. When the number of linguistically driven phonemes is $N$, the full set of /a/–/x/+/b/ has $N^2$ variants, all of which are considered speech classes related to /x/.
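As a toy illustration of this counting argument (ours, not the authors’), the following Python sketch enumerates the context-dependent classes for one center phoneme over a four-phoneme inventory:

```python
# Illustrative sketch: for a fixed center phoneme /x/, the context-dependent
# classes /a/-/x/+/b/ pair every possible left and right neighbor, yielding
# N**2 variants when the phoneme inventory has N members.
from itertools import product

phonemes = ["a", "b", "k", "s"]  # toy inventory, N = 4
center = "x"
triphones = [f"{left}-{center}+{right}" for left, right in product(phonemes, repeat=2)]
assert len(triphones) == len(phonemes) ** 2  # N**2 = 16 context-dependent classes
print(triphones[:3])  # ['a-x+a', 'a-x+b', 'a-x+k']
```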

In the classical ASR framework based on Hidden Markov Models (HMM), a large number of spectrograms were collected from thousands of L1 speakers to build an acoustic model for each speech class. Because the spectrogram was used as the fundamental speech representation, it inevitably conveyed a range of extralinguistic factors, such as speaker identity, age, gender, and microphone. These factors in turn degraded the accuracy of ASR.

Recently, a more sophisticated framework for ASR based on Deep Neural Networks (DNN) has been used. Various DNN-based frameworks have been proposed, and one of them uses the posteriorgram (rather than the spectrogram) as the fundamental speech representation. A posteriorgram is defined as a temporal sequence of posterior probability distributions over classes, indicating the probability that an input observation belongs to a particular class. Unlike in a spectrogram, extralinguistic factors are well suppressed in a posteriorgram. Whereas the spectrograms of “hello” generated by a male speaker and a female speaker can be substantially different, the posteriorgrams of the same speech samples can be indistinguishable. If we use the linguistically driven phonemes as the speech classes, a posteriorgram can be viewed as a probabilistic version of its phonemic transcript (i.e., how similar the speech signals are to each phonemic class). Unlike the spectrogram (which provides an acoustic representation of speech), a posteriorgram can serve as a linguistic or phonetic representation of speech.Footnote 2
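To make the notion concrete, here is a minimal Python sketch of a posteriorgram, assuming a frame-wise acoustic model whose outputs we stand in for with random logits (the array shapes and the softmax step are our illustration, not the study’s implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    # Convert raw per-class scores into a probability distribution per frame.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_frames, n_classes = 300, 50                    # e.g., 3 s of 10-ms frames, 50 classes
logits = rng.normal(size=(n_frames, n_classes))  # stand-in for DNN acoustic-model outputs

posteriorgram = softmax(logits)                  # one posterior distribution per frame
assert np.allclose(posteriorgram.sum(axis=1), 1.0)
```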

Once ASR models are established in a target language, they can be applied to analyze the acoustic profile of L2 speech data. Looking at the posteriorgram distance between L1 and L2 speech data, for example, some studies have explored the overall acoustic similarity of L2 realizations of phonemes to their L1 equivalents. For a more detailed account of posteriorgram-based L1 versus L2 phoneme categorizations, see Shen et al. (Reference Shen, Yasukagawa, Saito, Minematsu and Saito2021) and Supporting Information-B. Another intriguing idea for improving assessment accuracy concerns evaluating the quality of L2 speech using both L1 and L2 corpus data. For example, the proficiency of Japanese speakers of English in Fu et al. (Reference Fu, Chiba, Nose and Ito2020) was assessed by comparing the outcomes of ASR trained on L1 Japanese data with those of ASR trained on L2 English data. L2 English speech with a low word error rate in both the L1 Japanese and the L2 English ASR (i.e., a nil or small gap) could be considered highly proficient. In contrast, if the word error rate was low in the L1 Japanese ASR but high in the L2 English ASR (i.e., a large gap), such samples could be considered less proficient.
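A hedged sketch of this gap heuristic follows (our reading of the description above, with hypothetical thresholds; not Fu et al.’s actual scoring function):

```python
def proficiency_signal(wer_l1_ja_asr: float, wer_l2_en_asr: float) -> str:
    """Interpret the gap between two recognizers' word error rates (WER)."""
    gap = wer_l2_en_asr - wer_l1_ja_asr
    if wer_l1_ja_asr < 0.2 and wer_l2_en_asr < 0.2:  # low WER in both: nil/small gap
        return "likely highly proficient"
    if wer_l1_ja_asr < 0.2 and gap > 0.3:            # recognized well only by the L1-Ja ASR
        return "likely less proficient"
    return "indeterminate"                           # thresholds (0.2, 0.3) are hypothetical

print(proficiency_signal(0.10, 0.12))  # 'likely highly proficient'
print(proficiency_signal(0.10, 0.55))  # 'likely less proficient'
```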

Overall Findings

Once again, we would like to remind the readers that the primary studies operationalized both L2 speech ratings and automated measures in a substantially different fashion. Thus, the intention of the methodological synthesis is to provide overall patterns. All in all, while human listeners’ agreement is generally strong (r = .7–.9), the relationship between human and automated scoring is comparable (r = .6–.8). Earlier studies demonstrated strong correlations between human and machine evaluations of L2 speech elicited from controlled speech tasks (e.g., word and sentence reading). More recent studies have explored the replicability of the findings, focusing on more spontaneous speech samples (e.g., monologues, interviews, picture descriptions). Thus far, scholars have convincingly shown that L2 speech proficiency scores were strongly associated with the results of automated analyses of temporal features in L2 speech (e.g., Ginther et al., Reference Ginther, Dimova and Yang2010).

Motivation for the Current Study

Throughout the past 15 years, many scholars have extensively examined what contributes to naïve listeners’ perceptions of foreign-accented yet sufficiently comprehensible speech (Munro & Derwing, Reference Munro and Derwing1995). Such intuitive judgments of comprehensibility, intelligibility, and communicative competence (rather than nativelikeness) are believed to play a key role in communicative success in today’s globalized society, in which most communication in English takes place between L2 speakers (Pennycook, Reference Pennycook2017). From a theoretical standpoint, comprehensibility serves as a crucial index of adult L2 speech development, as learners’ speech becomes increasingly comprehensible (rather than nativelike) as a function of increased practice and immersion experience (Derwing & Munro, Reference Derwing and Munro2013; Saito, Reference Saito2015a, Reference Saito2015b). The quick, reliable, and automatic assessment of L2 comprehensibility has been strongly called for among practitioners (to help students achieve comprehensible L2 speech via feedback; Trofimovich et al., Reference Trofimovich, Isaacs, Kennedy, Saito and Crowther2016) and researchers (to assess the different stages of adult L2 speech learning; Isaacs et al., Reference Isaacs, Trofimovich and Foote2017).

To advance the agendas of L2 comprehensibility and automatic assessment research, the current study serves as a first attempt to explore the extent to which machine-based assessments can simulate naïve listeners’ comprehensibility ratings of 100 L1 and L2 speakers’ semispontaneous speech (Study 1). Subsequently, we further delved into the validity, replicability, and generalizability of the regression model when it was applied to new speakers (Study 2); and tested under a more freely constructed interview task condition (Study 3).

Whereas some prior work has demonstrated the potential of automated L2 speech assessment (e.g., r = .6–.7 for automated vs. human assessments in Table 1), these studies have focused largely on controlled speech for which transcripts are available. The current investigation examines the potential of automated assessment of spontaneous speech. Although L2 speech research has shown that listeners attend to segmental, melodic, and temporal information while judging the quality of foreign-accented speech, most of the existing studies have adopted either phonological, melodic, or fluency measures. Based on the methodological synthesis presented in the preceding text (cf. Table 1), the current study adopted all three types of measures (phonological, melodic, and fluency) within the same model. As reviewed earlier, these measures included (a) speed and breakdown analyses for the fluency measures (e.g., Cucchiarini et al., Reference Cucchiarini, Strik and Boves2000 for articulation rate and pause ratio); (b) variation analyses for the melodic measures (e.g., van Santen et al., Reference van Santen, Prud’hommeaux and Black2009 for pitch and intensity variation); and (c) posterior-based analyses for the phonological measures (e.g., Shen et al., Reference Shen, Yasukagawa, Saito, Minematsu and Saito2021 for posterior probabilities and gaps).

Research Questions

The following three research questions and predictions were formulated:

  1. To what degree can automatic comprehensibility scoring simulate native listeners’ intuitive judgments of various levels of comprehensibility (Study 1)?

  2. Can automated comprehensibility scoring predict different levels of comprehensibility when applied to new datasets (Study 2)?

  3. Can automated comprehensibility scoring predict different levels of comprehensibility when applied to new task contexts (Study 3)?

Predictions

As shown in previous studies (e.g., r = .7–.8; Saito, Reference Saito2021 for a meta-analysis), about 50–60% of the variance in L2 comprehensibility can be explained by a range of manual phonological measures (linguistically trained coders’ analyses of segmental and melodic accuracy and temporal fluency). Thus, we expected to find relatively large correlation coefficients (r = .7–.8) between the automated phonological (maximum posterior probabilities, posterior gaps to natives), melodic (pitch and intensity variation), and fluency measures (articulation rate, pause ratio) and native listeners’ L2 comprehensibility judgments (R1). It was also predicted that such findings could be replicated under new speaker (R2) and task conditions (R3).

Study 1: Model Training Phase

Speakers

As a part of the investigators’ larger project, the team established a speech dataset of 1,000+ Japanese learners with a wide range of proficiency and experience levels both in Japan (i.e., beginner-to-intermediate learners with relatively limited opportunities to use L2 in English-as-a-Foreign-Language settings) and in Canada (i.e., intermediate-to-advanced learners who use their L2 on a daily basis in English-as-a-Second-Language settings). Their speech was elicited using a range of speaking tasks. The dataset was derived from a series of L2 speech projects that the team has been working on over the past 10 years; parts of the cross-sectional and longitudinal analyses have been reported elsewhere (Saito, Reference Saito2015a, Reference Saito2015b for experienced Japanese speakers; Saito & Hanzawa, Reference Saito and Hanzawa2016 for inexperienced Japanese speakers).

Given the length of time required to collect comprehensibility judgments (e.g., more than one hour of explanation, training, and rating for 50 30-second samples) and to avoid listener fatigue, only samples of a single speaking task (picture description) from 100 speakers were used. Four subgroups of speakers were carefully selected to represent various levels of comprehensibility. While all of them had started learning L2 English in Japan from Grade 7 (13–14 years of age), they differed substantially in terms of the length and timing of their immersion experience.

In line with both cross-sectional and longitudinal evidence of a significant relationship between increased Length of Residence (LOR) and enhanced comprehensibility (e.g., Derwing & Munro, Reference Derwing and Munro2013; Saito, Reference Saito2015a), participants’ LOR was taken into consideration to create three subgroups (i.e., inexperienced, moderately experienced, and highly experienced Japanese speakers). Of course, LOR may not always reflect the amount of input that L2 learners have received. For example, the frequency of L2 use may vary widely among L2 speakers even if they stay in an L2-speaking country for the same period, with some choosing to use only the L1 rather than the L2 (for relevant discussion, see Flege & Bohn, Reference Flege, Bohn and Wayland2021). To ensure that LOR could serve as an index of L2 experience for the moderately and highly experienced Japanese speakers, efforts were made to recruit only those who reported English (rather than Japanese) as their main language of communication at work and/or home. As shown in our precursor projects (e.g., Saito, Reference Saito2015a), LOR served as a significant predictor of L2 comprehensibility. In essence, the current dataset comprised Japanese speakers with substantially different levels of comprehensibility.

  • Inexperienced Japanese Speakers of English (n = 10): This group represents low-level L2 comprehensibility. All the participants were first-year university students in Tokyo (M age = 20.4 years; Range = 18–21 years). While all of them were registered for three hours of English classes a week at the time of the project, they reported little conversational use of the language outside the classroom. None of them had any immersion experience abroad.

  • Moderately Experienced Japanese Speakers of English (n = 40): This group represents mid-level L2 comprehensibility. The participants were residents in Canada (M age = 34.7 years; Range = 22–48 years). Their lengths of immersion varied widely (M = 1.4 years; Range = 0.1 to 5 years). They were considered late bilinguals as they had moved to major cities in Canada (Vancouver, Montreal, and Calgary) after puberty (M age = 28.3 years; Range = 19–40 years).

  • Highly Experienced Japanese Speakers of English (n = 40): This group represents high-level L2 comprehensibility. All of them were late bilinguals with their ages of arrival being after puberty (M age = 27.1 years; Range = 21–36 years) and were long-term residents with at least six years of immersion experience in Canada (M = 11.3 years; Range = 6–18 years). In addition, they reported extensive and regular use of L2 English as their primary language of communication in various settings. Following the standard in SLA research (see DeKeyser, Reference DeKeyser2013), they were considered highly experienced attainers who had reached the upper range of L2 speech proficiency.

  • L1 Baseline (n = 10): This group serves as an L1 speaker baseline. A total of 10 L1 speakers of English were recruited in Vancouver, Canada (M age = 27.5 years; Range = 18–37 years). They reported English as their L1 from birth onward, with both of their parents being L1 English speakers.

Speaking Task

All the participants completed a timed picture description task. Building on a similar task procedure (Munro & Mann, Reference Munro and Mann2005), participants described seven different pictures with five seconds of planning time for each photo. Three key words were provided per photo to help low proficiency speakers to produce sufficient lengths of spontaneous speech without too much dysfluency due to insufficient vocabulary knowledge. The first four pictures were used as practice for participants to get used to the task procedure (describing a photo in English with minimal planning), and the last three pictures (A, B, and C) were used for the final analyses.

The pictures depicted (a) a table left out in a driveway (key words: rain, table, and driveway), (b) three men playing rock music with guitars (key words: three guys, guitar, and rock music), and (c) a long road under a cloudy blue sky (key words: blue sky, road, and cloud). These key words were selected to elicit segmental, melodic, and syllabic structures that are especially difficult for Japanese learners of English. For example, Japanese speakers are likely to neutralize the English /r/-/l/ contrast (“rain, rock, brew, crowd” vs. “lane, lock, blue, cloud”) and, influenced by borrowed words (i.e., Katakana), to insert epenthetic vowels between consecutive consonants (/dəraɪvə/ for “drive,” /θəri/ for “three,” /səkaɪ/ for “sky”) and after word-final consonants (/teɪbələ/ for “table,” /myuzɪkə/ for “music”).

All the samples were recorded at individual meetings that took place at a community center, a university lab, or participants’ residences (prior to the COVID-19 pandemic) using Roland R-05 digital audio recorders at a 44.1 kHz sampling rate with 16-bit quantization.

Comprehensibility Judgments

Listeners

A total of 10 L1 listeners of General American English were recruited online (M age = 24.8 years). They were all born in the United States and raised by monolingual parents. While all of them held a BA and/or MA degree, none of them majored in linguistics. Using a 6-point scale (1 = not at all, 6 = very much), they all reported that they were strongly familiar with foreign accented English speech (M = 5.8, Range = 4–6) but their familiarity with Japanese-accented English varied (M = 3.6, Range = 1–6). In line with Isaacs and Thomson’s (Reference Isaacs and Thomson2020) categorization, these listeners can be considered naïve rather than expert. Following Munro and Derwing (Reference Munro and Derwing1995), L2 comprehensibility was operationalized as naïve listeners’ intuitive judgments of spontaneous speech.

Procedure

Due to the pandemic, all of them completed the rating sessions individually with an investigator using a video-conferencing tool. All the rating sessions were conducted through the Gorilla platform for online research (Anwyl-Irvine et al., Reference Anwyl-Irvine, Massonnié, Flitton, Kirkham and Evershed2020). During the data collection, the investigator engaged with the participants to provide training, monitor their rating behaviors, respond to any enquiries, and help solve any technological problems. First, the listeners familiarized themselves with the listening materials (three picture descriptions by 100 L1 and L2 speakers) and the online rating procedure using Gorilla. Then, each listener received instruction on what characterized comprehensibility using the following definition:

  • This term refers to how much effort it takes to understand what someone is saying. If you can understand (what the picture story is about) with ease, then a speaker is highly comprehensible. However, if you struggle and must listen very carefully, or in fact cannot understand what is being said at all, then a speaker has low comprehensibility.

All the samples were played in a randomized order. Upon hearing each sample once, they made an intuitive judgment using a 9-point scale (1 = difficult to understand, 9 = easy to understand). No repeat button was provided.

First, the listeners practiced the rating procedure using three samples (not included in the main dataset). Drawing on the larger speech dataset that the research team had established (beyond the current study), the author carefully identified three practice samples that represented the three subgroups of Japanese speakers (inexperienced, moderately experienced, highly experienced). After the practice session was completed, the listeners proceeded to judge the comprehensibility of the 100 samples. Each session took around 2 hours with a 5-minute intermission halfway through. In accordance with the recommended quality control measures for online L2 rating data collection (Nagle & Rehman, Reference Nagle and Rehman2021), the platform was designed so that the listeners had to listen to the full length of each sample (30 sec) before they rated its comprehensibility. The entire session was carefully monitored by the investigator using a video-conferencing tool.

Automated Fluency Measures

Following Tavakoli and Skehan’s (Reference Tavakoli, Skehan and Ellis2005) model of fluency, and the automated fluency measures adopted in Cucchiarini et al. (Reference Cucchiarini, Strik and Boves2000), the two major temporal dimensions of L2 speech were measured: speed (articulation rate) and breakdown (pause ratio). The repair aspect of fluency (the ratio of self-corrections and repetitions) was not taken into consideration, as recent literature has suggested it is a trait of first language fluency (Duran-Karaoz & Tavakoli, Reference Duran-Karaoz and Tavakoli2020) and/or of cognitive individual differences (Zuniga & Simard, Reference Zuniga and Simard2019).

Using the “To TextGrid (silences)” function in Praat (Boersma & Weenink, Reference Boersma and Weenink2019), pauses were detected as silences longer than 250 ms. The silence threshold (relative to maximum intensity) was set to –20 dB because some speech samples included background noise. Next, the number of syllables was calculated based on de Jong and Wempe’s (Reference de Jong and Wempe2009) Praat script. Nuclei were detected when the following phonetic conditions were met: (a) peak intensity was 2 dB above the median intensity, and (b) the peak was preceded and followed by intensity dips of at least 4 dB. The Praat script did not automatically identify and remove filled pauses or repetitions (both of which were included in phonation time). All the temporal information was used to calculate the following two fluency measures (a minimal sketch implementing them appears after the list):

  (1) Articulation rate: This was calculated by dividing the number of syllables by the phonation time (sample duration minus all pauses).

  (2) Pause ratio: This was calculated by dividing the number of unfilled pauses by the sample duration.
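The following Python sketch implements the two measures from already-detected silence intervals and a syllable count; the function names and toy values are ours (the study used Praat and de Jong and Wempe’s script, not this code):

```python
def articulation_rate(n_syllables: int, duration_s: float,
                      pauses: list[tuple[float, float]]) -> float:
    # Syllables per second of phonation (sample duration minus all pauses).
    phonation_time = duration_s - sum(end - start for start, end in pauses)
    return n_syllables / phonation_time

def pause_ratio(pauses: list[tuple[float, float]], duration_s: float) -> float:
    # Unfilled pauses (silences of at least 250 ms) per second of sample.
    unfilled = [(s, e) for s, e in pauses if (e - s) >= 0.25]
    return len(unfilled) / duration_s

pauses = [(2.1, 2.6), (5.0, 5.4)]           # toy (start, end) silence intervals in seconds
print(articulation_rate(60, 30.0, pauses))  # ~2.06 syllables/s
print(pause_ratio(pauses, 30.0))            # ~0.067 pauses/s
```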

Automated Phonological Measures

To capture the multilayered nature of comprehensibility, it is important to include not only speech features related to the quantity of phonation (fluency) but also those related to the phonological quality of pronunciation (accuracy). As reviewed earlier, a range of previous studies have adopted posteriorgram-based analyses (DNN-HMM; Chen et al., Reference Chen, Qian and Yu2018; Zechner et al., Reference Zechner, Higgins, Xi and Williamson2009). In preparation for the current investigation, a preliminary project was conducted wherein a range of analysis methods for posteriorgram-based data were proposed, piloted, and refined with a view to optimal L2 speech assessment (Shen et al., Reference Shen, Yasukagawa, Saito, Minematsu and Saito2021). Two posteriorgram-based analyses were adopted to assess phonological quality: (a) maximum posterior probabilities and (b) posterior gaps to natives. They were found to greatly boost the predictive power of the automatic assessment of perceived fluency (r = .8–.9 for machine vs. humans).

First, DNN models were trained with the Wall Street Journal corpus, which featured 37,416 utterances spoken by 123 L1 speakers of General American English and the corresponding scripts. Because the current dataset (100 picture descriptions) included noticeable noise, it was necessary to adjust the sound clarity of the corpus. To this end, four levels of babble noise and machine noise (computer-synthesized distortions) were added to the Wall Street Journal corpus (signal-to-noise ratios = 10, 30, 40, and 50 dB). As a result, noise-robust English DNN-HMM acoustic models were trained on the Wall Street Journal corpus (Povey et al., Reference Povey, Ghoshal, Boulianne, Burget, Glembek, Goel, Hannemann, Motlicek, Qian, Schwarz, Silovsky, Stemmer and Vesely2011). Once the models were trained for ASR, any input sample could be converted to its posteriorgram.
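As a minimal sketch of SNR-controlled noise mixing of the kind described above (array-level only; the study’s corpus-scale augmentation pipeline is not specified here):

```python
import numpy as np

def add_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    # Scale the noise so that 10 * log10(P_speech / P_noise_scaled) equals snr_db.
    noise = noise[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(1)
clean = rng.normal(size=16000)   # 1 s of toy "speech" at 16 kHz
babble = rng.normal(size=16000)  # stand-in for babble/machine noise
augmented = [add_noise(clean, babble, snr) for snr in (10, 30, 40, 50)]  # the four SNR levels
```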

The number of speech classes used in posteriorgrams is generally large (n = 2,000–3,000) (for further discussion of the concept of posterior-based phonemes, see Supporting Information B). As proposed in Kashiwagi et al. (Reference Kashiwagi, Zhang, Saito and Minematsu2016), the Bhattacharyya distance between two states was calculated directly from their state posterior probabilities through Bayes’s theorem; on this basis, the original dimension of the posteriorgrams (n = 2,000) was reduced to 50. From the posteriorgram of each utterance, after pause removal, the following two quality measures were calculated:

  (3) Averaged Maximum Posterior Probabilities: When L1 posteriorgrams were visualized with context-independent phonemes, the posterior vector at each time point appeared to be a one-hot vector; that is, the intended phoneme had a probability close to 1.0 and the others had probabilities of almost 0.0. Here, from a given posteriorgram, the maximum posterior probability was calculated at each time point and then averaged over time. For example, at any point in time, there are always probability scores for the 50 phoneme states. The maximum posterior probability score was identified out of the 50 scores, and its corresponding phoneme was assumed to be the speaker’s intended phoneme. The higher the average, the more distinct the pronunciation of the utterance.

  (4) Averaged Posterior Gaps to Natives: For each speaker, an averaged posterior vector was calculated. Because 10 native speakers were included in the main dataset, a gap from each learner to each native speaker was calculated, generating 10 gap scores in total. The distance was calculated via the Bhattacharyya metric. The averaged gap scores were thought to quantify proximity to nativelikeness based on the distribution of perceived phonemes. Figure 1 visualizes the averaged posterior vectors and the posterior gaps: the former characterizes the quality of pronunciation, represented as a location in the feature space, and the latter characterizes the distances to the 10 native speakers (see the sketch following Figure 1).

    Figure 1. Conceptual summary of averaged posterior and posterior gap.
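A minimal sketch of measures (3) and (4), assuming post is a (frames × 50) posteriorgram after pause removal and native_means holds the 10 native speakers’ averaged posterior vectors; the shapes and the discrete form of the Bhattacharyya distance are our illustration, not the study’s exact implementation:

```python
import numpy as np

def avg_max_posterior(post: np.ndarray) -> float:
    # Measure (3): frame-wise maximum posterior, averaged over time.
    return float(post.max(axis=1).mean())

def bhattacharyya(p: np.ndarray, q: np.ndarray) -> float:
    # Bhattacharyya distance between two discrete probability vectors.
    return float(-np.log(np.sum(np.sqrt(p * q))))

def avg_posterior_gap(post: np.ndarray, native_means: np.ndarray) -> float:
    # Measure (4): mean distance from the learner's averaged posterior
    # vector to each of the 10 native speakers' averaged vectors.
    learner_mean = post.mean(axis=0)
    return float(np.mean([bhattacharyya(learner_mean, nm) for nm in native_means]))

rng = np.random.default_rng(2)
post = rng.random((300, 50)); post /= post.sum(axis=1, keepdims=True)          # toy posteriorgram
native_means = rng.random((10, 50)); native_means /= native_means.sum(axis=1, keepdims=True)
print(avg_max_posterior(post), avg_posterior_gap(post, native_means))
```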

Automated Melodic Measures

As reviewed earlier, the previous literature has developed and adopted automatic analyses of the varied use of pitch and amplitude in L2 speech (e.g., van Santen et al., Reference van Santen, Prud’hommeaux and Black2009). To assess the melodic variation in participants’ speech production, the software openSMILE (“open-source Speech & Music Interpretation by Large-space Extraction”) was used (Eyben et al., Reference Eyben, Wöllmer and Schuller2010). Given that listeners rely on both pitch and intensity contours to perceive English melody (e.g., Lieberman, Reference Lieberman1960), two directly relevant measures were adopted in the current study, that is, interquartile range values for the F0 envelope and loudness (a short sketch follows the list):

  (5) Pitch variation: After extracting the fundamental frequencies (pitch) of given utterances and sorting the frequency values, the difference between the 75th and 25th percentiles was calculated. The value indicated the magnitude of pitch variation or dynamics in the utterances. Monotonous speech (typical of beginner-to-intermediate L2 speakers) is generally characterized by smaller pitch variation.

  (6) Intensity variation: Similar to pitch variation, loudness was calculated as sequential data, defined as normalized intensity raised to a power of 0.3. After sorting the intensity values, the difference between the 75th and 25th percentiles was calculated. A lack of intensity variation (and/or pitch variation) could help identify certain L2 English speakers who fail to distinguish between stressed and unstressed vowels.
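Both measures reduce to an interquartile range over per-frame contours; a short sketch, assuming the f0 and loudness arrays have already been extracted (e.g., by openSMILE; the toy values are ours):

```python
import numpy as np

def iqr_variation(values: np.ndarray) -> float:
    # Difference between the 75th and 25th percentiles of a contour.
    return float(np.percentile(values, 75) - np.percentile(values, 25))

f0 = np.array([110.0, 120.0, 95.0, 140.0, 130.0])  # toy voiced-frame pitch values (Hz)
loudness = np.array([0.4, 0.6, 0.5, 0.8, 0.7])     # toy normalized intensity ** 0.3
pitch_variation = iqr_variation(f0)
intensity_variation = iqr_variation(loudness)
```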

Results

Listeners’ Judgments of Comprehensibility

The first objective of the analyses was to examine how 10 L1 listeners perceived the comprehensibility of 90 Japanese speakers (inexperienced, moderately experienced, and highly experienced) and 10 L1 speakers. As stated earlier, these groups were designed to represent various levels of comprehensibility among inexperienced/experienced L2 speakers and L1 speakers. In terms of interrater reliability, the listeners demonstrated medium-to-strong agreement (r = .760). The strength of the correlations varied widely, from r = .569 between Listeners 3 and 7 to r = .832 between Listeners 6 and 8 (see Table 2). As in many L2 comprehensibility studies (e.g., Munro & Derwing, Reference Munro and Derwing1995), the rating scores were averaged across listeners to derive one comprehensibility score for each speaker as an index of the listeners’ overall ratings. According to the results of a Kolmogorov–Smirnov test, the averaged comprehensibility scores did not significantly differ from a normal distribution (D = .086, p = .096). While all the participants in the L1 baseline group received 9 points, the L2 speakers’ comprehensibility scores demonstrated a great deal of individual variation (M = 5.01, SD = 1.81, Range = 1.3–8.5).

Table 2. Interclass correlations among 10 listeners’ comprehensibility judgments (Study 1)

Automatic Assessments of Comprehensibility

The second objective of this study was to examine the extent to which L1 listeners’ comprehensibility judgments can be tied to a set of automated phonological, melodic, and fluency measures (for descriptive statistics, see Table 3). The results of the normality test (Kolmogorov–Smirnov) demonstrated that whereas most of their fluency, phonological, and melodic scores were comparable to normal distribution (p > .05), their posterior gap scores were significantly different from normal distribution (D = .202, p < .001). Because a severe positive skewness was observed, the posterior gap scores were submitted to inverse transformation, resulting in the scores becoming normally distributed (D = .080, p = .515). As such, the distance measure linearly represents the degree of proximity (i.e., nativelikeness). Given that both the dependent (comprehensibility ratings) and predictor variables (fluency, phonological, and melodic measures) followed normal distribution, only linear models were considered in the subsequent statistical analyses.

Table 3. Descriptive statistics of automated measures

To examine how the automated assessments predicted different levels of comprehensibility (treated as ordinal values), a multiple regression model was constructed using the lm function in the R statistical environment (The R Foundation, n.d.). The model comprised the listeners’ averaged comprehensibility scores as the dependent variable and the six automated measures (articulation rate, pause ratio, maximum posterior probabilities, inversed posterior gaps, pitch variability, and intensity variability) as predictors. As for model selection, a “full model” was chosen to maximize the predictive power of the model (with all six predictors entered in the equation). In the current analyses, we did not choose stepwise models (backward or forward selection based on the results of F tests) for the following reasons. First, the multiple comparisons involved inflate the rates of Type I and Type II errors. Second, stepwise models are prone to overfitting the data and underestimating the degrees of freedom (for guidelines on the use of multiple regression analyses in applied linguistics research, see Larson-Hall, Reference Larson-Hall2010).

As summarized in Table 4, the following model was tested: Listeners’ averaged comprehensibility scores = Intercept + articulation rate + pause ratio + maximum posterior probabilities + inversed posterior gaps + pitch variability + intensity variability. The full model significantly explained 67.7% of the variance in the listeners’ comprehensibility judgments, F(6, 93) = 32.507, p < .001, without any clear evidence of multicollinearity problems (Variance Inflation Factor [VIF] = 1.016–1.456). In terms of the standardized β values, perceived comprehensibility was predicted primarily by segmental quality (β = .607 for posterior gaps, β = .267 for max posterior probabilities), secondarily by temporal quality (β = –.193 for pause ratio, β = .123 for articulation rate), and finally by melodic quality (β = .113 for intensity variation, β = –.093 for pitch variation).

Table 4. Results of multiple regression analysis using automated measures as predictors of listeners’ comprehensibility scores

The predicted comprehensibility scores were calculated based on the regression model’s coefficients in Table 4. For each predictor variable, the raw predictor values (i.e., articulation rate, pause ratio, maximum posterior probabilities, inversed posterior gaps, pitch variability, and intensity variability) were multiplied by their unstandardized B coefficients (i.e., 0.524, –3.920, 19.427, 0.262, –0.006, and 1.137, respectively). Then, the constant (i.e., –15.371) was added to the total (i.e., the sum of all weighted predictor values). Because one L1-speaking participant’s predicted score was 9.02, it was adjusted to the scale’s upper limit of 9.0.Footnote 3 The Pearson correlation between the predicted and human comprehensibility scores was relatively strong, r = .823, p < .001 (see Figure 2). This figure was higher than the 10 listeners’ averaged interrater agreement (r = .760 among Listeners 1–10), and comparable to that of Listeners 6 and 8, who demonstrated the strongest agreement (r = .832 for Listeners 6 vs. 8). According to the results of an independent t-test, the predicted and human comprehensibility scores did not significantly differ, t = .002, p = .998, d = 0.01.
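In code, the scoring rule reduces to a weighted sum using the unstandardized coefficients reported above; the predictor values in the example call are illustrative, not taken from the dataset:

```python
# Unstandardized coefficients and intercept from Table 4.
B = {"articulation_rate": 0.524, "pause_ratio": -3.920,
     "max_posterior_prob": 19.427, "inversed_posterior_gap": 0.262,
     "pitch_variation": -0.006, "intensity_variation": 1.137}
INTERCEPT = -15.371

def predict_comprehensibility(x: dict) -> float:
    score = INTERCEPT + sum(B[name] * value for name, value in x.items())
    return min(score, 9.0)  # cap at the 9-point scale maximum, as described above

example = {"articulation_rate": 3.2, "pause_ratio": 0.4,          # hypothetical speaker
           "max_posterior_prob": 0.85, "inversed_posterior_gap": 2.0,
           "pitch_variation": 40.0, "intensity_variation": 0.6}
print(round(predict_comprehensibility(example), 2))  # ~2.22 on the 9-point scale
```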

Figure 2. The relationship between human comprehensibility scores and predicted comprehensibility scores (r = .823).

Study 2: Model Validation Phase

Overall, Study 1 demonstrated (a) that machine scoring can successfully simulate human listeners’ comprehensibility judgments (r = .823 for machine vs. humans; r = .760 for humans vs. humans); and (b) that such automated comprehensibility assessments are primarily determined by phonological accuracy (β = .607 for inversed posterior gaps; β = .267 for max posterior probabilities) and secondarily by fluency features (β = –.193 for pause ratio; β = .123 for articulation rate). These results are in line with the existing literature on the relative weights of phonological accuracy over fluency in L2 comprehensibility (see Isaacs & Trofimovich, Reference Isaacs and Trofimovich2012). Interestingly, none of the melodic measures (pitch and intensity variation) turned out to be significant predictors. This could be due to several reasons. First, the timed picture description task used in the current study may not have elicited sufficiently long sentences, limiting speakers’ opportunities to demonstrate varied and accurate use of melodic cues. Second, the relationship between pitch height measures (including pitch range) and L2 comprehensibility may be minimal (β = –.10 in Kang et al., Reference Kang, Rubin and Pickering2010). Third, listeners orchestrate a range of different melodic cues (beyond pitch and intensity) to perceive, identify, and encode the targetlikeness of lexical and sentence stress in English (Lieberman, Reference Lieberman1960). Study 2 was designed to test the extent to which the automated assessments can predict different levels of L2 comprehensibility when the regression formula in Table 4 is applied to a new L2 speech dataset (N = 45 timed picture descriptions).

Method

Speakers

A total of 40 Japanese learners of English were recruited from Japan and Canada (17 males, 23 females). Thirty of them were university-level students in Japan (M age = 19.8 years; Range = 18–26 years). Their scores on a general English proficiency test (Test of English for International Communication; TOEIC) suggested that their L2 oral proficiency varied widely (M = 681.5 out of 990; Range = 300–980), covering basic to proficient users. The other 10 participants were recruited in Vancouver and Ontario (M age = 41.4 years; Range = 35–47 years). All of them were late L2 speakers (M age of arrival = 27.3 years; Range = 21–35 years) who used English regularly and had a great deal of immersion experience in Canada (M length of immersion = 13.1 years; Range = 7–24 years). All of them could be considered highly functional and regular L2 users, as they reported English rather than Japanese as their primary language of communication in home and/or work settings. Finally, five L1 speakers of Canadian English were included. The biographical profiles of the participants in Study 2 can be considered comparable to those in Study 1, and the participants were believed to represent different levels of comprehensibility.

Speaking Task

The participants’ spontaneous speech in English was elicited using the same task used in Study 1 (timed picture description). The first 10 seconds of three picture descriptions were stored in a single WAV file per participant and used for the comprehensibility judgments.

Comprehensibility Judgments

A total of five L1 English speakers were recruited in London, England (M age = 20.6 years). Using the same rating procedure as in Study 1, they participated in individual rating sessions with a researcher using a video-conferencing tool. First, they received a brief summary of the project, together with explanations of the nature of the dataset, the definition of comprehensibility, and the rating procedure. Next, to familiarize themselves with the procedure, they rated the same practice samples used in Study 1, explained their decisions, and received feedback from the researcher. Finally, they proceeded to make comprehensibility judgments of the main dataset (40 minutes without any intermission).

Automated Speech Analyses

The speech data were submitted to the same automated fluency (articulation rate, pause ratio), phonological (maximum posterior probabilities, posterior gaps to natives), and melodic analyses (pitch and intensity variability) as in Study 1.

Results

Similar to Study 1, the averaged interrater agreement among the five listeners was relatively high, r = .789, ranging from r = .718 (Listeners 1 vs. 2) to .871 (Listeners 2 vs. 3; summarized in Table 6). The comprehensibility scores were averaged per speaker to represent the human listeners’ overall comprehensibility judgments (M = 3.67; SD = 2.26; Range = 1–9). According to a Kolmogorov–Smirnov test, the resulting scores did not significantly differ from a normal distribution, D = .160, p = .179. To check the degree to which the comprehensibility scores in Study 2 (n = 45) differed from those in Study 1 (n = 100), a nonparametric test (Mann–Whitney) was conducted. The results showed that the participants in Study 2 (M = 3.67) received lower comprehensibility scores than those in Study 1 (M = 5.40), Z = –4.429, p < .001. Thus, although efforts were made to match them, the two datasets were not fully comparable. It is noteworthy, however, that the listeners differed between Studies 1 and 2 (n = 10 American listeners vs. 5 British listeners). Thus, we consider this a tentative pattern.

Table 6. Interclass correlations among five listeners’ comprehensibility judgments (Study 2)

Using the regression formula in Table 4, the subsequent analyses set out to examine the extent to which the six automated scores (summarized in Supporting Information C) could predict the human comprehensibility scores. To this end, the raw posterior gap scores were inversed. All the temporal, phonological, and melodic scores followed normal distribution (D = .082–.143, p = .281–.896). The predicted comprehensibility scores varied between 1.82 and 7.17 (M = 4.13, SD = 1.23).

As visually summarized in Figure 3, the correlation between human and predicted comprehensibility scores was relatively strong, r = .827 (p < .001), and comparable to the averaged correlation coefficients among the five listeners (r = .789). An independent t-test did not detect a significant difference between the predicted comprehensibility scores and the human listeners’ averaged comprehensibility scores, t = –1.186, p = .239, d = 0.252. In sum, the results indicated (a) that the automated measures successfully simulated the human listeners’ comprehensibility scores and (b) that the difference between the predicted and human listener scores was minimal.

Figure 3. The relationship between human comprehensibility scores and predicted comprehensibility scores (r = .827).

Study 3: Generalization Study

Study 2 provided some empirical support for the predictive power of the regression model for human listeners’ comprehensibility scores. Notably, the findings thus far have been based on a relatively structured, semispontaneous speech task (timed picture description) wherein the linguistic content of each sample is likely to be highly predictable (e.g., all participants used the same word prompts to describe the same pictures). In Study 3, we aimed to examine the extent to which the regression model can be applied when speech is elicited using a more extemporaneous, freely constructed task.

Speakers and Listeners

The same speakers and listeners from Study 2 (40 Japanese learners of English in Japan and Canada; 5 L1 speakers of English; 5 L1-speaking listeners) participated in Study 3.

Speaking Task

The participants engaged in an oral interview task which was created by the research team based on the procedure in the IELTS interview task (Crowther et al., Reference Crowther, Trofimovich, Isaacs and Saito2015). Given that the task format gave participants a certain level of freedom over the content and organization of their speech (rather than describing pictures), participants were encouraged to produce extemporaneous speech with their primary focus on conceptualization (what to say) rather than accuracy (how to say it). In accordance with Skehan’s (Reference Skehan1998) task taxonomy, the oral interview task can be considered less structured, and more informal and personal. In such tasks, speakers likely use more contextually rich, varied, and idiosyncratic language. This sharply contrasts with the timed picture description task, wherein participants are likely focused on accurately and fluently using language to describe the provided content in the most effective and efficient manner.

First, the participants were given a familiar and personal topic to discuss (i.e., What was the hardest and toughest change in your life?). To help speakers produce long and meaningful speech, a set of possible discussion points (e.g., Why was it so challenging?) was also provided on a topic card. After the participants spent one minute planning, they spoke for approximately two minutes. Finally, the researcher asked one or two follow-up questions in response to the content of their speech (e.g., What did you learn from the experience?). For the task prompts, see Supporting Information D. The first 30 seconds of each sample were excised and saved as a separate WAV file.
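Trimming each recording to its first 30 seconds is straightforward with a general-purpose audio library; the snippet below is a sketch using the soundfile package, with placeholder file paths (the original preprocessing tool is not specified here).

```python
# Excising the first 30 seconds of a recording (illustrative sketch;
# file paths are placeholders).
import soundfile as sf

def trim_to_30s(in_path: str, out_path: str, seconds: float = 30.0) -> None:
    data, sr = sf.read(in_path)          # audio samples and sampling rate
    data = data[: int(seconds * sr)]     # keep only the first 30 seconds
    sf.write(out_path, data, sr)

trim_to_30s("speaker01_interview.wav", "speaker01_interview_30s.wav")
```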

Comprehensibility Judgments

After the five listeners completed the comprehensibility judgments of the 45 picture description samples in Study 2, they assessed the comprehensibility of 45 interview samples on a 9-point scale using the same procedure as in Studies 1 and 2.

Automated Speech Analyses

The speech data were analyzed using the same automated fluency, phonological, and melodic measures as in Studies 1 and 2.

Results

Similar to Study 1 (r = .760) and Study 2 (r = .789), the averaged interrater agreement among the five listeners in Study 3 was relatively high, r = .756. As summarized in Table 7, the strength of the agreement between individual listeners varied from “mid” (r = .655 for Listeners 4 vs. 5) to “strong” (r = .869 for Listeners 2 vs. 3). To obtain the listeners’ consensus, their comprehensibility scores were averaged per speaker (M = 4.47, SD = 2.05, Range = 1–9). The normality test (Kolmogorov–Smirnov) found the resulting comprehensibility scores to be indistinguishable from a normal distribution, D = .103, p = .685. To calculate the predicted comprehensibility scores, the regression formula in Table 5 was used. Following Studies 1 and 2, the raw maximum posterior gap scores were inversely transformed. The normality test (Kolmogorov–Smirnov) demonstrated that all the automated scores (articulation rate, pause frequency, maximum posterior probabilities and gaps, and pitch and intensity variability) followed a normal distribution (D = .052–.117, p = .527–.992). The predicted comprehensibility scores varied between 2.12 and 8.85 (M = 4.33, SD = 1.73).

Table 7. Interclass correlations among five listeners’ comprehensibility judgments (Study 3)

As visually displayed in Figure 4, the predicted comprehensibility scores were strongly associated with the human listeners’ comprehensibility scores, r = .809 (p < .001). This correlation coefficient was comparable to the averaged interrater coefficient among the five listeners (r = .756). According to the results of an independent t-test, the difference between the human and predicted comprehensibility scores did not reach statistical significance, t = –0.360, p = .719, d = 0.07. In sum, the findings indicated that the automated measures can closely simulate human judgments of different levels of comprehensibility not only when speech is elicited using a relatively structured task (picture description) but also when the regression model is applied to free, extemporaneous speech (oral interview).

Figure 4. Relationship between human comprehensibility scores and predicted comprehensibility scores (r = .809).

Discussion

Based on a speech corpus of 90 Japanese speakers of English in Japan and North America and 10 L1 speakers of English, Study 1 showed the extent to which L1 listeners’ comprehensibility ratings could be simulated by a set of automated fluency (articulation rate, pause ratio), phonological (posterior probabilities and gaps), and melodic (pitch and intensity variation) measures. The composite model’s scores significantly predicted human assessments of comprehensibility, and the strength of the correlation (r = .823 for machine vs. humans) was comparable to the interrater agreement among the 10 naïve listeners (r = .760 for humans vs. humans). In Studies 2 and 3, we further confirmed the validity and generalizability of the model, as it strongly predicted L2 comprehensibility among a different group of Japanese speakers whose L2 English speech was elicited not only via the same picture description task (r = .827 for machine vs. humans) but also via a more freely constructed oral interview task (r = .809 for machine vs. humans).
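As a concrete illustration of the fluency side of the model, the sketch below derives an articulation rate and a pause ratio from one recording using energy-based silence detection. This is a generic approximation of ours, not the paper’s exact pipeline: librosa’s silence threshold is an assumption, and a precomputed syllable count (e.g., from de Jong & Wempe’s, 2009, automated syllable-nuclei detector) is taken as an input.

```python
# Generic fluency measures from one recording (approximate sketch; the
# paper's own measurement pipeline is not reproduced here).
import librosa

def fluency_measures(wav_path: str, n_syllables: int, top_db: float = 30.0):
    y, sr = librosa.load(wav_path, sr=None)
    total_dur = len(y) / sr
    # Non-silent stretches, detected from signal energy (threshold assumed)
    voiced = librosa.effects.split(y, top_db=top_db)
    phonation_time = sum(int(end - start) for start, end in voiced) / sr
    articulation_rate = n_syllables / phonation_time  # syllables per second of speech
    pause_ratio = 1.0 - phonation_time / total_dur    # proportion of silence
    return articulation_rate, pause_ratio
```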

First and foremost, the findings supported the emerging evidence for the possibility, reliability, and validity of machine assessments of spontaneous speech samples (e.g., r = .799 in Fu et al., Reference Fu, Chiba, Nose and Ito2020; r = .77 in Chen et al., Reference Chen, Qian and Yu2018). Furthermore, our study indicated that such automatic scoring can be used to simulate naïve listeners’ intuitive judgments of L2 speech comprehensibility, an anchor of communicative success among English speakers in global contexts. We would like to stress the importance of using fluency-based measures and DNN-based instruments in a complementary fashion. Because the fluency, phonological, and melodic measures tap into different dimensions of L2 speech without notable multicollinearity (VIF < 1.5), each makes a separate contribution that allows the model to capture the multifaceted nature of L2 comprehensibility.
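The multicollinearity check mentioned here (VIF < 1.5) can be run on the predictor matrix as follows; the sketch assumes the six automated measures are stored as columns of a pandas DataFrame with names of our choosing.

```python
# Variance inflation factors for the automated predictors (illustrative).
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(predictors: pd.DataFrame) -> pd.Series:
    X = sm.add_constant(predictors)  # VIF computation requires an intercept column
    vifs = {col: variance_inflation_factor(X.values, i)
            for i, col in enumerate(X.columns) if col != "const"}
    return pd.Series(vifs).sort_values(ascending=False)
```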

As shown in previous studies (e.g., Isaacs & Trofimovich, Reference Isaacs and Trofimovich2012), it is important to note that listeners’ comprehensibility scores were uniquely related to various areas of language with relatively different weights (in order of importance: phonetic accuracy > fluency > melodic variation). This in turn indicates that comprehensibility is a complex construct, as listeners attend to all the available linguistic information in foreign-accented speech to decipher as much meaning as possible, and it supports the importance of including conceptually and methodologically diverse measures (phonological, fluency, and melodic). It is possible that the relatively strong predictive power of the current findings (r = .823) can be ascribed to the fact that we adopted an integrative approach (combining fluency, phonological, and melodic measures) rather than the single-domain models typically found in previous studies (e.g., Fu et al., Reference Fu, Chiba, Nose and Ito2020 for phonological measures; Ginther et al., Reference Ginther, Dimova and Yang2010 for fluency measures).

Given that the current study took a first step toward examining the potential of automated assessments of L2 comprehensibility using fluency, phonological, and melodic measures in a complementary fashion, a set of promising directions needs to be addressed in future extension studies. First, the findings of the current study (and those of the existing research in Table 1) relied exclusively on correlation analyses under the assumption that the relationship between human and machine scoring is linear. Future studies should further examine the nature of this relationship (e.g., linear vs. quadratic).
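One simple way to probe this question is to fit competing polynomial models and compare their fit; the sketch below contrasts a linear and a quadratic fit by adjusted R² (array names are placeholders of ours).

```python
# Comparing a linear and a quadratic fit of human scores on machine scores.
import numpy as np

def adjusted_r2(y: np.ndarray, y_hat: np.ndarray, n_params: int) -> float:
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    n = len(y)
    return 1 - (1 - r2) * (n - 1) / (n - n_params - 1)

def compare_fits(machine: np.ndarray, human: np.ndarray) -> dict:
    results = {}
    for degree in (1, 2):  # degree 1 = linear, degree 2 = quadratic
        coefs = np.polyfit(machine, human, deg=degree)
        preds = np.polyval(coefs, machine)
        results[degree] = adjusted_r2(human, preds, n_params=degree)
    return results
```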

Second, although the current study exclusively focused on Japanese speakers of English, it would be intriguing to replicate the findings in the context of different L1–L2 pairings. For example, Crowther et al. (Reference Crowther, Trofimovich, Isaacs and Saito2015) showed that L1 listeners attended to different linguistic dimensions while assessing the comprehensibility of L2 English speech among L1 Chinese speakers (segmentals), L1 Hindi speakers (intonation), and L1 Farsi speakers (segmentals, word stress, fluency). Saito and Akiyama (Reference Saito and Akiyama2017) demonstrated that L1 listeners likely prioritized melodic rather than segmental information during comprehensibility judgments of L2 Japanese speech. Given that the linguistic weights of perceived comprehensibility vary across different L1 and L2 contexts, further research is necessary to determine the extent to which the combination of automated fluency, phonological, and melodic measures featured in the current study (articulation rate, pause ratio, posterior probabilities/gaps, pitch/intensity variation) can be applied to other large-scale datasets.

Third, although the current study included temporal, phonological, and melodic measures within the same model, there is a growing amount of research showing that naïve listeners take into account not only pronunciation but also lexicogrammatical information while making L2 comprehensibility judgments (Isaacs & Trofimovich, Reference Isaacs and Trofimovich2012). In fact, previous work has demonstrated that listeners rely primarily on phonological and temporal information (40–50% of the variance) and secondarily on lexicogrammar (10–20% of the variance; Crowther et al., Reference Crowther, Trofimovich, Isaacs and Saito2015). There is also growing evidence for the relative importance of certain vocabulary features, such as lexical appropriateness rather than richness (Isaacs & Trofimovich, Reference Isaacs and Trofimovich2012) and collocational association rather than frequency (Saito, Reference Saito2020). One promising direction is to add automated word recognition measures to further increase the predictive power of the model developed for the current study (see Kyle & Crossley, Reference Kyle and Crossley2015 for a discussion of automated assessments of L2 spoken vocabulary).

Fourth, although efforts were made to include a range of fluency, melodic, and phonological measures in the model, other key automated measures remain unexplored and should be highlighted in future studies to further improve the predictive power of automated L2 speech assessment. For example, Fu et al. (Reference Fu, Chiba, Nose and Ito2020) conducted posteriorgram-based analyses of participants’ L1 and L2 speech data. Unlike the current study (which used only participants’ L2 speech data), future work should compare the distance between the posteriorgram distributions of participants’ L1 and L2 speech. On a related note, the results of the multiple regression analyses in the current study found that neither of the melodic variation measures (i.e., pitch and intensity variation) made a significant contribution to the predicted comprehensibility scores. Given that there is ample research evidence showing that melodic accuracy is strongly associated with L2 comprehensibility (e.g., Isaacs & Trofimovich, Reference Isaacs and Trofimovich2012), future studies should include both melodic variation and accuracy measures. As we acknowledged in the “Literature Review” section, however, it is important to remember (a) that the existing literature (including the current investigations) has exclusively focused on the variation of melody and (b) that few studies have probed the automatic analysis of the accurate use of melody, especially in spontaneous speech (where transcripts and model speech are unavailable). Thus, we may need to wait for future research to develop, validate, and refine automated analyses of how L2 speakers use various types of melodic cues while marking word and sentence stress in a contextually appropriate manner.
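To illustrate the direction suggested by Fu et al. (Reference Fu, Chiba, Nose and Ito2020), the sketch below computes one possible distance, a symmetrized Kullback–Leibler divergence, between the time-averaged phone posterior distributions of a speaker’s L1 and L2 productions. This is a simplified stand-in of ours, not Fu et al.’s actual posteriorgram analysis.

```python
# One possible L1-L2 posteriorgram distance (a simplified stand-in, not
# Fu et al.'s implementation): symmetrized KL divergence between the
# time-averaged phone posterior distributions of two recordings.
import numpy as np
from scipy.stats import entropy

def posterior_distance(post_l1: np.ndarray, post_l2: np.ndarray) -> float:
    """post_l1/post_l2: (frames, phones) posteriorgrams from an acoustic model."""
    p = post_l1.mean(axis=0) + 1e-10   # average over frames; smooth zeros
    q = post_l2.mean(axis=0) + 1e-10
    p, q = p / p.sum(), q / q.sum()
    return 0.5 * (entropy(p, q) + entropy(q, p))
```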

Fifth, we acknowledge that the findings in the current investigation were based on a series of cross-sectional analyses. As Knoch and Chapelle (Reference Knoch and Chapelle2018) argued, assessment validation entails much more than correlational examination. The question of whether machine assessments of spontaneous speech are suitable for making fine-grained decisions thus remains open. One promising direction for future replication studies concerns a longitudinal investigation of the development of comprehensibility among L2 speakers over time. There is evidence that L2 speech learning takes place on a continuum of comprehensibility in relation to increased immersion experience (e.g., Derwing & Munro, Reference Derwing and Munro2013). Thus, researchers can track how L2 learners’ speech behaviors change over a certain period (1–5 years of immersion), how listeners perceive such changes in terms of comprehensibility, and how automated measures can replicate the developmental trajectories of comprehensibility among the same individuals.

Sixth, a reviewer pointed out the significance of the predicted scores calculated by the automated measures. As seen in Figure 3, for example, although native English speakers received the highest comprehensibility scores from human raters (i.e., 9), their predicted comprehensibility scores varied to a great extent (dropping as low as 5). Here, we would like to remind readers that the correlations between the human and automated assessments of L2 comprehensibility were relatively strong but not perfect (r = .823, .827, and .809 in Studies 1, 2, and 3). In addition, it is important to remember that the interrater correlations among the human raters were medium-to-strong (r = .7–.8). This in turn suggests that it is difficult to obtain an absolute L2 comprehensibility rating for a particular speech sample because the same sample triggers different reactions (and thereby different comprehensibility scores) not only between machine and human raters but also among human raters.

Finally, many EFL students suffer from a lack of speech training opportunities wherein they can receive individualized feedback on their speaking skills, especially regarding comprehensibility (rather than nativelikeness). The current studies shed light on the pedagogical potential of automated assessment of the ecologically valid metric of comprehensibility. Such automated comprehensibility judgments can be used to guide individual students to develop their L2 comprehensibility in accordance with their own goals relative to their current proficiency levels. This will in turn allow learners to engage in more tailored, optimal, and autonomous L2 speech learning in the long run. For example, the descriptive statistics showed that Japanese speakers’ comprehensibility levels ranged from 1.3 to 8.5 (relative to native speakers, who scored 9 out of 9). Using these values as a rough benchmark for different levels of comprehensibility, inexperienced Japanese speakers can be encouraged to aim for moderately comprehensible speech, which corresponds to moderate L2 English proficiency (5–6 out of 9). Similarly, Japanese participants whose speech is already somewhat intelligible/comprehensible can aim to achieve more advanced, adequately comprehensible speech, which corresponds to highly advanced L2 English oral proficiency (7–8 out of 9). Importantly, students should be informed (a) that many L2 learners should aim for mid-to-high L2 comprehensibility to become functional, competent L2 users (6–8 out of 9) and (b) that comprehensibility and accentedness are separate constructs, such that even highly accented speech (e.g., 9 out of 9 for accentedness) can be adequately comprehensible. By adopting both quantitative and qualitative analyses, future studies are strongly recommended to further pursue how the use of automated assessment can enrich L2 speech training and facilitate development across different levels of L2 comprehensibility.
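For instance, a tutoring application could translate the model’s predicted score into the rough benchmark bands sketched above. The thresholds in the snippet below simply follow the bands named in this paragraph and are illustrative, not a validated scale.

```python
# Mapping a predicted comprehensibility score (1-9) to a rough feedback
# band. Thresholds follow the bands discussed above and are illustrative,
# not a validated scale.
def feedback_band(score: float) -> str:
    if score >= 7:
        return "adequately comprehensible (advanced oral proficiency)"
    if score >= 5:
        return "moderately comprehensible (keep refining fluency and phonology)"
    return "low comprehensibility (prioritize high-impact pronunciation work)"
```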

Conclusion

Given that eliciting listeners’ L2 speech assessments is time-consuming, some previous research has begun to examine the automated assessment of linguistically trained raters’ general speaking proficiency judgments. However, the analyses in these studies have been restricted to controlled speech samples and/or a limited set of speech measures (phonological fluency or accuracy). Using speech data from L1 and L2 speakers of English, the current study took a first step toward combining three different automated analyses (fluency, phonological, and melodic) to simulate how naïve listeners intuitively perceive the comprehensibility of spontaneous speech samples across different task conditions. Findings showed that the composite model can provide predicted comprehensibility scores (r = .809–.827 for machine vs. humans) that are comparable to what different naïve listeners likely agree on (r = .756–.789 for humans vs. humans). The current study thus supports the use of up-to-date automated assessment of L2 oral proficiency on the continuum of comprehensibility, which many researchers have suggested as an ecologically valid goal for adult L2 speakers and students (comprehensible rather than nativelike).

Acknowledgments

We gratefully acknowledge insightful comments from anonymous Studies in Second Language Acquisition reviewers on earlier versions of the manuscript. This research was supported by an ESRC-AHRC UK-Japan SSH Connections Grant (ES/S013024/1), a Leverhulme Research Grant (RPG-2019-039), and a Spencer Foundation Grant (202100074).

Supplementary Materials

To view supplementary material for this article, please visit http://doi.org/10.1017/S0272263122000080.

Data Availability Statement

The experiment in this article earned an Open Materials badge for transparent practices. The materials are available at https://osf.io/cm7bf/.

Footnotes

1 An interaction account of L2 acquisition would indicate that language development occurs when communication breaks down and interlocutors are required to negotiate for meaning (Mackey, Reference Mackey2012). In this article and elsewhere (e.g., Saito, Reference Saito2021), therefore, it has been argued that comprehensibility is an index of L2 development. Negotiation-for-meaning episodes likely take place when L2 speakers’ comprehensibility is low. Through more interaction and negotiation-for-meaning opportunities, learners with low L2 oral proficiency are pushed to work on the comprehensibility of their speech, even though they still sound foreign-accented. In this regard, it is reasonable to say that L2 speech acquisition takes place on a continuum of comprehensibility as learners’ speech becomes increasingly easier to understand despite their detectable foreign accentedness. To provide empirical support, Derwing and Munro’s (Reference Derwing and Munro2013) longitudinal dataset showed that the comprehensibility of L2 speakers continued to improve as a function of increased immersion over an extensive period (7 years) while their speech remained foreign-accented throughout (cf. Saito, Reference Saito2015a for cross-sectional evidence; Saito & Akiyama, Reference Saito and Akiyama2017 for a long-term training study). A reviewer pointed out that what matters is intelligibility rather than comprehensibility. Research has indeed shown that speech can be low in comprehensibility yet still intelligible (i.e., even if difficult to understand, speech can be accurately understood). Technically, it should only be speech that is low in intelligibility (which likely entails low comprehensibility) that would initiate a degree of negotiation of meaning. However, we would like to argue that negotiation for meaning can be triggered by comprehensibility and intelligibility alike. In the case of speech that is low in comprehensibility but still intelligible, interlocutors likely initiate a range of interactional moves such as clarification requests and repetition, both of which are termed “negotiation-for-meaning” strategies (Mackey, Reference Mackey2012). To our knowledge, few studies have explored precisely when negotiation for meaning occurs in accordance with different levels of interlocutors’ understanding (i.e., high vs. low comprehensibility and high vs. low intelligibility). This should be considered an intriguing direction for future research.

2 Whereas scholars initially used HMMs for more precise and reliable ASR, DNNs have become increasingly dominant in the more recent literature. For detailed accounts of HMM and DNN models, see Supporting Information A.

3 There was one predicted score beyond 9 (9.02). This is because the predictive power of the model (R2 = .677) was strong but not perfect. In addition, the model assumed a linear relationship between machine and human ratings without any breakpoints (i.e., linear rather than piecewise regression). Taking a close look at the participant in question, he was a native speaker of English whose speech was characterized by a substantially fast speech rate and a low pause ratio.

References

Anwyl-Irvine, A. L., Massonnié, J., Flitton, A., Kirkham, N., & Evershed, J. K. (2020). Gorilla in our midst: An online behavioral experiment builder. Behavior Research Methods, 52, 388–407. https://doi.org/10.3758/s13428-019-01237-x
Boersma, P., & Weenink, D. (2019). Praat: Doing phonetics by computer [Computer software]. http://www.praat.org/
Bongaerts, T., van Summeren, C., Planken, B., & Schils, E. (1997). Age and ultimate attainment in the pronunciation of a foreign language. Studies in Second Language Acquisition, 19, 447–465. https://doi.org/10.1017/S0272263197004026
Chen, Z., Qian, Y., & Yu, K. (2018). Sequence discriminative training for deep learning based acoustic keyword spotting. Speech Communication, 102, 100–111. https://doi.org/10.1016/j.specom.2018.08.001
Crowther, D., Trofimovich, P., Isaacs, T., & Saito, K. (2015). Does a speaking task affect second language comprehensibility? The Modern Language Journal, 99, 80–95. https://doi.org/10.1111/modl.12185
Cucchiarini, C., Strik, H., & Boves, L. (2000). Different aspects of expert pronunciation quality ratings and their relation to scores produced by speech recognition algorithms. Speech Communication, 30, 109–119. https://doi.org/10.1016/S0167-6393(99)00040-0
Cucchiarini, C., Strik, H., & Boves, L. (2002). Quantitative assessment of second language learners’ fluency: Comparisons between read and spontaneous speech. The Journal of the Acoustical Society of America, 111, 2862–2873. https://doi.org/10.1121/1.1471894
de Jong, N. H., & Wempe, T. (2009). Praat script to detect syllable nuclei and measure speech rate automatically. Behavior Research Methods, 41, 385–390. http://doi.org/10.3758/BRM.41.2.385
DeKeyser, R. M. (2013). Age effects in second language learning: Stepping stones toward better understanding. Language Learning, 63, 52–67. https://doi.org/10.1111/j.1467-9922.2012.00737.x
Derwing, T. M., & Munro, M. J. (1997). Accent, intelligibility, and comprehensibility: Evidence from four L1s. Studies in Second Language Acquisition, 19, 1–16. https://doi.org/10.1017/S0272263197001010
Derwing, T. M., & Munro, M. J. (2013). The development of L2 oral language skills in two L1 groups: A 7-year study. Language Learning, 63, 163–185. https://doi.org/10.1111/lang.12000
Duran-Karaoz, Z., & Tavakoli, P. (2020). Predicting L2 fluency from L1 fluency behavior: The case of L1 Turkish and L2 English speakers. Studies in Second Language Acquisition, 42, 671–695. https://doi.org/10.1017/S0272263119000755
Eyben, F., Wöllmer, M., & Schuller, B. (2010). openSMILE: The Munich versatile and fast open-source audio feature extractor. MM ’10: Proceedings of the 18th ACM International Conference on Multimedia, 1459–1462. https://doi.org/10.1145/1873951.1874246
Flege, J. E., & Bohn, O. S. (2021). The revised speech learning model (SLM-r). In Wayland, R. (Ed.), Second language speech learning: Theoretical and empirical progress (pp. 3–83). Cambridge University Press. https://doi.org/10.1017/9781108886901.002
Foote, J. A., Holtby, A. K., & Derwing, T. M. (2011). Survey of the teaching of pronunciation in adult ESL programs in Canada, 2010. TESL Canada Journal, 29, 1–22. https://doi.org/10.18806/tesl.v29i1.1086
Fu, J., Chiba, Y., Nose, T., & Ito, A. (2020). Automatic assessment of English proficiency for Japanese learners without reference sentences based on deep neural network acoustic models. Speech Communication, 116, 86–97. https://doi.org/10.1016/j.specom.2019.12.002
Ginther, A., Dimova, S., & Yang, R. (2010). Conceptual and empirical relationships between temporal measures of fluency and oral English proficiency with implications for automated scoring. Language Testing, 37, 379–399. https://doi.org/10.1177/0265532210364407
Hu, X., Ackermann, H., Martin, J. A., Erb, M., Winkler, S., & Reiterer, S. M. (2013). Language aptitude for pronunciation in advanced second language (L2) learners: Behavioural predictors and neural substrates. Brain and Language, 127, 366–376. https://doi.org/10.1016/j.bandl.2012.11.006
Isaacs, T., & Thomson, R. I. (2020). Reactions to second language speech: Influences of discrete speech characteristics, rater experience, and speaker first language background. Journal of Second Language Pronunciation, 6, 402–429. https://doi.org/10.1075/jslp.20018.isa
Isaacs, T., & Trofimovich, P. (2012). Deconstructing comprehensibility: Identifying the linguistic influences on listeners’ L2 comprehensibility ratings. Studies in Second Language Acquisition, 34, 475–505. https://doi.org/10.1017/S0272263112000150
Isaacs, T., Trofimovich, P., & Foote, J. A. (2017). Developing a user-oriented second language comprehensibility scale for English-medium universities. Language Testing, 35, 193–216. https://doi.org/10.1177/0265532217703433
Kachlicka, M., Saito, K., & Tierney, A. (2019). Successful second language learning is tied to robust domain-general auditory processing and stable neural representation of sound. Brain and Language, 192, 15–24. https://doi.org/10.1016/j.bandl.2019.02.004
Kang, O., & Johnson, D. (2018). The roles of suprasegmental features in predicting English oral proficiency with an automated system. Language Assessment Quarterly, 15, 150–168. https://doi.org/10.1080/15434303.2018.1451531
Kang, O., Rubin, D., & Pickering, L. (2010). Suprasegmental measures of accentedness and judgments of language learner proficiency in oral English. The Modern Language Journal, 94, 554–566. https://doi.org/10.1111/j.1540-4781.2010.01091.x
Kashiwagi, Y., Zhang, C., Saito, D., & Minematsu, N. (2016). Divergence estimation based on deep neural networks and its use for language identification. Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). https://doi.org/10.1109/ICASSP.2016.7472716
Kennedy, S., & Trofimovich, P. (2008). Intelligibility, comprehensibility, and accentedness of L2 speech: The role of listener experience and semantic context. The Canadian Modern Language Review, 64, 459–489. https://doi.org/10.3138/cmlr.64.3.459
Knoch, U., & Chapelle, C. A. (2018). Validation of rating processes within an argument-based framework. Language Testing, 35, 477–499. https://doi.org/10.1177/0265532217710049
Kyle, K., & Crossley, S. A. (2015). Automatically assessing lexical sophistication: Indices, tools, findings, and application. TESOL Quarterly, 49, 757–786. https://doi.org/10.1002/tesq.194
Larson-Hall, J. (2010). A guide to doing statistics in second language research using SPSS. Routledge.
Levis, J. (2018). Intelligibility, oral communication, and the teaching of pronunciation. Cambridge University Press. http://doi.org/10.1017/9781108241564
Lieberman, P. (1960). Some acoustic correlates of word stress in American English. The Journal of the Acoustical Society of America, 32, 451–454. https://doi.org/10.1121/1.1908095
Loewen, S., Crowther, D., Isbell, D. R., Kim, K. M., Maloney, J., Miller, Z. F., & Rawal, H. (2019). Mobile-assisted language learning: A Duolingo case study. ReCALL, 31, 293–311. https://doi.org/10.1017/S0958344019000065
Lyster, R., & Saito, K. (2010). Oral feedback in classroom SLA. Studies in Second Language Acquisition, 32, 265–302. https://doi.org/10.1017/S0272263109990520
Mackey, A. (2012). Input, interaction, and corrective feedback in L2 learning. Oxford University Press.
Moustroufas, N., & Digalakis, V. (2007). Automatic pronunciation evaluation of foreign speakers using unknown text. Computer Speech & Language, 21, 219–230. https://doi.org/10.1016/j.csl.2006.04.001
Muñoz, C. (2014). Exploring young learners’ foreign language learning awareness. Language Awareness, 23, 24–40. https://doi.org/10.1080/09658416.2013.863900
Munro, M., & Mann, V. (2005). Age of immersion as a predictor of foreign accent. Applied Psycholinguistics, 26, 311–341. https://doi.org/10.1017/S0142716405050198
Munro, M. J., & Derwing, T. M. (1995). Foreign accent, comprehensibility, and intelligibility in the speech of second language learners. Language Learning, 45, 73–97. https://doi.org/10.1111/j.1467-1770.1995.tb00963.x
Nagle, C. (2018). Motivation, comprehensibility, and accentedness in L2 Spanish: Investigating motivation as a time-varying predictor of pronunciation development. The Modern Language Journal, 102, 199–217. https://doi.org/10.1111/modl.12461
Nagle, C. L., & Rehman, I. (2021). Doing L2 speech research online: Why and how to collect online ratings data. Studies in Second Language Acquisition, 43, 916–939. https://doi.org/10.1017/S0272263121000292
Neumeyer, L., Franco, H., Digalakis, V., & Weintraub, M. (2000). Automatic scoring of pronunciation quality. Speech Communication, 30, 83–93. https://doi.org/10.1016/S0167-6393(99)00046-1
Nishino, T., & Watanabe, M. (2008). Communication-oriented policies versus classroom realities in Japan. TESOL Quarterly, 42, 133–138. https://doi.org/10.1002/j.1545-7249.2008.tb00214.x
O’Brien, M. G., Derwing, T. M., Cucchiarini, C., Hardison, D. M., Mixdorff, H., Thomson, R. I., Strik, H., Levis, J. M., Munro, M. J., Foote, J. A., & Levis, G. M. (2018). Directions for the future of technology in pronunciation research and teaching. Journal of Second Language Pronunciation, 4, 182–207. https://doi.org/10.1075/jslp.17001.obr
Pennington, M. C. (2021). Teaching pronunciation: The state of the art 2021. RELC Journal, 52, 3–21. https://doi.org/10.1177/00336882211002283
Pennycook, A. (2017). Translanguaging and semiotic assemblages. International Journal of Multilingualism, 14, 269–282. https://doi.org/10.1080/14790718.2017.1315810
Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., Silovsky, J., Stemmer, G., & Vesely, K. (2011). The Kaldi speech recognition toolkit. Proceedings of the IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, Hilton Waikoloa Village, Big Island, Hawaii, US. https://infoscience.epfl.ch/record/192584
Saito, K. (2015a). Experience effects on the development of late second language learners’ oral proficiency. Language Learning, 65, 563–595. https://doi.org/10.1111/lang.12120
Saito, K. (2015b). The role of age of acquisition in late second language oral proficiency attainment. Studies in Second Language Acquisition, 37, 713–743. https://doi.org/10.1017/S0272263115000248
Saito, K. (2020). Multi- or single-word units? The role of collocation use in comprehensible and contextually appropriate second language speech. Language Learning, 70, 548–588. https://doi.org/10.1111/lang.12387
Saito, K. (2021). What characterizes comprehensible and native-like pronunciation among English-as-a-second-language speakers? Meta-analyses of phonological, rater, and instructional factors. TESOL Quarterly, 55, 866–900. https://doi.org/10.1002/tesq.3027
Saito, K., & Akiyama, Y. (2017). Linguistic correlates of comprehensibility in second language Japanese speech. Journal of Second Language Pronunciation, 3, 199–217. https://doi.org/10.1075/jslp.3.2.02sai
Saito, K., & Hanzawa, K. (2016). Developing second language oral ability in foreign language classrooms: The role of the length and focus of instruction and individual differences. Applied Psycholinguistics, 37, 813–840. https://doi.org/10.1017/S0142716415000259
Saito, K., Tran, M., Sun, H., Magne, V., & Ilkan, M. (2019). How do second language listeners perceive the comprehensibility of foreign-accented speech? Studies in Second Language Acquisition, 41, 1133–1149. https://doi.org/10.1017/S0272263119000226
Saito, K., Trofimovich, P., & Isaacs, T. (2017). Using listener judgments to investigate linguistic influences on L2 comprehensibility and accentedness: A validation and generalization study. Applied Linguistics, 38, 439–462. https://doi.org/10.1093/applin/amv047
Scales, J., Wennerstrom, A., Richard, D., & Wu, S. H. (2006). Language learners’ perceptions of accent. TESOL Quarterly, 40, 715–738. https://doi.org/10.2307/40264305
Shen, Y., Yasukagawa, A., Saito, D., Minematsu, N., & Saito, K. (2021, January). Optimized prediction of fluency of L2 English based on interpretable network using quantity of phonation and quality of pronunciation. In 2021 IEEE Spoken Language Technology Workshop (SLT) (pp. 698–704). IEEE. https://doi.org/10.1109/SLT48900.2021.9383458
Skehan, P. (1998). Task-based instruction. Annual Review of Applied Linguistics, 18, 268–286. https://doi.org/10.1017/S0267190500003585
Sung, C. C. M. (2016). Does accent matter? Investigating the relationship between accent and identity in English as a lingua franca communication. System, 60, 55–65. https://doi.org/10.1016/j.system.2016.06.002
Suzuki, S., & Kormos, J. (2020). Linguistic dimensions of comprehensibility and perceived fluency: An investigation of complexity, accuracy, and fluency in second language argumentative speech. Studies in Second Language Acquisition, 42, 143–167. https://doi.org/10.1017/S0272263119000421
Suzukida, Y., & Saito, K. (2019). Which segmental features matter for successful L2 comprehensibility? Revisiting and generalizing the pedagogical value of the functional load principle. Language Teaching Research, 25, 431–450. https://doi.org/10.1177/1362168819858246
Tavakoli, P., & Skehan, P. (2005). Strategic planning, task structure and performance testing. In Ellis, R. (Ed.), Planning and task performance in a second language (pp. 239–273). John Benjamins. https://doi.org/10.1075/lllt.11.15tav
The R Foundation. (n.d.). The R project for statistical computing. Retrieved March 13, 2022, from https://www.r-project.org/
Trofimovich, P., Isaacs, T., Kennedy, S., Saito, K., & Crowther, D. (2016). Flawed self-assessment: Investigating self- and other-perception of second language speech. Bilingualism: Language and Cognition, 19, 122–140. https://doi.org/10.1017/S1366728914000832
van Santen, J. P. H., Prud’hommeaux, E. T., & Black, L. M. (2009). Automated assessment of prosody production. Speech Communication, 51, 1082–1097. https://doi.org/10.1016/j.specom.2009.04.007
Zechner, K., & Evanini, K. (Eds.). (2020). Automated speaking assessment: Using language technologies to score spontaneous speech. Routledge.
Zechner, K., Higgins, D., Xi, X., & Williamson, D. M. (2009). Automatic scoring of non-native spontaneous speech in tests of spoken English. Speech Communication, 51, 883–895. https://doi.org/10.1016/j.specom.2009.04.009
Zuniga, M., & Simard, D. (2019). Factors influencing L2 self-repair behavior: The role of L2 proficiency, attentional control and L1 self-repair behavior. Journal of Psycholinguistic Research, 48, 43–59. https://doi.org/10.1007/s10936-018-9587-2