Introduction
Currently, most nonnative users of English primarily use the language as a means of international communication. As the vast majority speaks English with some degree of nonnative accent, research on international intelligibility has flourished during the past 2 decades. Pedagogical suggestions from this field have mostly centered on aspects of sound production and perception—that is, on equipping learners with a core of pronunciation features necessary for successful international communication and improving their receptive and productive phonological accommodation skills (e.g., Deterding, 2013; Jenkins, 2000, 2002; Walker, 2010)—for example, through increased exposure to nonnative accents. These suggestions arose from a considerable body of research that until now has strongly focused on the role of particular pronunciation features for mutual understanding in international contexts (e.g., Deterding, 2013; Gardiner, 2019; Jenkins, 2000, 2002; Kang, Thomson, et al., 2020). That is, the issue of international intelligibility has been approached primarily from a bottom-up perspective on the listening process, the focus of attention being the quality of the acoustic signal. What has received comparatively little attention in research on international intelligibility so far is the role of top-down influences in the form of linguistic co-text and extralinguistic context. Co-text refers to the “verbal environment” of an utterance—that is, “the accompanying text” (Halliday, 1999, p. 3)—and relates to the linguistic relations encoded in a particular language system (Widdowson, 2011)—for example, in the form of collocations and lexical or morphosyntactic relations. Context refers to “the extralinguistic circumstances in which language is produced” (Widdowson, 2011, p. 221) and, besides the immediate situational context, also involves the sociocultural context that language users bring with them (Widdowson, 2011). The importance of co-textual and contextual information for linguistic comprehension is generally acknowledged by researchers: It is well established that both L1 and L2 listening are cognitively interactive processes in which listeners combine information from the acoustic signal with information from linguistic co-text and extralinguistic context (e.g., Field, 2004, 2008; Goh & Vandergrift, 2021; see also Kennedy, 2021). This has repeatedly been demonstrated regarding intelligibility to L1 listeners in publications on speech science, psycholinguistics, or speech pathology (e.g., Bent et al., 2019; Garcia & Cannito, 1996; Kamide et al., 2003) and regarding L2 listening comprehension more generally (e.g., Macaro et al., 2005). However, research on co-textual and contextual effects has been comparatively scarce concerning intelligibility to L2 listeners, notably L2 listeners in lingua franca (LF) listening situations—that is, when listening to another nonnative accent.
Yet, it is precisely in such contexts that an increased need to rely on co-textual and contextual information to compensate for unexpected pronunciation patterns can be expected. The present paper seeks to fill this research gap by expanding the scope of intelligibility research on co-textual and contextual factors to English as LF (ELF) listeners. Additionally, it adopts a differentiated perspective regarding nonnative ELF listeners’ ability to rely on co-text and context, acknowledging that this ability might be affected by listeners’ proficiency level. Based on a large-scale experimental study, it shows that nonnative listeners at different levels profit from various types of co-textual and contextual cues when recognizing words spoken with another nonnative accent. Its findings highlight the need for a greater recognition of co(n)textual factors and listening proficiency as relevant variables in research on international intelligibility, both of which have so far received relatively little attention in the field (see also Kang, Moran, et al., 2020).
Intelligibility as spoken word recognition: A cognitively interactive process
Intelligibility is a complex concept that has been defined in various ways (see also Munro & Derwing, 2015). The present paper is primarily interested in the level of spoken word recognition (SWR; cf. Smith’s intelligibility, 1992)—that is, the process of matching units in the stream of speech to a particular item stored in the mental lexicon (e.g., Magnuson, 2017). Crucially, SWR does not merely involve information from the acoustic level, which would then be processed as words, sentences, and so on (bottom-up processing; see Field, 2004) but is essentially interactive (e.g., Mirman, 2017) because listeners combine bottom-up information with top-down information—that is, higher level linguistic and extralinguistic information (or, in other words, co-textual and contextual information), which then affects the processing of smaller units such as speech sounds (see Field, 2004).
The use of top-down information can take various shapes in SWR. Studies have shown that listeners draw on lexical knowledge—for example, word frequency knowledge or knowledge of the phonological “neighborhood” of a word (Bradlow & Pisoni, 1999; Luce & Pisoni, 1998)—and on aspects of the visual context, such as gestures (Garcia & Cannito, 1996) or referent availability (Tanenhaus et al., 1995), to identify spoken words. Additionally, numerous studies have provided evidence for the use of sentence co-text in SWR, with syntactic, semantic, and collocational constraints aiding listeners in recognizing words (Baese-Berk et al., 2021; Behrman & Akhund, 2013; Bent et al., 2019; Hilpert, 2008; Kennedy & Trofimovich, 2008). Sentence co-text has also been found to facilitate SWR by triggering certain schematic expectations that make listeners anticipate the intended word (Kamide et al., 2003). Finally, Garcia and Cannito (1996) and Hustad and Beukelman (2001) used verbal cues (e.g., “relocating to a new city”) to prime listeners schematically for certain situations or scripts, which contributed positively to SWR.
Spoken word recognition in nonnative listeners
The research discussed above was concerned with co(n)textual effects on SWR in L1 listeners, often under adverse listening conditions such as when listening to dysarthric speech (e.g., Garcia & Cannito, 1996) or L2-accented speech (Baese-Berk et al., 2021; Behrman & Akhund, 2013; Bent et al., 2019; Kennedy & Trofimovich, 2008). Such research concerning L2 listeners is much harder to find, although other factors associated with intelligibility to nonnative listeners are well studied, such as the effect of various talker–listener pairings. Thus, various studies examined whether L2 listeners experience a benefit when listening to nonnative rather than native speech, either in their own accent or a different L2 accent (Bent & Bradlow, 2003), but evidence regarding this phenomenon is mixed (cf. Stibbard & Lee, 2006), and research suggests that its manifestation depends (at least partly) on listening proficiency and the acoustic phonetic similarity of the talker’s and listener’s accents (e.g., Pinet et al., 2011; Stringer & Iverson, 2019).
Research on co(n)textual effects on intelligibility to L2 listeners, however, is comparatively scarce, although of particular interest as results for L1 listeners in this respect may not apply to (all types of) L2 listeners. This is because nonnative listeners may be less adept at combining bottom-up and top-down information, the interactive processing of the speech signal here being less “automatic” because their linguistic knowledge is typically narrower (Goh & Vandergrift, 2021, p. 20) and not as deeply entrenched (Mauranen, 2018). It has thus been suggested that L2 listeners may be more dependent on acoustic information than native listeners (e.g., Cutler, 2012; Jenkins, 2000). Crucially, this is not to say that nonnative listeners solely rely on bottom-up information in SWR. Bradlow and Pisoni (1999) and Yoneyama and Munson (2010) found evidence that nonnative listeners draw on top-down lexical information (word frequency knowledge and knowledge of the phonological neighborhood of a word) when identifying spoken words. There is also evidence that nonnative listeners profit from semantic and schematic relations in the sentential co-text of a word in SWR (Lagrou et al., 2013; Mack, 1992; see also Bradlow & Alexander, 2007, for a “clear” speaking style but not a “plain” speaking style).
Thus, nonnative listeners also seem to rely on top-down information in SWR, though not necessarily in the same way or to the same extent as native listeners, which may sometimes indeed make them more dependent on acoustic information. Still, top-down influences may play a considerable role in L2 listening, especially when they are extralinguistic. Interestingly, Wolff (1987) argues that nonnative listeners may rely more heavily on top-down processing than native listeners due to problems with bottom-up processing owing to a lack of L2 knowledge (see also Field, 2004). He assumes that once a certain threshold has been attained, nonnative listeners will resort to bottom-up processing more intensively (1987, p. 313), implying that top-down processing is particularly pronounced in lower proficiency listeners. Evidence for this is provided by Koster (1987), who found that lexical cues improved SWR the most for the L2 listeners with the lowest proficiency in her study, compared with high-proficiency L2 listeners and L1 listeners. Similarly, Field (2004) found evidence for the dominance of top-down knowledge over bottom-up information in lower intermediate listeners’ SWR, albeit only when the available sentence co-text was “highly constraining” (2004, p. 373). These findings stand in stark contrast to the established assumption of “bottom-up dependency” in low-proficiency L2 listeners (Field, 2004, p. 364), with their (necessarily greater) attention to the linguistic code distracting them from exploiting co-text and context effectively (Field, 2008, p. 132), for which there is also some empirical support (e.g., Gu et al., 2005).
Given these somewhat inconclusive findings regarding top-down processing in low-proficiency L2 listeners, it has been suggested that their reliance on top-down information simply has a different function than in accomplished listeners—namely, a compensatory function—to supplement incomplete understanding (Field, 2008). As discussed above, native listeners likewise rely on co-text and context when confronted with difficult input, so language users seem to naturally adopt this approach when encountering an incomplete or ambiguous acoustic signal. Crucially, this becomes an essential skill for both native and nonnative listeners in international contexts, where they typically encounter numerous unexpected pronunciation patterns.
International intelligibility and the co-text/context problem
From the above discussion, one might expect pronunciation to play a subordinate role for intelligibility in international communication, as nonnative listeners, particularly after having attained a certain proficiency level, should be able to rely on cues from the linguistic co-text and the extralinguistic context in SWR of another nonnative accent. In contrast, Jenkins (2000, 2002) observed that in international encounters, nonnative listeners tended to have difficulty drawing on such cues when trying to understand their nonnative interlocutor, with pronunciation consequently being the number one cause of miscommunication in her data. Interestingly, the listeners she observed were by no means low-proficiency listeners but at B2–C1 level according to the Common European Framework of Reference (CEFR; see Jenkins, 2002). Jenkins’ findings, which were derived from interactive and multimodal speaking tasks, led her to conclude that nonnative listeners “below the level of bilingual proficiency appear unable to process contextual cues to compensate for their interlocutors’ pronunciation errors” (2002, p. 87) and thus tend to overrely on the acoustic input. Interestingly, this stands in stark contrast to more general accounts of ELF communication that stress its situatedness in a particular context of communication and international users’ ensuing attention to aspects beyond the linguistic code (e.g., Seidlhofer, 2011). Nevertheless, few studies have hitherto investigated the role of co-textual and contextual information for international intelligibility. Osimk (2009) found that nonnative ELF listeners profit from sentential co-text in SWR, and Thir (2021) showed how nonnative ELF listeners at a proficiency level similar to those in Jenkins (2000, 2002) draw on various types of linguistic and extralinguistic cues in SWR during interactive and multimodal speaking tasks. Interestingly, in contrast to Jenkins’ participants, who tended to “adjust[] the context and/or co-text to bring them into line with the acoustic information rather than vice versa” (2002, p. 90), one listener in Thir (2021) seemed to do the opposite, ignoring acoustic information probably in favor of an interpretation congruent with the available visual context. In addition, several studies (Kaur, 2011; Mauranen, 2006; Pitzl, 2010) did not find pronunciation problems to be a major source of miscommunication in international interactions (however, cf. Deterding, 2013), suggesting that many nonnative ELF listeners compensate for phonetic-phonological variation by relying on other sources of information besides the acoustic signal.
Clearly, co-text and context have the potential to play a considerable role for intelligibility among nonnative ELF listeners, but empirical findings in this respect are scarce and inconclusive. As previous research suggests that listening proficiency is a crucial variable in the use of top-down processing in nonnative listeners, a more nuanced approach that differentiates more finely between listeners at different proficiency levels seems in order. Obviously, nonnative listeners may differ vastly in their knowledge of the L2 and thus in the availability of linguistic knowledge as a top-down resource in L2 speech perception, as well as in their ability to effortlessly switch between top-down and bottom-up processing. This might explain the finding that listening proficiency affects listeners’ sensitivity to L2 accented speech (Kang, Moran, et al., 2020), suggesting that listeners at different levels differ in their ability to compensate for phonological ambiguity with the help of co(n)textual cues. The present study thus examines the following research questions:
RQ 1: Do nonnative listeners at different proficiency levels (intermediate to low advanced) profit from different types of co-textual and contextual cues when identifying words spoken with another nonnative accent?
RQ 2: If yes, do they do so to a lesser extent than highly advanced nonnative listeners (i.e., those at C2 level)?
As co-text and context have received relatively little attention so far in research on international intelligibility and seem to be somewhat underestimated factors, a further objective of this study is to compare their effects to that of a variable more frequently discussed in the literature: speech familiarity (e.g., Deterding, 2013; Gass & Varonis, 1984; Kennedy & Trofimovich, 2008; Smit, 2010). RQ 3 thus asks the following:
RQ 3: What is the relative influence of co-textual and contextual information on intelligibility to nonnative listeners at different proficiency levels compared to that of speech familiarity?
The types of co(n)textual cues investigated in the present study include syntactic, semantic, and schematic cues. Whereas syntactic and semantic information are here viewed as relating to linguistic co-text, schematic information is regarded as relating to language users’ cognitive sociocultural context (Widdowson, 2011). This is because schemata are knowledge structures that are shaped by previous experiences of the world and influence new ones (Gureckis & Goldstone, 2011, p. 725). However, as mentioned above, schematic expectations may sometimes be evoked by certain textual elements (e.g., sentence co-text, topic cues). That is, linguistic co-text may be used to trigger a certain cognitive context within a listener, as was also done in the present study.
Method
To attain a sufficiently large and internationally diverse sample of nonnative listeners, an online listening experiment was developed and distributed via the internet. The experiment involved four conditions in which listeners had to orthographically transcribe target words. Three of these conditions supplied listeners with co(n)textual cues. The Syn condition examined the effect of syntactic information on intelligibility. Target words were embedded in short, semantically neutral sentences that merely indicated the part of speech (POS) of the target word (given in square brackets), e.g., “It’s quite [flat].”
The Syn+Sem condition examined the effect of syntactic and semantic information on intelligibility. In addition to indicating the POS of the target word, carrier sentences included a semantic cue in the form of a prime word, which was semantically related to the target word, either in terms of a classic meaning relationship (meronomy, hyponymy, or antonymy) or, if target and prime differed in POS, via semantic entailment (i.e., the denotational meaning of the prime entailed an aspect of the target word’s meaning). For example, in “He was driving a [van],” drive entails [vehicle], which is a semantic property of the target word [van]. The target word was always preceded by the prime.
The Syn+Sch condition examined the effect of syntactic and schematic information and therefore included a schematic cue—that is, a short description of the situation in which the utterance occurred (in small caps in the following example). This cue appeared on screen before the carrier sentence was played. Carrier sentences indicated the POS of the target word but did not include any semantic primes as in the Syn+Sem condition—for example, “When getting up in the morning: I can’t find my [pants].”
The fourth condition was a control (C) condition, where target words were presented without any co-text or context. All target words and words in the carrier sentences were at A1–B2 level according to the English Vocabulary Profile (EVP; Cambridge University Press, 2012). Table 1 summarizes all target words and sentences. In each condition, the six target items were intermixed with nine distractor items. Most target words were monosyllabic to increase task difficulty and avoid ceiling effects, but each condition also included two disyllabic target words, as another part of the research project examined word-length effects on international intelligibility.
Note. Target words are in square brackets. Semantic cues are underlined, and schematic cues are in bold and small caps. Test items were not presented in the order above to avoid phonological priming effects.
Target words included either the Nurse or the Trap vowel (distributed evenly across conditions and mono- and disyllabic words), as another part of the research project compared the international intelligibility of the speaker’s realization of these two phonemes (see Thir, 2020). For the monosyllabic Trap words, each condition also contained two Trap-Dress minimal pair (MP) words (e.g., sand, land in the C condition), as the speaker’s realization of Trap as [e] neutralized this distinction, with the effect of MP status of a word thus being of interest. For Nurse, no MP words were included, as the speaker’s realization ([øə]) did not approximate another English phoneme. Each condition included two types of Trap-Dress MP words: one where the words differed in POS (e.g., bad–bed) and one where they did not (e.g., pan–pen).
Another factor considered in the selection of target words was word frequency, measured as the “contextual diversity” (CD) frequency count in the SUBTLEXUS corpus (Brysbaert & New, 2009; https://www.ugent.be/pp/experimentele-psychologie/en/research/documents/subtlexus/subtlexus3.zip). The eight monosyllabic non-MP words were comparable in terms of word frequency, as determined by phi coefficients calculated with a chi-square test. The same applied to the eight disyllabic target words. However, the monosyllabic non-MP words were more frequent than the disyllabic words, reflecting that word length and word frequency are inversely related (Zipf, 1935/1965). Unfortunately, it was impossible to balance the MP words in terms of frequency, as priority was given to the other selection criteria (occurrence in both American and British Standard pronunciation, adequate level in the EVP), which were applied to both words in an MP to increase the likelihood that it would be perceived as MP by L2 listeners. Thus, only a small number of suitable MPs remained.
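For illustration, such a pairwise frequency comparison could be computed in R roughly as sketched below. The CD counts and the corpus total are hypothetical placeholders, and the exact contingency table underlying the reported phi coefficients may have been constructed differently.

```r
# One plausible way of comparing two words' CD frequencies via a chi-square test
# and a phi coefficient (all counts below are hypothetical placeholders).
cd_a <- 1250         # contextual-diversity count of word A
cd_b <- 1310         # contextual-diversity count of word B
n_contexts <- 8388   # assumed total number of contexts in the corpus (illustrative)

tab <- rbind(c(cd_a, n_contexts - cd_a),
             c(cd_b, n_contexts - cd_b))
xt  <- chisq.test(tab, correct = FALSE)
phi <- sqrt(unname(xt$statistic) / sum(tab))
phi  # a negligible phi suggests the two words are comparable in frequency
```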
Carrier sentences in the Syn+Sem and Syn+Sch conditions were developed by the author and tested informally on, and discussed with, lay persons and colleagues, with a focus on their comprehensibility to L2 listeners and predictability in relation to the target words. The results suggested a satisfactory level of predictability (i.e., sentences pointed listeners to semantically or schematically appropriate words but did not render the additional audio stimulus superfluous). Carrier sentences were then tested (along with the sentences and words in the C and the Syn conditions) in a pilot study involving 97 L2 listeners from various L1 backgrounds (who were not included in the sample of this study), which further confirmed their appropriateness for L2 listeners.
Participants
A male L1 Austrian German speaker aged 68 years read the stimuli words and sentences. He was recruited due to his typical Austrian accent and his decade-long experience of using English in international contexts. He had learned English formally for 3 years.
The 423 nonnative listeners in this study (male = 142, female = 279, other = 2) came from a data set originally collected for a larger project on international intelligibility involving both native and nonnative speakers of English. They had been recruited via email, social media, and the author’s international contacts. Participants were aged 18–70 years, though most (79%) were 18–35 years old (mean age: 29.4 years). They came from numerous different L1 backgrounds, reflecting the diversity of international listeners around the world, though due to the sampling procedure, their L1s were mostly native to the European continent. A quarter (25%) were L1 speakers of Romance languages, 12% of Slavic languages, 11% of Germanic languages other than English, 7% of Finno-Ugric languages, and 4% of Greek. However, the sample also included L1 speakers of Turkish (12%); Chinese (6%); Thai, Arabic, or Iranian languages (4% each); and Japanese (3%). There were no L1 Austrian German listeners in the study.
Participants indicated their listening proficiency in English using the self-assessment scale for listening in a foreign language of the CEFR (Council of Europe, 2018, p. 167; descriptors were slightly adapted for clarity). Each descriptor corresponds to one of the six proficiency levels of the CEFR, allowing straightforward comparison with studies such as Jenkins (2000, 2002). Participants’ self-assessed listening proficiency levels are summarized in Table 2. About a third (36%) regarded themselves as highly advanced listeners (C2 level), more than a quarter (29%) thought they were low-advanced listeners (C1 level), and about a third believed themselves to be at intermediate or upper-intermediate level (B1 or B2 level).
Procedure
The experiment was conducted using the questionnaire application SoSciSurvey (https://www.soscisurvey.de/). Participants were first asked to adjust the volume of their computers using a test audio spoken in a standard British accent. The use of headphones and participation in calm surroundings were recommended. Participants were asked to provide consent and confirm that they had normal hearing and were not participating multiple times. They were then presented with the four conditions in randomized order. An autoplay function was used to prevent relistening to a test item. In all conditions, they had to type the target word in a designated gap on their screen. In the carrier sentences in the Syn, Syn+Sem, and Syn+Sch conditions (presented in writing in addition to the auditory stimulus), the gap was in place of the target word. The schematic cue in the Syn+Sch condition appeared on screen 5 s prior to the carrier sentence and the onset of the audio.
To reduce guessing and avoid ceiling effects, participants received 11 s (C and Syn conditions) or 12 s (Syn+Sem and Syn+Sch conditions) to submit their answer from the onset of the audio. The time limit was slightly longer in the Syn+Sem and Syn+Sch conditions as here the carrier sentences were slightly longer than in the Syn condition. If participants finished typing before the time limit, they could proceed to the next trial using a button on their screens. Each condition started with a timed practice item containing neither of the two targeted vowels. Participants could control the start of the actual trials themselves. The intelligibility portion of the survey lasted for about 15 min (when using the maximum amount of time).
The experiment concluded with a follow-up questionnaire including 10 items to measure participants’ speech familiarity. Participants indicated their prior contact with different types of spoken English (“English spoken with an Austrian accent,” “English spoken with an accent similar to the Austrian accent,” “English spoken with another non-native accent,” “English spoken with a native accent,” and “spoken English in general”) on a 6-point scale (ranging from none at all to very much) and their current exposure to the same types of spoken English on a 7-point scale (ranging from never to daily). Additional data were collected on participants’ attitudes toward the speaker but not included in the present analysis due to lack of space.
Data analysis
Intelligibility—specifically, SWR—was operationalized as an exact word match transcription. Misspellings (apart from accidental capitalization or accidentally hitting a numeral key) were not accepted, which inevitably penalized lower proficiency listeners but which was necessary to ensure a satisfactory level of objectivity in data coding. Added punctuation marks and transcriptions of parts of the carrier sentence in addition to the correct target word (e.g., “a van” for van) were accepted.
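To make the coding scheme concrete, the sketch below shows one way such a scoring rule could be implemented in R; the function and its regular expressions are an approximation of the scheme described above, not the script actually used in the study.

```r
# Approximate scoring rule: exact word match, tolerating capitalization, added
# punctuation, stray numerals, and carrier-sentence material before the target.
score_response <- function(response, target) {
  r <- tolower(trimws(response))
  r <- gsub("[[:punct:][:digit:]]", "", r)            # drop punctuation / accidental numerals
  grepl(paste0("\\b", target, "$"), r, perl = TRUE)   # target must end the cleaned response
}

score_response("A van.", "van")   # TRUE  (preceding carrier-sentence material accepted)
score_response("ven", "van")      # FALSE (misspellings not accepted)
```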
The data were analyzed using a generalized linear mixed-effects model in R (version 4.1.3, https://www.R-project.org/) with RStudio (version 2022.2.0.443, http://www.rstudio.com/) and the lme4 package (version 1.1-28; Bates et al., 2015). Correct word identification (yes/no) was entered as the binomially distributed dependent variable with fixed binomial totals of six (as each condition contained six target words). The model thus predicted the per-trial probability of a word being correctly identified (henceforth PCI) in terms of log odds, which can be converted into probabilities (0–100%). A fixed or random effect was included in the model if it decreased the Akaike information criterion (AIC; Akaike, 1974) and significantly improved the model’s fit (p < .05) as assessed by a log-likelihood ratio test. The optimizer bobyqa was used to avoid convergence errors.
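A minimal sketch of this modeling approach is given below, assuming a data frame with one row per listener and condition and illustrative variable names; it is not the original analysis script.

```r
# Minimal sketch of the model and the inclusion criterion described above
# (variable and object names are illustrative).
library(lme4)

# d: one row per listener x condition, with the number of correctly identified
# target words out of the six presented in that condition.
m1 <- glmer(cbind(n_correct, 6 - n_correct) ~ condition + (1 | listener),
            data = d, family = binomial,
            control = glmerControl(optimizer = "bobyqa"))

# A candidate predictor is retained if it lowers the AIC and the log-likelihood
# ratio test indicates a significantly better fit (p < .05).
m2 <- update(m1, . ~ . + listening_proficiency)
anova(m1, m2)   # reports AIC values and the likelihood-ratio test
```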
In addition to the variables condition and listening proficiency, two measurements of speech familiarity were included in the model. These were component scores derived from a principal component analysis (PCA) of the 10 items measuring participants’ speech familiarity; the PCA was used to preempt issues of collinearity in the regression analysis (e.g., Tomaschek et al., 2018) and was run on the entire data set (including native listeners and A1 and A2 listeners) from which the present subset of nonnative listeners was taken. The PCA revealed two dimensions of speech familiarity: a general one (henceforth general familiarity) relating to participants’ familiarity with other native and nonnative accents in English and with spoken English in general, and a specific one (henceforth specific familiarity) relating to the particular accent they were listening to (the Austrian accent) and similar accents in English.
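A rough sketch of how such component scores can be derived is shown below; the item names are hypothetical, and which component corresponds to which familiarity dimension is an assumption for illustration only.

```r
# Sketch of deriving the two familiarity scores from the 10 questionnaire items
# (item names hypothetical). As noted above, the PCA was run on the full data
# set, not only on the nonnative subset analyzed here.
fam_items <- full_data[, paste0("familiarity_", 1:10)]
pca <- prcomp(fam_items, center = TRUE, scale. = TRUE)
summary(pca)   # variance explained per component

# Assigning the first two component scores as predictors (the mapping of
# components to "general" vs. "specific" familiarity is assumed here).
full_data$general_familiarity  <- pca$x[, 1]
full_data$specific_familiarity <- pca$x[, 2]
```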
The model’s random effects structure included a random intercept for listener as well as a by-listener random slope for condition to account for the fact that participants might react differently to the four conditions (i.e., to the availability of co-text and context). The benefit of this random slope was assessed by comparing two “beyond optimal” models (Zuur et al., 2009), each of which contained fixed effects for condition, the two familiarity variables, and the interaction term Condition × Listening Proficiency but which differed with respect to the inclusion of the random slope. Despite the inclusion of the interaction term, which already captured a substantial amount of by-listener variance with respect to the effect of condition, the random slope still resulted in a significantly better model fit (p < .001, decrease in AIC: 49.2) but also a singular fit. Singular fits are not uncommon (Bates et al., 2015, p. 25) but may indicate an overfitted model. However, the sizeable standard deviations for the Syn+Sem and the Syn+Sch slope (≥ 0.85 on the logit scale; Table 3) suggested that the random slope was not a superfluous effect. The singular fit was due to the high correlation of the random effects for the three slopes of condition. That is, participants who had higher slopes in the Syn condition also had higher slopes in the Syn+Sem and Syn+Sch conditions to virtually the same extent. This pattern was most likely a consequence of the experimental design rather than of overparameterization or the model failing to estimate these parameters correctly, with participants who profited particularly from the syntactic cue in the Syn condition also doing so in the Syn+Sem and the Syn+Sch conditions. The random slope was therefore retained, which also constituted the more conservative option, as omitting random slopes can increase the danger of Type I errors (Barr et al., 2013). Moreover, the model evaluation procedure for the model’s fixed-effects structure yielded exactly the same results with and without the random slope.
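The comparison of the two “beyond optimal” models and the singularity check might look roughly as follows, again with illustrative names and building on the sketch above.

```r
# The two "beyond optimal" models: identical fixed effects, differing only in
# the by-listener random slope for condition.
m_nosl <- glmer(cbind(n_correct, 6 - n_correct) ~
                  condition * listening_proficiency +
                  general_familiarity + specific_familiarity +
                  (1 | listener),
                data = d, family = binomial,
                control = glmerControl(optimizer = "bobyqa"))

m_sl <- glmer(cbind(n_correct, 6 - n_correct) ~
                condition * listening_proficiency +
                general_familiarity + specific_familiarity +
                (1 + condition | listener),
              data = d, family = binomial,
              control = glmerControl(optimizer = "bobyqa"))

anova(m_nosl, m_sl)   # likelihood-ratio test and AIC difference
isSingular(m_sl)      # TRUE here: random-effect correlations at the boundary
VarCorr(m_sl)         # random-slope SDs and correlations (cf. Table 3)
```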
The model’s fixed effects structure was then constructed by forward stepwise regression, with condition being entered first, followed by listening proficiency, general familiarity, and specific familiarity. The Condition × Listening Proficiency interaction improved the model’s fit further and was therefore also included in the final model.
Results
Effect of condition across different listener groups
The summary of the final model is provided in Table 4. Intelligibility in the C condition for C2 listeners was set as the reference category (intercept). Condition had a significant effect relative to this reference category, with intelligibility being significantly higher for C2 listeners in the Syn, the Syn+Sem, and the Syn+Sch condition than in the C condition (p < .001, respectively), which is in line with the notion that highly advanced nonnative listeners are able to rely on co-text and context in SWR of another nonnative accent. Moreover, condition and listening proficiency interacted significantly in several combinations (ranging from p < .001 to p = .020), meaning that, as expected, the effect of condition was regulated by a listener’s proficiency level. This variability is visualized in Figure 1, which shows the estimated per-trial PCI across different listener groups and conditions. The PCI values (Table 5) have been obtained using allEffects() from the effects package (version 4.2-1; Fox, 2003; Fox & Weisberg, 2019).
Note. Significance codes: ***p < .0005, **p < .005, *p < .05.
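A brief sketch of how such probability estimates can be obtained from the fitted model is given below; final_model stands in for the fitted model object and is illustrative.

```r
# Converting the model's log-odds estimates into per-trial probabilities (PCI),
# as reported in Table 5 and plotted in Figure 1 ('final_model' is illustrative).
library(effects)

eff <- allEffects(final_model)
summary(eff)   # estimated probabilities with confidence intervals
plot(eff)      # effect displays roughly corresponding to Figure 1

# The same conversion by hand: the inverse logit of a log-odds estimate.
plogis(0)      # a log-odds of 0 corresponds to a probability of .50
```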
Figure 1 illustrates that all listener groups profited from the cues in the Syn condition. The size of this effect is very similar for B2, C1, and C2 listeners, who all exhibit a significant gain in intelligibility (p < .001, respectively), equaling an increase in PCI of 40%–44%. In fact, the effect of Syn is no different for C1 and B2 listeners than for C2 listeners (p = .242 and p = .887, respectively). That is, upper-intermediate and low-advanced listeners seem to benefit from syntactic information when identifying words spoken in another nonnative accent to a very similar extent as highly advanced listeners. However, B1 listeners exhibit a significantly smaller effect of Syn than C2 listeners (p = .010), which nevertheless amounts to an increase in PCI of 29% and is significant at p < .001.
Moving on to the effect of additional semantic information, all listener groups are predicted to exhibit similar gains in PCI from the Syn to the Syn+Sem condition (21%–26%). For each of them, the effect of Syn+Sem vis-à-vis the Syn condition is significant at p < .001. Interestingly, the estimated gain in PCI is slightly smaller for C2 listeners (21%) than for B2 and C1 listeners (23% and 24%) and highest for B1 listeners (26%). However, this is not due to an increased ability in B1 listeners to profit from semantic cues but to a ceiling effect on the probability scale for the other three groups, especially C2 listeners, who are approaching the maximum value of 100% PCI in the Syn+Sem condition. In other words, there is not much more to be gained in terms of intelligibility for these listeners. In terms of odds ratios (ORs), the effect of the Syn+Sem vis-à-vis the Syn condition is in fact highest for C2 listeners (OR = 9.67), followed by C1 listeners (OR = 6.31), B2 listeners (OR = 4.69), and B1 listeners (OR = 3.17). The difference between C2 listeners and C1 listeners in terms of this slope is nonsignificant (p = .298), but the intelligibility benefit from the Syn to the Syn+Sem condition is significantly greater for C2 listeners than for B2 listeners (p = .027) and B1 listeners (p < .001).
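This compression of probability gains near ceiling can be illustrated with a small calculation; the Syn-condition starting probabilities below are invented for illustration, whereas the odds ratios are those reported above.

```r
# Why a larger odds ratio can translate into a smaller gain on the probability
# scale near ceiling (starting probabilities are illustrative, ORs from the text).
p_syn <- c(B1 = 0.45, C2 = 0.80)        # hypothetical PCI in the Syn condition
or    <- c(B1 = 3.17, C2 = 9.67)        # Syn+Sem vs. Syn odds ratios

odds_sem <- (p_syn / (1 - p_syn)) * or  # apply the odds ratio on the odds scale
p_sem    <- odds_sem / (1 + odds_sem)   # back-transform to probabilities
round(p_sem - p_syn, 2)                 # the C2 gain is smaller despite the larger OR
```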
The additional schematic cue in the Syn+Sch condition is also predicted to result in gains in PCI relative to the Syn condition for all listener groups, albeit consistently lower ones than in the Syn+Sem condition. Moreover, the effect of the additional schematic cue is more variable across the different listener groups. Although it is significant for C2, C1, and B2 listeners (p < .001, respectively), it results in a substantially higher gain in PCI for the former two groups (17% and 19%) than for the latter (12%). The difference in slopes between C2 and C1 listeners is nonsignificant (p = .597), but the one between C2 and B2 listeners is significant at p = .003. For B1 listeners, the predicted gain in PCI is quite low (7%) and the effect is nonsignificant (p = .298).
Effects of condition and speech familiarity compared
Figure 2 visualizes the model’s predictions regarding the simple main effects for general familiarity and specific familiarity. Both familiarity indices are estimated to significantly increase a listener’s chance to identify a word correctly (p < .001), with the effect being slightly stronger for general familiarity than for specific familiarity.
Figure 3, computed with the sjPlot package (version 2.8.10; Lüdecke, 2021), provides an overview of how these effects compare to that of condition for each group of listeners in terms of OR. An OR below 1 signifies a decrease in the odds to correctly identify a word, whereas an OR above 1 signifies an increase. In each plot, the variables are ordered from greatest to smallest OR. For all four listener groups, the variables follow the same order: the variable with the highest OR (and thus the greatest effect on intelligibility) is the level Syn+Sem of condition, followed by Syn+Sch, Syn, and the two familiarity indices. Whereas the ORs of the three levels of condition vary across the graphs (due to the Listening Proficiency × Condition interaction), the values for the between-subject variables are the same in the four plots. As indicated above, the most important between-subject predictor is general familiarity, followed by specific familiarity. Whereas the effects of these two variables can be compared relatively easily as they constitute regression coefficients operating on similar scales, comparing their effects to that of a categorical variable such as condition is more challenging.
Importantly, the ORs of general familiarity and specific familiarity relate to an increase of 1 on the variable’s scale. The range of general familiarity is 6.5 (from -5.4 to 1.1), so an increase from minimum to maximum on the general familiarity scale amounts to an OR of 3.45 (i.e., 1.21^6.5, as with every 1-unit increase, the odds for correct SWR increase by a factor of 1.21). This is the maximum possible effect (MPE) of general familiarity on intelligibility. For specific familiarity (range = 4.3), the MPE amounts to an OR of 1.89. These MPEs are visualized in Figure 3 as red dashed lines. For C2, C1, and B2 listeners, the MPE of general familiarity is still considerably lower than the OR of Syn (see Figure 3a–c), which involved the embedding of the target word in a very simple syntactic structure. Thus, for these three listener groups, even the simplest co-textual information still has a greater positive influence on intelligibility than an extreme increase in general familiarity and in specific familiarity. Looking at the ORs of the Syn+Sch and the Syn+Sem condition, which range between 11.93 and 26.85 (Syn+Sch condition) and between 29.89 and 63.06 (Syn+Sem condition) for C2, C1, and B2 listeners, the MPEs of general familiarity and of specific familiarity seem even smaller.
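The arithmetic behind these maximum possible effects is straightforward and can be reproduced as follows (values taken from the text):

```r
# "Maximum possible effect" (MPE) of general familiarity: the per-unit odds
# ratio raised to the power of the variable's observed range (values from the text).
or_general    <- 1.21           # OR for a 1-unit increase in general familiarity
range_general <- 1.1 - (-5.4)   # observed range of the general-familiarity score (6.5)
or_general ^ range_general      # ~3.45, the MPE of general familiarity
# Analogously, specific familiarity (range = 4.3) yields an MPE of about 1.89.
```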
For B1 listeners, the effect of Syn is not necessarily larger than the MPE of general familiarity (note the CIs in Figure 3d) but is, with 95% certainty, noticeably larger than the MPE of specific familiarity. The OR of the Syn+Sch condition (4.93) is estimated to be either slightly higher than the MPE of general familiarity or just about the same (see the lower bound of the CI of Syn+Sch in Figure 3d), whereas the OR of the Syn+Sem condition (11.69) is estimated to be far greater. Both effects are far greater than the MPE of specific familiarity.
Item analysis
To examine whether the observed differences in intelligibility applied consistently to all six target words within each condition, an item analysis was conducted. As can be seen in Figure 4, the seven most difficult words included all six target words in the C condition as well as one word from the Syn condition, thus corresponding to the finding that words in the C condition were least intelligible to listeners. The difficulty of the Syn word gas, which was consistent across all listener groups (same rank as in Figure 4a for all of them), can be explained by its status as a special type of Trap-Dress MP word where both words are of the same POS. Because the simple syntactic co-text in the Syn condition was insufficient to disambiguate It’s a gas/guess, gas was similarly difficult as words in the C condition. In contrast, due to the additional semantic and schematic cues, the same type of MP word in the Syn+Sem and the Syn+Sch conditions (pan-pen, pants-pence) was much more intelligible (for a full discussion, see Thir, 2020).
The most intelligible target words—that is, the top third—were mostly words in the Syn+Sem condition, occupying the top three ranks in Figure 4a. However, we also find two words each from the Syn+Sch and the Syn conditions. Thus, the trend that Syn+Sem words were most intelligible only holds to some extent. Whereas the special status of the MP word pan (see above) might explain its somewhat reduced intelligibility compared with the other Syn+Sem words, it is difficult to explain the comparatively low intelligibility of van (though notably, it was still identified correctly in 75% of all cases). Interestingly, van underperformed to a similar extent for all listener groups, being in 13th (C1 listeners), 14th (C2 and B2 listeners), or 15th place (B1 listeners).
A somewhat mixed image appears when considering words in places 9–16 in Figure 4a. This middle third is mostly occupied by Syn+Sch words but also includes two words each from the Syn and the Syn+Sem conditions. In particular, the Syn+Sch words cab and servant seemed to underperform. Regarding servant, this might be explained by difficulties in spelling and the strict coding scheme adopted, as it was the word exhibiting the greatest amount of spelling variation in the study. Had the most common variation (*servent) been accepted, an additional 8% of items would count as correctly identified. Notably, this spelling was proportionally much more frequent among B1 listeners (18% of all entries) than among all other listener groups (10% for B2 listeners and 5% for C1 and C2 listeners, respectively). A further explanation is imprecision in the schematic cue: Had the plural servants been accepted, which seems equally logical on the basis of the available cue, an additional 6% of answers would have been correct. Taking these two adjustments together, the word servant would have been correctly identified in 87% of all cases, putting it in the top third in Figure 4. The underperformance of cab might, as one reviewer suggested, have to do with its somewhat outdated status in times of Lyft and Uber. Notably, this word was particularly tricky for B1 listeners, for whom it was the 5th most difficult word, so lack of familiarity with this word might indeed have been the issue.
In sum, the item analysis clearly mirrors the finding that words in the C condition were most difficult to understand. It also roughly reflects the finding that words in the Syn+Sem condition were easiest to understand, followed by words in the Syn+Sch condition, though these trends are less clear (see also Figure 4b). Regarding words in the Syn condition, there is considerable variability: whereas gas was particularly difficult for listeners (for the reasons mentioned above), the words nurse and bad were more intelligible than expected. The observed difference between the Syn and the Syn+Sem (or the Syn+Sch) conditions thus seems partly attributable to the difficulty of identifying a particular type of MP word in a semantically and schematically neutral sentence co-text.
Discussion
RQ1 asked whether nonnative listeners at different proficiency levels (intermediate to low-advanced) profit from different types of co-textual and contextual cues in SWR of another nonnative accent. The current study provided evidence for this assumption regarding syntactic, semantic, and schematic information, operationalized as the Syn, the Syn+Sem, and the Syn+Sch condition, for most proficiency groups examined. The results suggested that nonnative listeners below C2 level are able to exploit simple syntactic cues when listening to another nonnative accent, with the experienced intelligibility benefit being particularly substantial for low-advanced (C1) and upper-intermediate (B2) listeners. Moreover, listeners at intermediate level (B1) and upward also profited significantly and quite substantially from additional semantic cues (Syn+Sem condition). These findings are in line with other studies that found sentential co-text to benefit intelligibility to L2 listeners (e.g., Mack, 1992; Lagrou et al., 2013; Osimk, 2009) but contrast with an earlier claim that listeners below the highly advanced level would be unable to use such information compensatorily when listening to another nonnative accent (Jenkins, 2000, 2002). A beneficial effect of additional schematic information (Syn+Sch condition), however, could only be observed for upper-intermediate and low-advanced listeners in the present study, for whom it was markedly less pronounced than the effect of semantic information. In other words, nonnative listeners at intermediate level might not be able to benefit from additional schematic cues as operationalized in this study. One might argue that schematic information is less accessible to many nonnative listeners from various linguacultural backgrounds because schemata are highly culture dependent and thus necessarily more elusive than semantic or syntactic relationships between linguistic entities. However, part of this finding seems due to two underperforming target words in the Syn+Sch condition, one of which can be partly explained by a slight imprecision in the cue provided as well as spelling difficulties that seem to have affected B1 listeners in particular, whereas the other might be explained by a lack of word familiarity.
RQ2 asked whether nonnative listeners below C2 level benefit from syntactic, semantic, and schematic cues in SWR of another nonnative accent to a lesser extent than listeners at C2 level (i.e., highly advanced ones). The results of the current study suggest that upper-intermediate (B2) and low-advanced (C1) listeners do not differ from highly advanced nonnative listeners with respect to their experienced benefit of syntactic and semantic information when recognizing words spoken with another nonnative accent. Moreover, there was no difference between highly advanced and low-advanced listeners with respect to the benefit of an additional schematic cue. In contrast, upper-intermediate listeners may not be able to exploit such cues to the same extent as advanced listeners. Importantly, although the current study clearly suggests that there are differences between nonnative listeners at different proficiency levels with respect to their ability to profit from co-textual and contextual cues, it also shows that highly advanced nonnative listeners are by no means the only group to effectively exploit syntactic, semantic, and schematic cues when listening to another nonnative accent and that low-advanced and upper-intermediate listeners may differ little or not at all from them in this respect. This contrasts with an early claim regarding highly advanced listeners’ superior ability to profit from co(n)textual cues when listening to another nonnative accent (Jenkins, 2000).
RQ3 asked about the relative influence of co-textual and contextual information on intelligibility compared to that of speech familiarity for nonnative listeners at different proficiency levels. For listeners at upper-intermediate level and above (B2–C2), the current study suggests the superior influence of co-textual and contextual information on intelligibility with respect to two types of speech familiarity (general familiarity, i.e., familiarity with different types of native and nonnative English accents and spoken English in general, and specific familiarity, i.e., familiarity with the particular accent one is listening to and similar accents). For B1 listeners, the data suggest that co-textual and contextual information are more important than specific familiarity and that a combination of syntactic and semantic information has a greater positive influence on intelligibility than general familiarity. However, whether general familiarity matters more or less than syntactic information alone, or than syntactic plus schematic information, for intelligibility to B1 listeners could not be assessed with certainty.
Overall, co-textual and contextual information emerged as the most important variable for intelligibility to listeners at upper-intermediate level and above and as at least the second most important variable for intelligibility to listeners at intermediate level. These results stand in contrast to the idea that co-textual and contextual information would play a subordinate role for international intelligibility to most nonnative listeners due to their overreliance on acoustic information. However, this finding is reminiscent of Gass and Varonis’s (1984) study, which compared different types of familiarity (e.g., familiarity with nonnative speech and familiarity with a particular nonnative accent) with respect to their effect on intelligibility to L1 English listeners. Notably, the type of familiarity most critical for increasing intelligibility was “familiarity with topic”—that is, contextual or background knowledge.
One limitation of the current study that needs to be acknowledged relates to its ecological validity. Clearly, real-world conditions of processing language differ from those in the current experiment, where listeners could focus on the written cues and audio stimuli. In naturally occurring interactions in international contexts, listeners might not be able to draw on co-text and context to the same extent due to processing overload (having to interact and listen at the same time) or because the surrounding co-text of a word is (partly) unintelligible (an issue prevented here by additionally presenting carrier sentences in writing), leading to greater dependency on the acoustic signal. However, the present experiment posed other challenges that listeners typically do not have to face in naturally occurring face-to-face encounters, such as a lack of visual support in the form of gestures or facial expressions or the need to transcribe the target word under a time constraint. It may be argued that the transcription task encouraged listeners to focus on a target word’s phonological form and thus resulted in a greater amount of bottom-up processing than in naturally occurring language use. From this point of view, the results of the present study appear particularly remarkable.
Another limitation is the fact that listening proficiency was determined via self-assessment, an approximate measure in comparison to, for example, standardized test scores. Consequently, the findings stratified by listening proficiency might only be approximate. This limitation was accepted in favor of obtaining a large, internationally diverse sample including numerous listeners from remote locations and of enabling straightforward comparison of the results to Jenkins’ (2000) findings with the help of the CEFR scale.
Finally, the fact that a single L2 talker supplied the stimuli for the present study reduces the generalizability of its results. More research is necessary to examine whether the proficiency-related modulation of co(n)textual effects on international intelligibility persists across L2 talkers of varying proficiency (and L1 background), ideally with L1 talkers as comparators. An inclusion of L1 listeners, which unfortunately exceeded the scope of this paper, would be equally desirable, to provide a more complete picture of international intelligibility.
Conclusion
To date, co-textual and contextual effects have received relatively little attention in research on international intelligibility. This study has addressed this research gap, expanding the scope of existing intelligibility research concerning co-textual and contextual cues to nonnative LF listeners. Its results clearly demonstrate the significance of co-textual and contextual information for the intelligibility of L2-accented speech to L2 listeners at various proficiency levels, even when compared with speech familiarity. Its findings are thus in line with the idea that listening in general and SWR in particular are interactive processes in which bottom-up and top-down information are combined to arrive at a meaningful interpretation of speech. However, it also suggests that linguistic proficiency plays a crucial role in nonnative listeners’ ability to rely on co-text and context in SWR of another nonnative accent and that its regulatory power with respect to international intelligibility in general (see also Kang, Moran, et al., 2020) and co(n)textual effects in particular deserves greater attention than it currently receives. Future research would thus benefit from adopting a more fine-grained differentiation between listeners at different proficiency levels rather than generalizing across nonnative listeners when studying international intelligibility, as has often been done (e.g., Deterding, 2013; Gardiner, 2019; Jenkins, 2000).
Given the importance of co-textual and contextual effects for international intelligibility, it seems crucial that they receive greater attention both in academic research and in language pedagogy. Regarding the former, the findings of the present study call for a shift in focus, away from exclusively studying the role of particular pronunciation features, to encompass the co-textual and contextual factors that regulate the importance of target-like pronunciation for mutual understanding among nonnative users of English (see also Thir, 2020). Another relevant factor in this respect (and fruitful avenue for future research) may be syllable structure and word length, as the importance of target-like sound production may also depend on how much compensatory phonological material within a word (i.e., word-internal co-text, often termed word-internal phonological context) is available to support listener understanding.
The methodological implications of the present study also need to be recognized, as co-text and context clearly constitute potential confounding factors that need to be taken into account in future research on international intelligibility. In particular, this concerns the choice and construction of stimulus texts and sentences, whose comparability in terms of semantic and schematic predictability needs to be ensured in order to avoid co(n)textually induced biases. Otherwise, differences in intelligibility between, for example, different sound substitutions might arise not because one feature is actually easier to understand but because the co(n)text in which it occurred gave stronger cues about the intended words than in the case of other features.
Concerning language pedagogy, there are several ways in which the importance of co-textual and contextual information for mutual understanding among nonnative users could be addressed in the English language classroom. Although learners will obviously benefit from pronunciation instruction to attain a certain threshold of international intelligibility and from increasing their familiarity with different accents of English, it seems important to also equip them with communicative skills that will enhance their intelligibility whenever their pronunciation skills may fail them. This involves co(n)textualizing—namely, supporting listener understanding by providing helpful co-textual and contextual cues (e.g., a topic cue or an associated term) to bring them into the right frame of mind to successfully recognize an intended word (for further suggestions, see Kennedy & Trofimovich, 2008). Learners should also be supported in developing co(n)textual sensitivity—that is, an understanding of when they will most likely have to engage in co(n)textualizing to preempt loss of intelligibility (e.g., when talking about a topic listeners are unfamiliar with) and when they will have to pay particular attention to their pronunciation because their interlocutor might be largely dependent on the acoustic signal (e.g., when the possibility to co(n)textualize is limited). As information on listeners’ background knowledge is not always available, developing an adaptability to spontaneously engage in contextualizing (and a sensitivity to when this becomes necessary) seems equally important. Such strategies and metalinguistic knowledge might be helpful in preempting miscommunication among interlocutors from different linguacultural backgrounds in the first place, making international communication in a vast number of real-world settings more efficient and less prone to communication breakdown.