Highlights
• Mandarin-English bilinguals use segments as the primary unit in L2 phonological encoding.
• The encoding unit is the same for high- and low-proficiency bilinguals.
• The segmental effects increase with more overlapping segments.
• The segmental effects decrease as stimulus onset asynchronies increase.
1. Introduction
Speech production is a skilled cognitive action that conveys thoughts via audible sounds. During speech production, speakers go through several stages, namely conceptual preparation, lexical selection, phonological encoding, and articulation (e.g., Dell, Reference Dell1986; Levelt et al., Reference Levelt, Roelofs and Meyer1999). Abstract lexical information is transcoded into physical speech sounds during phonological encoding. Dysfunction at this stage is one of the main causes of anomia in aphasic patients (e.g., Calabria et al., Reference Calabria, Grunden, Iaia and García-Sánchez2020; Schwartz, Reference Schwartz2014) and of tip-of-the-tongue instances in healthy speakers (e.g., Sadat et al., Reference Sadat, Martin, Costa and Alario2014).
It is generally agreed upon that segments are the primary phonological encoding units of spoken word production in Indo-European languages (e.g., Damian & Dumay, Reference Damian and Dumay2007, Reference Damian and Dumay2009; O’Seaghdha et al., Reference O’Seaghdha, Chen and Chen2010 for English; Roelofs, Reference Roelofs1999 for Dutch). For instance, if a speaker plans to say the word “monkey,” the segments /m/, /ʌ/, /ŋ/, /k/, /i/ will be retrieved, respectively, as well as its metrical framework (i.e., a disyllabic structure with lexical stress on the first syllable). After accessing the set of segments and the corresponding metrical frame, the segmental information is inserted into the metrical frame in a rightward incremental fashion to construct the syllables [‘mʌŋ.ki] (syllable boundaries indicated by dots; e.g., Cholin et al., Reference Cholin, Schiller and Levelt2004; Meyer & Schriefers, Reference Meyer and Schriefers1991; Roelofs, Reference Roelofs2015; Wheeldon & Levelt, Reference Wheeldon and Levelt1995; see Figure 1).

Figure 1. Model of phonological encoding for English and Mandarin Chinese (adapted from Schiller, Reference Schiller2006, and Zhang et al., Reference Zhang, Zhu and Damian2018). The apostrophe marks the stress position in English and the number marks the lexical tone in Mandarin Chinese, with “2” indicating a rising tone.
In the form preparation paradigm, speakers generally respond faster in a segment-homogeneous condition compared to a heterogeneous condition (Alario et al., Reference Alario, Perre, Castel and Ziegler2007; Damian & Bowers, Reference Damian and Bowers2003; Jacobs & Dell, Reference Jacobs and Dell2014; Meyer, Reference Meyer1991). This suggests that speakers can prepare overlapping segments. Further evidence for the segment as the encoding unit has also been reported in other speech production paradigms, such as in the picture-word interference paradigm (e.g., Damian & Martin, Reference Damian and Martin1999; Meyer & Schriefers, Reference Meyer and Schriefers1991) and the masked priming paradigm (e.g., Forster & Davis, Reference Forster and Davis1991; Malouf & Kinoshita, Reference Malouf and Kinoshita2007; Schiller, Reference Schiller1998, Reference Schiller2000).
However, for Mandarin Chinese, studies found that the primary phonological encoding units in speech production are more likely to be syllables instead of segments. Studies using various paradigms have demonstrated that syllabic overlap (e.g., “鼻 /bi2/” and “笔 /bi3/”) instead of segmental overlap (e.g., “鼻 /bi2/” and “布 /bu4/”) significantly affects speech production in Mandarin Chinese (e.g., masked priming paradigm, Cai et al., Reference Cai, Yin and Zhang2020; Chen et al., Reference Chen, Lin and Ferrand2003; Chen et al., Reference Chen, O’Séaghdha and Chen2016; Zhang & Damian, Reference Zhang and Damian2019; picture-word interference paradigm, Zhang & Yang, Reference Zhang and Yang2005; picture naming paradigm, You et al., Reference You, Zhang and Verdonschot2012). Please note that phonemic effects were observed in ERPs (see Cai et al., Reference Cai, Yin and Zhang2020; Qu et al., Reference Qu, Damian and Kazanina2012), which were suggested to reflect a phonemic encoding stage after syllabic encoding (Cai et al., Reference Cai, Yin and Zhang2020). O’Seaghdha et al. (Reference O’Seaghdha, Chen and Chen2010) proposed the proximate units principle to explain differences in phonological encoding units across languages. With this principle, O’Seaghdha et al. (Reference O’Seaghdha, Chen and Chen2010) refer to the proximate units as the primary phonological encoding units, that is, the first explicitly selectable phonological production units. According to this principle, the primary phonological encoding units have cross-linguistic variations. Specifically, segments are claimed to be the primary phonological encoding units in Indo-European languages (e.g., O’Seaghdha et al., Reference O’Seaghdha, Chen and Chen2010; Roelofs, Reference Roelofs1999) but syllables in Chinese (e.g., Cai et al., Reference Cai, Yin and Zhang2020; Zhang & Damian, Reference Zhang and Damian2009; Zhang & Yang, Reference Zhang and Yang2005; see Figure 1).
With such cross-linguistic differences, researchers have been drawn to the mechanisms of phonological encoding in bilinguals. It is believed that bilinguals have shared lexical representations across languages (e.g., Macizo, Reference Macizo2016), although there are disputes over whether a non-target language’s phonological form is activated in speech production of bilinguals (see, e.g., Costa et al., Reference Costa, Miozzo and Caramazza1999; De Bot, Reference De Bot1992; Green, Reference Green1998; Poulisse & Bongaerts, Reference Poulisse and Bongaerts1994 for the Language-Specific Phonological Activation account, see Costa, Reference Costa, Kroll and De Groot2005 for a review; and see, e.g., Macizo, Reference Macizo2016; Nakayama et al., Reference Nakayama, Verdonschot, Sears and Lupker2014; Spalek et al., Reference Spalek, Hoshino, Wu, Damian and Thierry2014; Thierry & Wu, Reference Thierry and Wu2004; Xu et al., Reference Xu, Lin and Dong2021; Zhang et al., Reference Zhang, Qian and Zhu2021 for the Language Non-specific Phonological Activation account). In second language (L2) speech production, bilinguals may recruit the processing mechanisms of their native language (i.e., L1) to produce L2, leading to the assimilation hypothesis (e.g., Liu et al., Reference Liu, Hu, Qu, Zhang, Su, Li and Mei2023; Xin et al., Reference Xin, Lan and Zhang2020), or recruit additional neural networks to accommodate L2 processing, leading to the accommodation hypothesis (e.g., Cao et al., Reference Cao, Tao, Liu, Perfetti and Booth2013).
In the phonological encoding stage of L2 speech production, it remains unresolved whether Mandarin Chinese-English bilinguals are influenced by their native language (i.e., syllables as primary units) or conform to L2 (i.e., segments as primary units). Previous studies have shown discrepancies in terms of the primary phonological encoding units in L2 speech production (e.g., Li et al., Reference Li, Wang and Davis2017; Timmer & Chen, Reference Timmer and Chen2017; Verdonschot et al., Reference Verdonschot, Nakayama, Zhang, Tamaoka and Schiller2013; Wang et al., Reference Wang, Wong and Chen2021; Xin et al., Reference Xin, Lan and Zhang2020). For instance, using a colored picture-naming task where participants produced noun phrases (e.g., 藍駱駝, /laam4/ /lok3to4/, “blue camel”), Timmer and Chen (Reference Timmer and Chen2017) reported an (onset) segment priming effect for Dutch-Cantonese bilinguals in their L2 (i.e., Cantonese), whose phonological encoding units are believed to be larger than the phoneme (e.g., Wong et al., Reference Wong, Huang and Chen2012). Their results indicate that Dutch-Cantonese bilinguals employed the L1 (i.e., Dutch) phonological encoding units to encode their L2. However, Xin et al. (Reference Xin, Lan and Zhang2020) reported syllabic priming effects for English-Mandarin Chinese bilinguals when they named pictures in L1 or L2 in the picture-word interference paradigm, suggesting that they relied on the same phonological encoding units as Mandarin Chinese native speakers. Xin et al. (Reference Xin, Lan and Zhang2020) attributed this inconsistency to the language environment in which the experiments were carried out (see Li & Wang, Reference Li and Wang2017 for the influence of language environment on L1 phonological encoding). Specifically, participants whose daily language environment is Mandarin Chinese use the same phonological encoding units as native Mandarin Chinese speakers when they produce L2-Mandarin Chinese.
The study by Li et al. (Reference Li, Wang and Idsardi2015) suggests that tasks that explicitly require orthographic information processing, such as associative naming cued by visually presented prompt words, encourage participants to employ different phonological encoding units in their L1 production. Nevertheless, in the two studies above (i.e., Timmer & Chen, Reference Timmer and Chen2017; Xin et al., Reference Xin, Lan and Zhang2020), although orthographic information processing was not required in the picture naming task of either study, participants used different phonological encoding units in L2 production. Therefore, the cross-task differences cannot completely explain the discrepant findings in Mandarin Chinese and Indo-European languages.
Furthermore, differences in L2 proficiency may contribute to different processing mechanisms of phonological encoding during L2 speech production. It is suggested that the degree to which bilinguals inhibit the non-response language is dependent on their L2 proficiency (e.g., Costa et al., Reference Costa, Colomé, Gómez and Sebastián-Gallés2003; Costa & Santesteban, Reference Costa and Santesteban2004; Guo & Peng, Reference Guo and Peng2006; Nakayama et al., Reference Nakayama, Kinoshita and Verdonschot2016; see Jiao et al., Reference Jiao, Grundy, Liu and Chen2020 for a review of executive control to manage bilingual processing), and thus high and low proficiency bilinguals may demonstrate differences in response times in speech production (e.g., Dash & Kar, Reference Dash and Kar2020; De Bot, Reference De Bot2004; Macizo, Reference Macizo2016). For instance, Nakayama et al. (Reference Nakayama, Kinoshita and Verdonschot2016) recruited Japanese-English bilinguals with high or low L2 (i.e., English) proficiency and asked them to read aloud English words preceded by masked primes that overlapped in just the onset segment (e.g., bark-BENCH) or the onset segment plus the following vowel corresponding to the mora-sized units CV (consonant + vowel; e.g., bell-BENCH). Participants demonstrated different phonological encoding units in L2 (i.e., English) spoken word production, that is, high proficiency Japanese-English bilinguals showed a significant onset segment priming effect while the low proficiency group showed CV priming, indicating that high proficiency bilinguals used segments as the primary phonological encoding units while low proficiency bilinguals used the mora-sized units CV (Nakayama et al., Reference Nakayama, Kinoshita and Verdonschot2016).
Similar findings were also reported by Verdonschot et al. (Reference Verdonschot, Nakayama, Zhang, Tamaoka and Schiller2013), who used a masked priming-naming task to investigate Mandarin Chinese-English bilinguals’ L2 speech production. They found that bilinguals with high L2 proficiency showed a significant masked onset segment priming effect in L2 production, employing the same phonological encoding units (i.e., segments) as English native speakers did. The results of Nakayama et al. (Reference Nakayama, Kinoshita and Verdonschot2016) and Verdonschot et al. (Reference Verdonschot, Nakayama, Zhang, Tamaoka and Schiller2013) suggest that the primary phonological encoding units used by high-proficiency bilinguals were accommodated to their L2, even when their language environment is not L2. This finding also contradicts that of Timmer and Chen (Reference Timmer and Chen2017), who found that bilinguals’ L2 phonological encoding units were assimilated to their L1. However, the study of Verdonschot et al. (Reference Verdonschot, Nakayama, Zhang, Tamaoka and Schiller2013) did not include bilinguals with low L2 proficiency, whose results are necessary to resolve the discrepancy.
Given the cross-linguistic differences in primary phonological encoding units as well as the influence of L2, it is necessary to resolve the discrepancies over the primary phonological encoding units in L2, especially with varied L2 proficiency. Therefore, we aim to investigate the primary phonological encoding units in the L2 production of Mandarin Chinese-English bilinguals with high and low L2 proficiency who are not immersed in L2, to avoid possible influence from the language environment (see, e.g., Li & Wang, Reference Li and Wang2017; Xin et al., Reference Xin, Lan and Zhang2020). The present study addresses the following research questions: (1) What are the primary phonological encoding units of Mandarin Chinese-English bilinguals when they utter L2? More specifically, will the bilinguals encode L2 words using L1 units or L2 units? (2) Are there any differences between high and low-proficiency Mandarin Chinese-English bilinguals in terms of the primary phonological encoding units? Based on previous research in Japanese (Nakayama et al., Reference Nakayama, Kinoshita and Verdonschot2016), we hypothesize that Mandarin Chinese-English bilinguals with high L2 proficiency use segments as the primary phonological encoding units in L2 speech production, whereas Mandarin Chinese-English bilinguals with low L2 proficiency use syllables as the primary phonological encoding units.
2. Methods
2.1. Participants
Two groups of native Mandarin Chinese speakers differing in their English proficiency participated in this study. They were recruited from a university in Northern China. All participants were right-handed, with normal or corrected-to-normal vision. The students were paid for their participation and signed an informed consent letter. The high L2 proficiency group (Group 1) consisted of 30 students majoring in English (2 males; average age = 22 years; SD = 1.91 years). All of them passed the Test for English Majors-Band 4 (TEM-4) and/or the TEM-8 when applicable. TEM-4 and TEM-8 are authoritative tests to judge the English proficiency of university undergraduate English majors in China (Chen, Reference Chen2022). Participants who are able to pass these two tests are generally considered to have a relatively high proficiency in English. The low L2 proficiency group (Group 2) consisted of another 30 students (6 males; average age = 19.54 years; SD = 0.88 years), who had studied English for less than four semesters at the university according to a systematic curriculum. These participants had passed the College English Test Band 4 (CET-4), which is a large-scale test used to test the English proficiency of Chinese non-English majors (Wu et al., Reference Wu, Chen and Zheng2022), indicating that they were equipped with general knowledge of English, but had less L2 experience and lower L2 proficiency than Group 1. Before the experiments, all participants were asked to fill out the Language Experience and Proficiency Questionnaire (LEAP-Q; Marian et al., Reference Marian, Blumenfeld and Kaushanskaya2007), and their self-assessment scores are listed in Table 1. The differences between scores of high- and low-proficiency bilinguals were significant (ps < .0001).
Table 1. Self-assessment scores for the L2 English language skills from high and low proficiency bilinguals; the level was marked from 1 to 10, with 10 being the highest

2.2. Design
The present study employed the picture-word interference paradigm, which is sensitive to the phonological relationship between the target picture and the distractor word (e.g., Levelt et al., Reference Levelt, Schriefers, Vorberg, Meyer, Pechmann and Havinga1991; Meyer & Schriefers, Reference Meyer and Schriefers1991; Starreveld, Reference Starreveld2000). The picture-word interference paradigm is a widely used paradigm to investigate the process of speech production (e.g., Cai et al., Reference Cai, Yin and Zhang2020; Wong et al., Reference Wong, Huang and Chen2012; Xin et al., Reference Xin, Lan and Zhang2020; Zhang & Yang, Reference Zhang and Yang2005). In this paradigm, participants are required to name the target picture while trying to ignore the distractor word, which is superimposed on the line drawing portraying concrete objects (Glaser & Düngelhoff, Reference Glaser and Düngelhoff1984) and shares certain properties with the target picture name. The target picture and the distractor may appear at pre-determined stimulus onset asynchronies (SOAs, the time duration between the distractor and the target) to reveal the time course of any potential effect. The studies of both Mandarin Chinese (e.g., Bi et al., Reference Bi, Xu and Caramazza2009; Wang et al., Reference Wang, Wong and Chen2021; Zhang & Yang, Reference Zhang and Yang2005; Reference Zhang and Yang2006; Zhao et al., Reference Zhao, La Heij and Schiller2012) and English (e.g., Damian & Martin, Reference Damian and Martin1999; Jescheniak & Schriefers, Reference Jescheniak and Schriefers2001) manifested relatively stable phonological effects at positive SOAs (i.e., the target picture appears prior to the distractor word). The phonological forms of both the target picture and the distractor word will be activated as soon as they are retrieved, and the phonological relatedness will facilitate the naming process (see Bürki, Reference Bürki2017 for a review). 
Thus, the current study chose three positive SOAs where phonological relatedness has been reported to facilitate picture naming (0 ms, 75 ms, and 150 ms, see also e.g., Wang et al., Reference Wang, Wong and Chen2021; Zhang & Weekes, Reference Zhang and Weekes2009) to investigate the primary phonological encoding units in L2 (i.e., English) spoken word production by Mandarin Chinese-English bilinguals with varied L2 proficiency.
Meanwhile, the degree of phonological relatedness between the target word and the distractor word was manipulated. There were four distractor types for each target, according to the extent of overlap in their phonological forms, that is, (1) syllabic overlap (S+), (2) two-segment overlap (P2+), (3) one-segment overlap (P1+), and (4) unrelated (U). The experimental design included two factors: Distractor Type (4 conditions: S+, P2+, P1+, U) and SOA (3 levels: 0 ms, 75 ms, 150 ms). There were 480 trials in total (40 pictures × 4 conditions × 3 SOAs), blocked by SOA. All trials were presented pseudo-randomly to make sure the same condition would not appear in two consecutive trials. The sequence of trials was counterbalanced across participants. There were self-paced rests between blocks. The materials and design were identical for the two groups.
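The pseudo-randomization constraint described above can be illustrated with a short Python sketch. The source does not report how the trial order was generated, so the scheme below is only one possible implementation: trials are built in rounds of four, one per distractor type, and a round is reshuffled whenever it would begin with the condition that ended the previous round. The function names, the seed-based variation of block order, and the integer picture indices are hypothetical.

```python
import random

CONDITIONS = ["S+", "P2+", "P1+", "U"]
SOAS_MS = [0, 75, 150]
N_PICTURES = 40  # 25 targets plus 15 fillers, as reported in the text


def build_block(soa_ms, rng):
    """Build one SOA block in which no two consecutive trials share a condition."""
    # One independently shuffled picture order per distractor type.
    picture_queues = {}
    for cond in CONDITIONS:
        pictures = list(range(N_PICTURES))
        rng.shuffle(pictures)
        picture_queues[cond] = pictures

    trials = []
    previous = None
    for _ in range(N_PICTURES):
        order = CONDITIONS[:]
        rng.shuffle(order)
        # Reshuffle if this round would repeat the condition that ended the last one.
        while order[0] == previous:
            rng.shuffle(order)
        for cond in order:
            trials.append(
                {"picture": picture_queues[cond].pop(), "condition": cond, "soa_ms": soa_ms}
            )
        previous = order[-1]
    return trials


def build_session(seed=0):
    """Return all trials for one participant, blocked by SOA."""
    rng = random.Random(seed)
    soa_order = SOAS_MS[:]
    rng.shuffle(soa_order)  # assumption: block order varies with the participant seed
    return [trial for soa in soa_order for trial in build_block(soa, rng)]
```

Calling build_session with a different seed per participant yields 480 trials in three SOA blocks of 160 trials each. The constraint is enforced within blocks only, on the assumption that the rest between blocks makes the block boundary irrelevant.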
2.3. Materials
Twenty-five target pictures were selected from CRL-IPNP (CRL International Picture Naming Project; Bates et al., Reference Bates, Federmeier, Dan, Iyer and Pechmann2000) and the standardized Snodgrass and Vanderwart picture databases (Snodgrass & Vanderwart, Reference Snodgrass and Vanderwart1980) or were drawn in a similar style. Target picture names were all monosyllabic. There were four distractor types for each target, according to the extent of overlap in their phonological forms: syllabic overlap (S+), two-segment overlap (P2+), one-segment overlap (P1+), and unrelated (U). For instance, one target picture was a line drawing of a nest, and its distractor words were: nest (S+), neck (P2+), nap (P1+), and salt (U). Distractor words and target pictures were matched in terms of word frequency, t = −.658, p = .512, based on the log frequency in the SUBTLEX-UK database (Van Heuven et al., Reference Van Heuven, Mandera, Keuleers and Brysbaert2014), and visual complexity (number of letters), t = −.473, p = .638. Each distractor word was semantically unrelated to its target picture. The pairs were also considered phonologically unrelated in their Chinese translations, except for one or two instances of onset or rhyme overlap between the target and one of the distractor conditions. Nevertheless, since English words and their Chinese translations do not stand in one-to-one correspondence, the rare instances of onset or rhyme overlap should not affect our results. Another 15 picture names were selected as fillers from the same databases.
2.4. Procedure and analysis
Participants were seated in a comfortable chair in a quiet room facing a computer screen, approximately 60 cm away from the screen. Before starting the experiment, the participants filled out the Language Experience and Proficiency Questionnaire (LEAP-Q; Marian et al., Reference Marian, Blumenfeld and Kaushanskaya2007) and signed an agreement to participate in the experiment voluntarily.
The experiment consisted of a familiarization session, a practice session, and a formal experimental session. Participants were first presented with the line drawings on the screen with the target names underneath. After being familiarized with all the target pictures, they were asked to name the pictures in English without the names presented. Any mistakes were pointed out to the participants and corrected by the experimenter.
The formal experiment started with a fixation cross “+” appearing in the middle of the screen for 300 ms. After the fixation cross disappeared, a blank screen appeared and lasted for 20 ms. Then, a target picture was presented with a distractor word superimposed at one of the SOAs. Finally, the picture-word combination disappeared when the vocal response triggered the microphone, or after 2 s if the participants failed to name the target. The whole experiment lasted about 25 minutes. The procedure was identical for the two groups and is illustrated in Figure 2.

Figure 2. Procedure of the experiment.
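The timing of a single trial can be summarized in a small Python sketch. It only lays out event onsets from the durations reported above and is not the PsychoPy script used in the study. The function and event names are hypothetical, and measuring the 2 s timeout from picture onset is an assumption, since the text does not say which event starts that interval.

```python
FIXATION_MS = 300   # fixation cross duration
BLANK_MS = 20       # blank screen duration
TIMEOUT_MS = 2000   # assumption: counted from picture onset


def trial_timeline(soa_ms):
    """Return (event, onset_ms) pairs, with time 0 at fixation onset."""
    picture_onset = FIXATION_MS + BLANK_MS
    # Positive SOA: the distractor word follows the target picture.
    distractor_onset = picture_onset + soa_ms
    return [
        ("fixation_on", 0),
        ("blank_on", FIXATION_MS),
        ("picture_on", picture_onset),
        ("distractor_on", distractor_onset),
        ("timeout", picture_onset + TIMEOUT_MS),
    ]
```

For example, dict(trial_timeline(75)) places the picture at 320 ms and the distractor at 395 ms after fixation onset.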
The experiment was conducted with PsychoPy2 Version 2021.2 (Peirce et al., Reference Peirce, Gray, Simpson, MacAskill, Höchenberger, Sogo, Kastman and Lindeløv2019) with stimuli presented on a 15-inch computer screen 60 cm away from the participant. The reaction times (RTs, i.e., the naming latencies) were measured online by an HP laptop microphone. RTs were collected and manually checked using the program CheckVocal (Protopapas, Reference Protopapas2007) based on the participants’ vocal responses. R Version 3.1.0 (R Core Team, 2014) was used to analyze participants’ picture naming RTs. The initial model was built with the “lme4” package (Bates et al., Reference Bates, Maechler, Bolker and Walker2014) and included two fixed predictors (distractor type and SOA), their interaction, and two random intercepts (participants and target pictures). The naming latencies showed a skewed distribution and were therefore log-transformed. The log-transformed naming latencies were submitted to the mixed-effects modeling in R as the dependent variable. The data analysis procedure was identical for both groups. There was a significant interaction between distractor type and SOA for both groups of participants (ps < .001). Therefore, the data were divided into three subsets, one per SOA. Separate models were then built for each subset, with distractor type as the fixed predictor and random intercepts for participants and target pictures.
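As a reading aid, the initial model described above can be written out as follows. This is a sketch in our own notation rather than the authors' specification: the indicator variables x and z, the coefficient symbols, and the choice of the unrelated condition and SOA = 0 ms as reference levels are assumptions.

```latex
\log \mathrm{RT}_{ijk} = \beta_0
  + \sum_{d \in \{\mathrm{S+},\,\mathrm{P2+},\,\mathrm{P1+}\}} \beta_d \, x^{(d)}_{ijk}
  + \sum_{s \in \{75,\,150\}} \gamma_s \, z^{(s)}_{ijk}
  + \sum_{d,\,s} \delta_{ds} \, x^{(d)}_{ijk} \, z^{(s)}_{ijk}
  + u_i + w_j + \varepsilon_{ijk},
\qquad
u_i \sim \mathcal{N}(0, \sigma_u^2), \quad
w_j \sim \mathcal{N}(0, \sigma_w^2), \quad
\varepsilon_{ijk} \sim \mathcal{N}(0, \sigma^2)
```

Here i indexes participants, j target pictures, and k trials; x and z are 0/1 indicators for distractor type and SOA, u and w are the random intercepts for participants and pictures. In lme4 formula notation this would correspond to something like log(RT) ~ distractor * soa + (1 | participant) + (1 | picture), where the variable names are hypothetical. The per-SOA follow-up models drop the SOA and interaction terms and keep only the distractor-type coefficients and the two random intercepts.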
3. Results
3.1. Group 1 – high L2 proficiency
3.1% of 9,000 data points, including incorrect naming and false voice triggering (2.46%) and outliers (i.e., data points that exceed a participant’s mean RTs by 3 SDs, 0.64%), were excluded from further analysis. A total of 8,725 data points were submitted to R. The error rates were relatively low, and thus not included in further statistical analysis. Descriptive statistics are provided in Table 2.
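The outlier criterion can be sketched in Python as follows. The text describes outliers as data points exceeding a participant's mean RT by 3 SDs; treating this as a symmetric cutoff around each participant's own mean, and using the sample standard deviation, are assumptions, as are the function and variable names. The constants show that the reported 9,000 data points are consistent with 25 target pictures, 4 distractor types, 3 SOAs, and 30 participants.

```python
from statistics import mean, stdev

N_TARGETS = 25
N_CONDITIONS = 4
N_SOAS = 3
N_PARTICIPANTS = 30
TOTAL_DATA_POINTS = N_TARGETS * N_CONDITIONS * N_SOAS * N_PARTICIPANTS


def exclude_outliers(rts_by_participant, n_sd=3.0):
    """Drop RTs further than n_sd standard deviations from each participant's mean."""
    kept = {}
    for participant, rts in rts_by_participant.items():
        centre = mean(rts)
        spread = stdev(rts)  # sample SD; needs at least two RTs per participant
        kept[participant] = [rt for rt in rts if abs(rt - centre) <= n_sd * spread]
    return kept
```

Error trials (incorrect names and false voice triggers) would be removed before this step, so the function is meant to be applied to correct responses only.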
Table 2. Mean reaction times (RTs) in ms and standard deviation (SD) for high proficiency bilinguals

For high proficiency bilinguals, at SOA = 0 ms, the model showed significant differences between the unrelated condition and other phonologically related conditions, suggesting that phonological relatedness facilitated picture naming. However, at SOA = 75 ms, the significant phonological facilitation effects were only obtained for the P2+ and S+ conditions, and at SOA = 150 ms, only for the S+ condition. See Table 3 for the results summary for high-proficiency bilinguals.
Table 3. Results for coefficient estimates, standard errors (SE), t values, and p values for the effect of distractor type in each SOA condition for high proficiency bilinguals

As shown in Figure 3, Tukey’s multiple comparison tests showed that the differences between S+ and P2+ conditions reached significance over the range of SOA from 0 ms to 150 ms, βs < .0308, ps < .001, which revealed that facilitation increased as the amount of overlap increased. Moreover, for P2+ and P1+ conditions, the average time difference of 31 ms reached significance when SOA was 0 ms, β = −.017, p < .001, as did the average difference of 20 ms when SOA was 75 ms, β = −.012, p = .018. However, the average 7 ms difference at SOA = 150 ms did not reach significance, β = .003, p = .849.

Figure 3. RT differences between the unrelated and phonologically related conditions for high proficiency bilinguals in Group 1. The dashed lines below the RT bars represent pairwise comparison results between adjacent levels in the chart (* p < .05; ** p < .01; *** p < .001).
3.2. Group 2 – low L2 proficiency
6.04% of 9,000 data points were discarded (4.73% errors and 1.31% outliers). A total of 8,456 data points were submitted to R. The error rates were relatively low and thus were not included in further statistical analysis. Descriptive statistics are provided in Table 4 and detailed results are provided in Table 5.
Table 4. Mean reaction times (RTs) in ms and standard deviation (SD) for low proficiency bilinguals

Table 5. Results for coefficient estimates, standard errors (SE), t values and p values for the effect of distractor type in each SOA condition for low proficiency bilinguals

For low proficiency bilinguals, at SOA = 0 ms, 75 ms, and 150 ms, the model showed significant differences between the unrelated condition and other phonologically related conditions, suggesting that phonological relatedness facilitated picture naming at all the predefined SOAs. See Table 5 for the summary of results for low-proficiency bilinguals.
As shown in Figure 4, Tukey’s multiple comparison test was carried out and showed that the differences between the S+ and P2+ conditions reached significance at all SOA conditions, βs < .031, ps < .001. For P2+ and P1+ conditions, the 40 ms difference at SOA = 0 ms was significant, β = −.021, p < .001, and the 26 ms difference at SOA = 75 ms was also significant with β = −.012, p = .023. However, the effect at SOA = 150 ms was not significant, β = −.002, p = .986.

Figure 4. RT differences between the unrelated and phonologically related conditions for low proficiency bilinguals in Group 2. The dashed lines below the RT bars represent pairwise comparison results between adjacent levels in the chart (* p < .05; ** p < .01; *** p < .001).
4. Discussion
Using the picture-word interference paradigm, the present study examined the primary phonological encoding units of L2 spoken word production in Mandarin Chinese-English bilinguals with high and low L2 proficiency. In both groups of participants, phonological facilitation effects were observed with segmental overlap (one or two segments) and syllabic overlap, suggesting Mandarin Chinese-English bilinguals employed segments as the primary phonological encoding units during spoken word production in their L2, resembling the units employed by native English speakers.
In both groups, overlap in the onset segment produced significant phonological facilitation effects, suggesting that Mandarin Chinese-English bilinguals use segments as the primary phonological encoding units in L2 spoken word production regardless of their L2 proficiency. The onset priming effect is consistent with the one reported by Schiller (Reference Schiller2000) in an English monolingual picture naming task, which investigated the functional role of segments in English phonological encoding. As syllables were assumed to be the primary phonological encoding units in Mandarin Chinese (e.g., Cai et al., Reference Cai, Yin and Zhang2020; Chen et al., Reference Chen, Chen and Dell2002; O’Seaghdha et al., Reference O’Seaghdha, Chen and Chen2010; Zhang & Yang, Reference Zhang and Yang2005), it seemed that Mandarin Chinese-English bilinguals employed language-specific units, that is, segments, to perform phonological encoding when producing their L2. This finding indicates that Mandarin Chinese-English bilinguals adopt an additional system for L2 phonological processing, supporting the accommodation hypothesis.
The onset priming effect previously observed with the masked priming paradigm (Verdonschot et al., Reference Verdonschot, Nakayama, Zhang, Tamaoka and Schiller2013) was thus corroborated with picture naming. Whereas that study tested only highly proficient bilinguals, our study further revealed that even when the participants’ L2 proficiency was relatively low, segments were still employed as the primary phonological encoding units.
However, the finding of the low-proficiency group using segments as the primary phonological encoding units is inconsistent with that of Nakayama et al. (Reference Nakayama, Kinoshita and Verdonschot2016), where low-proficiency Japanese-English bilinguals showed CV priming but not segmental onset priming. One possible reason for the discrepancy is that most Mandarin Chinese speakers use Pinyin, an alphabetic transcription system to represent the sounds of the language, as the input method in typing, whereas Japanese speakers tend to use kana that usually represents a CV structure in typing. However, Japanese speakers may use “romaji,” similar to Pinyin, when typing on a computer keyboard. Another possible reason is that the former study employed a reading-aloud task with prime words, whereas we used a picture naming task with visual distractors, which could contribute to the different results of Nakayama et al. (Reference Nakayama, Kinoshita and Verdonschot2016) and our study. Further research is needed to examine these possibilities.
In addition, we observed increasing effects with more overlapping segments during L2 phonological encoding, which was consistent with the results in Dutch (Schiller, Reference Schiller1998) and English (Schiller, Reference Schiller1999, Reference Schiller2000) native speakers. Specifically, in both groups, the time difference reached significance between the S+ and P2+ conditions as well as between the P2+ and P1+ conditions across SOAs (except for the P2+ versus P1+ difference at SOA = 150 ms). The increasing effects of segmental overlap, with more overlapping segments producing larger facilitation effects in L2 production (see Figures 3 and 4), were consistent with the predictions that the overlapping segments increased the activation level of the target’s phonemes and thus facilitated the syllabification at the phonological word (Levelt et al., Reference Levelt, Roelofs and Meyer1999; Meyer & Schriefers, Reference Meyer and Schriefers1991; Wheeldon, Reference Wheeldon2003).
Although both groups of participants showed the segmental priming effect in phonological encoding, these two groups’ performances were different in terms of distractor type and SOA. Specifically, the high-proficiency bilinguals seemed to have a naming advantage in L2 over low-proficiency bilinguals, based on a post-hoc t-test between the mean reaction times of the two groups in all the conditions (t = −18.332, p < .0001). One of the probable reasons for the naming speed difference could be lexical competition between L1 and L2, with L1 causing stronger interference in the low proficiency group (Colomé, Reference Colomé2001; Costa et al., Reference Costa, Colomé and Caramazza2000; Guo & Peng, Reference Guo and Peng2006; Hoshino & Thierry, Reference Hoshino and Thierry2011; Macizo, Reference Macizo2016; Sullivan et al., Reference Sullivan, Poarch and Bialystok2018). However, it could also be that lexical access in the high proficiency group becomes more efficient with increased L2 proficiency. Yet another possibility is that the prolonged naming is caused by a delay at the L2 phonetic encoding stratum. Previous studies suggested that the disadvantages in the speed of speech production originate at the phonetic encoding level, which slows the execution of speech (e.g., Broos et al., Reference Broos, Duyck and Hartsuiker2018). Future research is needed to explore these different possibilities directly.
Additionally, there was a significant interaction between distractor type and SOA in both groups. Specifically, the priming effect was smaller at larger positive SOAs, and it was even absent in the P1+ condition at SOA = 75 ms, as well as in the P1+ and P2+ conditions at SOA = 150 ms, for high-proficiency bilinguals. One possible reason is that the process of phonological encoding is (nearly) finished at these later points in time, especially for the high-proficiency group, which tends to have faster word production. According to the temporal signature of word production components proposed by Indefrey and Levelt (2004), lexical access starts within the time window of 250 ms after stimulus onset in spoken word encoding. It is followed by phonological encoding, which starts with phonological code retrieval at around 330 ms, continues with online syllabification at around 455 ms, and ends with phonetic encoding at approximately 600 ms. Crucially, the encoding takes about 25 ms per phonemic segment for native Dutch speakers (Van Turennout et al., 1997), while the speed may be slower for L2 learners processing their weaker language (e.g., Dash & Kar, 2020; De Bot, 2004; Macizo, 2016). The mean number of phonemic segments in the target picture names was around four in our experiment. Thus, the segmental encoding cost would be around 100 ms for four phonemic segments, and the recognition of a distractor takes about 100 ms (e.g., Hauk et al., 2006). Therefore, the distractor might be presented too late to affect the production process. In other words, at SOAs of 75 ms and 150 ms in the P1+ and P2+ conditions, high-proficiency bilinguals might have finished the segmental encoding process by the time the distractor was effectively recognized.
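The arithmetic behind this estimate can be laid out explicitly. The following is a minimal sketch, not part of the original analysis: the constants are the approximate values quoted above (25 ms per segment from native Dutch speakers, about four segments per target name, about 100 ms for distractor recognition), the SOA list contains only the values mentioned in this discussion, and the function names are hypothetical. Because L2 encoding may be slower than the native-speaker figure, the per-segment rate is left as a parameter.

```python
# Approximate values quoted in the text; all are estimates, not measurements
# from this study's participants.
MS_PER_SEGMENT = 25              # native Dutch speakers (Van Turennout et al., 1997)
MEAN_SEGMENTS = 4                # approximate mean length of the target names
DISTRACTOR_RECOGNITION_MS = 100  # approximate word recognition time (Hauk et al., 2006)
SOAS_MS = [0, 75, 150]           # SOAs mentioned in this discussion


def segmental_encoding_cost(n_segments, ms_per_segment=MS_PER_SEGMENT):
    """Estimated time (ms) needed to encode n_segments phonemic segments."""
    return n_segments * ms_per_segment


def distractor_recognized_at(soa_ms, recognition_ms=DISTRACTOR_RECOGNITION_MS):
    """Estimated time (ms after picture onset) at which a distractor
    presented at the given SOA has been recognized."""
    return soa_ms + recognition_ms


if __name__ == "__main__":
    cost = segmental_encoding_cost(MEAN_SEGMENTS)
    print(f"Estimated segmental encoding cost: {cost} ms")
    for soa in SOAS_MS:
        t = distractor_recognized_at(soa)
        print(f"SOA = {soa} ms: distractor recognized about {t} ms after picture onset")
```

Passing a larger `ms_per_segment` shows how the estimated encoding cost grows if L2 segmental encoding is slower than the native-speaker rate assumed here.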
By comparison, under the S+ and P2+ conditions at SOA = 0 ms and SOA = 75 ms, facilitation effects were obtained because there was enough processing time for both the distractor word and the target picture. Speakers benefited from the activated segments, which primed the shared phonological codes and produced the segmental priming effect. Furthermore, segmental priming at larger SOAs was more likely to be absent in the P1+ condition than in the P2+ condition, in contrast to the robust priming in the syllabic overlap condition (i.e., lexical overlap) at all the specified SOAs, suggesting that the first segment is encoded first and then the second. Nevertheless, more fine-grained research is needed before further conclusions can be drawn.
One limitation of the current study is that all target words were monosyllabic. As a consequence, distractor words in the syllabic overlap condition were identical to the target words. This also explains why the syllabic priming effects were the most prominent across all the SOAs. It has been shown that in the form preparation paradigm, Mandarin Chinese-English bilinguals and Japanese-English bilinguals manifest only syllabic preparation effects but not phonemic effects in disyllabic word production (Li et al., 2020). Future cross-paradigm studies with polysyllabic words are necessary to further investigate the syllabic priming effects. Nevertheless, the syllabic priming finding does not compromise the findings on the segmental priming effects.
To interpret our results within the framework of the WEAVER++ model (Levelt et al., 1999; Roelofs & Meyer, 1998) and the schematic representation of the lexical system of bilinguals (Costa et al., 2006), we assume that in the process of L2 picture naming for Mandarin Chinese-English bilinguals, after the selection of lexical concepts, the activation spreads to the corresponding lemma nodes in both L1 and L2 (see also Costa & Caramazza, 1999). Following lexical selection, the respective phonological forms are activated, followed by the phonological encoding of the target words. Although our study did not directly investigate L1 activation in L2 production, our results are compatible with this account, given the suggested possibility that L1 interference causes lexical competition in L2 production. Nevertheless, in terms of the phonological encoding units in L2 production, we did not observe any apparent influence from L1.
Finally, the findings of the current study may have some pedagogical implications for L2 speech learning and segmental acquisition, as well as pronunciation instruction. Since this study demonstrated the significant role of segmental encoding in L2 production in Mandarin Chinese-English bilinguals regardless of their L2 proficiency, teachers should make students aware of the importance of segments. Studies examining the impact of segment-based pronunciation instruction on intelligibility have demonstrated instructional gains (e.g., Saito, 2011; Saito & Lyster, 2012). Teachers may help students analyze their pronunciation features and help them identify and deal with features they find difficult to pronounce or discriminate (Wang, 2022).
In conclusion, we have investigated the primary phonological encoding units of L2 speech production for both high- and low-proficiency bilinguals. We found that Mandarin Chinese-English bilinguals, regardless of their L2 proficiency, employed segments as the primary phonological encoding units to process L2, demonstrating that they use the accommodation mechanism. In addition, we observed a decrease or even an absence of facilitation with fewer overlapping segments at later SOAs. Our results shed light on the detailed underlying mechanism of L2 phonological encoding and may provide implications for L2 segmental acquisition and pronunciation instruction.
Data availability statement
The data that support the findings of this study are openly available in OSF at https://osf.io/tb7pq/?view_only=e5a69b04a8fa45fa8c9ff338aaf9d5e1.
Acknowledgements
We thank our participants for their participation.
Funding statement
This research was supported by a grant from the National Social Science Fund of China (Grant No. 24CYY098) awarded to M.W. N.O.S. is supported by grant no. 9380177 from CityUHK.
Competing interest
The authors declare none.