Introduction
Infants are surrounded by rich and constantly varying speech input. For example, the same word can drastically differ acoustically when produced in different sentences, by different speakers, with different emotions, or for different purposes. One formidable task that infants face when learning language is to extract the invariable linguistic content from varying acoustical input, a task that adults accomplish with ease (Tuninetti et al., 2017). Yet despite the various approaches adopted, when and how children succeed in phonetic categorization (i.e., perceiving acoustically discriminable tokens as functionally equivalent) remains a central and still not fully understood question for language and cognitive science (Choi & Shukla, 2021; Crinnion et al., 2020; Mulak et al., 2017; Schatz et al., 2021; Swingley & Alarcon, 2018).
Using electroencephalography (EEG), the current study tested 20-month-old toddlers’ phonetic categorization of similar-sounding native vowels, and aimed to find the neural signatures underlying categorical discrimination. The toddlers’ brain responses were collected while they passively listened to a minimal pair of consonant-vowel-consonant (CVC) nonwords (i.e., the two nonwords differed solely in the vowel). Furthermore, this study investigated whether vocabulary size could be a contributing factor in speaker normalization at the neural level.
Normalizing speaker variability is an important aspect of phonetic categorization. The same phoneme or word can drastically differ acoustically when produced by different speakers, due to differences in the size of the vocal tract, the pattern of vocal fold vibration, and palate shape, as well as accent and social status (Harnsberger et al., 2010; Johnson & Sjerps, 2018), yet adults readily recognize the linguistic content of speech forms in the face of this variability. Normalizing speaker variability is a formidable task for young language learners when abstracting phonetic categories. Infants and young children are sensitive to both linguistic (i.e., vowel) and indexical (i.e., speaker) changes in isolated vowels (Mulak et al., 2017). Two studies indicated that by five years, children still had more difficulty recognizing words when these were produced by multiple speakers than by a single speaker (Goldinger et al., 1991; Ryalls & Pisoni, 1997). Speaker variability is particularly challenging for infants’ discrimination of acoustically similar vowels, since these overlap in terms of the first (F1) and second (F2) formants, the primary acoustic determinants of vowels, and individual speakers differ in terms of absolute F1 and F2 values (Feldman, Griffiths, et al., 2013; Johnson & Sjerps, 2018).
Different hypotheses have been proposed with regard to how infants learn to disambiguate overlapping phonetic categories. The distributional learning account hypothesizes that infants form phonetic categories by grouping the sounds that cluster in the perceptual space (Maye et al., 2002). However, this account has been questioned, given that across languages the F1/F2 distribution of vowels in infant-directed speech is insufficient to reliably delineate vowel boundaries (Cristia & Seidl, 2014; Englund & Behne, 2005; Kuhl et al., 1997; Martin et al., 2015; McMurray et al., 2013), and some evidence reveals that infants are sensitive to phonetic details that are too fine-grained and, thus, may hinder categorization (Schatz et al., 2021). Other studies have argued that word knowledge may guide infants to dissociate overlapping vowel categories. Given that infants learn to segment and recognize word forms in the same period as they learn to establish native phonological categories, knowledge of word forms (with or without semantic content) may scaffold discrimination of overlapping phonetic categories among preverbal infants (Feldman, Myers, et al., 2013; Yeung & Werker, 2009).
For example, the distributions of /ɛ/ and /æ/ overlap acoustically in English across different speakers; however, if the vowel in “cat” always has a higher F2 and lower F1 than the vowel in “bed” in the speech input, the lexical context may serve as an index of the different distributions of the acoustic characteristics of the vowels (Swingley, 2009; Swingley & Alarcon, 2018). Furthermore, the lexical restructuring model hypothesizes that vocabulary expansion, or the pairing of sound forms to meanings, is a driving force for refining phonological representation. For instance, knowing that /ʃip/ pairs with one object and /ʃIp/ with another will help infants to segment the words into phonemes and to understand that /i/ and /I/ are different vowels that distinguish word meaning (Metsala & Walley, 1998; Walley et al., 2003). Therefore, word templates, or the context in which particular phonemes occur, may serve as a helpful cue for young children to disambiguate phonetic categories that cannot be readily distinguished by acoustic information.
So far, however, the relationship between phonetic categorization and vocabulary development has not been tested directly. Previous studies have shown that behavioral and neural discrimination of native phonemes in the first year of life correlates with both concurrent and future vocabulary (Garcia-Sierra et al., 2011; Kuhl et al., 2008; Singh, 2019). Yet it is in the second year of life that infants’ vocabulary starts to expand quickly (Goldfield & Reznick, 1990; Nazzi & Bertoncini, 2003), and as hypothesized by the lexical restructuring model (Metsala & Walley, 1998; Walley et al., 2003), the vocabulary effect on phonetic categorization is expected to be most evident at the early stage of a vocabulary spurt. At the early stage of vocabulary development, substantial individual variability exists (Chen et al., 2017; Fenson et al., 1994). One toddler may already know hundreds of words while another knows only a dozen. If word knowledge indeed relates to phonetic categorization, then toddlers with a large vocabulary should be more capable of categorizing speech sounds than those with a small vocabulary.
To test the concurrent association between word knowledge and phonetic categorization, the current study tested Dutch 20-month-old toddlers on their neural discrimination of the acoustically similar native /I/-/i/ contrast embedded in CVC nonwords (i.e., giep [ɣIp] and gip [ɣip]), and examined whether vocabulary level relates to categorical discrimination of the nonwords. In Dutch, both /i/ and /I/ are front unrounded vowels, and they cannot be unequivocally distinguished by duration (Adank et al., 2004). At 20 months, toddlers are still not able to encode minimally different vowels fully accurately in either familiar or novel words (Mani et al., 2008; Nazzi, 2005). Thus, speaker normalization of these vowels was expected to be challenging for toddlers. Such immature and developing vowel representation, together with the quick vocabulary expansion at this age, should make the mutual influence between vowel categorization and word knowledge (if any) evident. In the current study, the infants’ vocabulary was measured by the Dutch version of the MacArthur-Bates Communicative Development Inventory (N-CDI; Zink & Lejaegere, 2002). If word knowledge facilitates vowel categorization, then, when speaker variation is introduced, toddlers with a large vocabulary should exhibit stronger neural discrimination of the nonwords than those with a small vocabulary.
Neural discrimination of the stimuli was tested with mismatch responses (MMRs). For adults, mismatch negativity (MMN; Peltola et al., 2003; Saloranta et al., 2020) is a component of the auditory event-related potential (ERP). MMN can be elicited using a passive oddball paradigm, in which listeners are presented with a stream of repeating ‘standard’ sounds conforming to a certain regularity, punctuated occasionally by ‘deviant’ sounds that are dissimilar from the standards in some relevant dimension. If the brain detects the change from standard to deviant, then on the difference waveform obtained by subtracting the response to the standard from that to the deviant, the MMN is visible as a negative peak between 100 and 300 ms from deviant onset in adults (Näätänen et al., 2007). In addition to physical differences between the standard and deviant, MMN can also be elicited by violation of abstract patterns (Xiao et al., 2018), and listeners are able to extract similarity within the variable standards and deviants and detect dissimilarity across the two types. Previous research has shown that for adults, while the brain is sensitive to speaker difference (Tuninetti et al., 2017), it is able to separate phonologically contrastive vowel categories in the face of speaker variability (Shestakova et al., 2002, 2004).
For infants and young children, unlike adults, the MMN does not yield consistent temporal and spatial characteristics. A commonly observed developmental pattern is a shift in mismatch response polarity: at an early stage of development, infants’ and children’s mismatch responses tend to exhibit a late positivity (positive mismatch response, p-MMR) rather than the adult MMN (He et al., 2009; Morr et al., 2002). As infants grow older, the polarity of the mismatch response shifts to negative, and gradually approximates the adult MMN (Bishop et al., 2011; Morr et al., 2002). The shift from p-MMR to MMN has been found to occur earlier for acoustically salient stimuli (i.e., stimuli with a larger physical difference) than for non-salient stimuli (Cheng et al., 2015; He et al., 2009; Lee et al., 2012). In addition, several studies have demonstrated co-occurring p-MMR and MMN among infants (Leppänen et al., 1997; Morr et al., 2002; Yu et al., 2019), and, thus, the p-MMR and MMN are likely to be generated by different neural processes.
Speech categorization continues to develop through adolescence (McMurray et al., 2018). For consonants, some studies found evidence for successful speaker normalization at the neural level in early infancy (Dehaene-Lambertz & Baillet, 1998; Dehaene-Lambertz & Pena, 2001; van Leeuwen et al., 2006). Yet, given the high degree of acoustic overlap between vowels, it remains inconclusive whether young children are able to normalize speaker variability when discriminating similar-sounding vowels. Furthermore, the neural signature underlying vowel categorization among infants is largely unknown. A number of previous studies used one single standard and one single deviant as stimuli to test neural discrimination of vowels, and the results are inconsistent with regard to MMR polarity and scalp distribution (Čeponien et al., 2003; Cheour-Luhtanen et al., 1995; Lee et al., 2012; Marklund et al., 2019; Shafer et al., 2012). For example, MMN (i.e., a negative mismatch response) was reported for native synthesized vowels among Finnish newborns (Cheour-Luhtanen et al., 1995) as well as Finnish preschool children (Čeponien et al., 2003).
English-learning infants, on the other hand, showed a p-MMR to the acoustically non-salient native /I/ and /ɛ/ contrast up to 30 months, after which an adult-like MMN started to emerge (Shafer et al., 2011, 2012). A further study demonstrated that among young children one early MMR (160-360 ms) and one late MMR (400-600 ms) co-occurred, and both shifted gradually from positive to negative as the infants grew older (Yu et al., 2019). For Swedish infants, the native /i/-/e/ contrast elicited a p-MMR among 4- to 8-month-olds (Marklund et al., 2019). For Mandarin infants, the mismatch response elicited by the large change from /a/ to /u/ (i.e., differing in both roundedness and backness) shifted from positive to negative at 6 months, while the mismatch response elicited by the small change from /a/ to /i/ (i.e., both are front and unrounded) stayed positive in the preschool years (Cheng et al., 2015; Lee et al., 2012). Moreover, Shafer et al. (2012) suggested that, besides maturity, attention may modulate the polarity of the mismatch responses among infants. Importantly, given that only one single token of each vowel was used as stimuli in these studies, whether and when the infant brain is able to discriminate between vowel categories remains unknown.
It should be noted, however, that 2- to 3-month-old infants showed a mismatch response to a non-native vowel contrast after being exposed for a few minutes to a bimodal distribution of 900 acoustically different tokens of these vowels (Wanrooij et al., 2014), and that by three months, the infant brain was able to encode consonants in terms of manner and place of articulation, which are elementary units robust to variability (Gennari et al., 2021). To sum up, the development of the vowel-elicited MMR seems to be language- and stimulus-specific, and although previous research hinted at early categorization of vowels at the neural level, direct experimental evidence is lacking.
To fill this gap, the current study tested 20-month-old toddlers’ phonetic categorization in two separate conditions. In the experimental (multiple-speaker) condition, both the nonwords giep [ɣIp] and gip [ɣip] were produced by 12 different speakers; thus, only if the infant brain was able to normalize speaker variation and regard the 12 different tokens of each nonword as the “same” would it be able to neurally discriminate the two nonwords. In the control (single-speaker) condition, one single token of each nonword produced by one speaker was used as stimuli, and neural discrimination would succeed if the toddlers could detect the acoustical difference between the two stimuli. Four hypotheses were proposed. First, if the toddlers were able to normalize speaker variability, a significant mismatch response (either positive or negative) should be observed in the multiple-speaker condition. Second, if variability overburdened vowel discrimination, the mismatch response should be attenuated in the multiple-speaker compared to the single-speaker condition. Third, if categorical perception of vowels was well established by 20 months, the mismatch response should exhibit an adult-like negative polarity. Fourth, if vocabulary level related to neural discrimination of the vowels, the toddlers with a large vocabulary should either exhibit a more evident MMR or differ from those with a small vocabulary in terms of MMR polarity.
Materials and methods
Participants
Fifty-nine 20-month-old infants participated in the current study. Six infants were excluded from analysis due to crying or excessive head movements. The remaining 53 participants (29 girls, age range 19 months 1 day to 20 months 23 days) had a mean (SD) age of 601 (13) days. All were full-term healthy infants from monolingual Dutch families, and no hearing or neurological disorders were reported by the parents. All the parents gave written consent for participating in the experiment. The experiment was approved by the Utrecht University Faculty Ethics Assessment Committee Humanities.
Materials
Twelve female native Dutch speakers (mean age = 21 years, SD = 2.5 years) were recruited to produce the stimuli. They were first visually familiarized with a list of printed CVC nonword minimal pairs, where the two nonwords in a pair were solely distinguished by the vowel (e.g., tos [tos] and toes [tus], nief [nif] and nif [nIf]). They could spend as much time as they needed to get familiar with these nonwords. The target nonwords for the current experiment, giep [ɣIp] and gip [ɣip], formed one of these pairs. The rest of the nonwords were for a word learning experiment which together with the current experiment formed parts of a larger project examining early phonology and word development. Next, the participants were asked to produce all the nonwords in carrier sentences ik zei niet X maar ik zei Y as well as ik zei niet Y maar ik zei X (meaning I did not say X but I said Y, and vice versa), where X and Y were one of the minimal pairs. They were told to speak the sentences in a way as if they were talking to a toddler. The speakers were recorded with a Sennheiser ME-64 microphone and a DAT recorder TASCAM DA-40 in a sound attenuated room.
For each speaker, one well-realized token of [ɣIp] and one token of [ɣip] were excised from the recording for further manipulation in PRAAT (Boersma & Weenink, 2011). All the tokens had a falling f0 contour. The duration of the tokens was manipulated to have a mean of 344 ms (SD = 9.5 ms, range 323-361 ms) and the intensity was scaled to 70 dB. These manipulated [ɣip]s and [ɣIp]s were used as stimuli in the current experiment. All twelve tokens of each nonword were presented in the multiple-speaker condition, while one token of each nonword from one speaker was presented in the single-speaker condition. Duration and mean f0 were measured for the vowel part (i.e., /i/ and /I/), and F1, F2 and F3 values were measured at the temporal midpoint of the steady part of the vowels. The [ɣIp] and [ɣip] used in the single-speaker condition had an F1 of 348 Hz and 485 Hz, an F2 of 2827 Hz and 2361 Hz, and an F3 of 3432 Hz and 3100 Hz respectively. The acoustical characteristics of the vowels in the stimuli were consistent with those reported in previous studies (Adank et al., 2004). Multiple native Dutch adult speakers listened to the stimuli and reported the stimuli to be natural, and all were able to identify all the stimuli as [ɣIp] or [ɣip] correctly. The acoustic measurements of the stimuli are listed in Table 1 in the supplementary materials. The oscillograms of the stimuli can be found in Figure 2 in the supplementary materials.
We used CVC nonwords rather than isolated vowels as stimuli because /i/ can occur as a reduced form of hij (meaning he) in colloquial Dutch, such as in the sentence wat doet ie (meaning what does he do), whereas /I/ alone can never be a word. Using nonwords as stimuli precluded lexical status from being a confounding factor. Although Dutch /i/ and /I/ may be considered to contrast in duration in addition to F1/F2, acoustical analysis has shown that duration does not distinguish these vowels sufficiently (Adank et al., 2004; Swingley, 2019). Dutch, however, does have long and short vowels, such as in maan [ma:n] (meaning moon) and man [mɑn] (meaning man), and both Dutch adults and infants were found to be sensitive to long vowels being mispronounced as short ones but not vice versa (Chládková et al., 2015; Dietrich et al., 2007). Therefore, to prevent duration from being a confounding factor, the vowels in [ɣIp] and [ɣip] were not manipulated to contrast in duration.
In both conditions, [ɣip] was the standard and [ɣIp] was the deviant. It should be acknowledged that, according to previous studies on vowel perception, a change from a less to a more peripheral vowel is easier to detect than the other way around (Kuhl, 1991; Polka & Bohn, 2011); hence, the assignment of standard and deviant may have an influence on the mismatch response. Yet it was not our purpose to investigate asymmetry in neural detection of the vowel change. The effect of word knowledge could be better interpreted with a consistent assignment of standard and deviant across the participants, since any relationship found between the MMR and vocabulary then cannot be due to the confounding factor of standard/deviant assignment.
The infants’ vocabularies were measured by the Dutch version of the MacArthur-Bates Communicative Development Inventory: Words and Sentences (N-CDI; Zink & Lejaegere, 2002), which includes a total of 702 words, divided into 22 different semantic categories.
Procedure
A passive oddball paradigm was adopted. Infants’ brain responses were recorded in two blocks: a multiple-speaker block followed by a single-speaker block. As the focus of the current study was phonetic categorization, the multiple-speaker block always preceded the single-speaker block.
Each block comprised 600 trials, of which 480 (80%) were standards and 120 (20%) deviants, and each trial was composed of a single sound token. Each block began with 10 trials of the standard, after which standards and deviants were presented continuously in a pseudo-random order with the constraint that deviants were separated by at least two standards. The inter-trial interval was randomly varied between 320 ms and 400 ms.
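The trial structure described above can be illustrated with a minimal sketch. This is not the presentation script used in the study; the function name, the gap-allocation strategy, and the uniform distribution of the inter-trial interval are assumptions made for illustration only.

```python
import random

def make_oddball_block(n_trials=600, n_deviants=120, n_lead=10, min_gap=2,
                       iti_ms=(320, 400), seed=None):
    """Return one block as a list of (trial_type, iti_ms) tuples.

    Hypothetical helper: the source only states the constraints, not how
    the pseudo-random order was generated.
    """
    rng = random.Random(seed)
    n_standards = n_trials - n_deviants
    # Standards left over after the lead-in run and the mandatory gaps.
    free = n_standards - n_lead - min_gap * (n_deviants - 1)
    if free < 0:
        raise ValueError("constraints cannot be satisfied")
    # gaps[0]: extra standards before the first deviant (after the lead-in);
    # gaps[1..n_deviants-1]: standards between consecutive deviants;
    # gaps[-1]: standards after the last deviant.
    gaps = [0] + [min_gap] * (n_deviants - 1) + [0]
    for _ in range(free):
        gaps[rng.randrange(len(gaps))] += 1
    sequence = ["standard"] * n_lead
    for i in range(n_deviants):
        sequence += ["standard"] * gaps[i] + ["deviant"]
    sequence += ["standard"] * gaps[-1]
    # Assumption: the inter-trial interval is drawn uniformly from the range.
    return [(kind, rng.uniform(*iti_ms)) for kind in sequence]
```

Allocating the leftover standards to gaps, rather than shuffling and rejecting invalid orders, guarantees that every generated block satisfies the spacing constraint on the first attempt.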
The EEG was recorded in a sound-attenuated room at Utrecht University. The infant participants sat on their caregivers’ laps during the experiment. Infant-friendly silent animated videos were played on the computer screen, and parents were instructed not to talk during the experiment. Toys were placed on the table in front of the infants, with which they could play if they wanted to. The distance between the participant’s eyes and the screen was ~1 m, and the experimental stimuli were presented at 70 dB SPL (measured at the position where the infant sat) through two audio speakers, one on each side of the screen. EEG was recorded with a Biosemi system from a 32-channel cap with Ag/AgCl electrodes placed according to the 10–20 International System of Electrode Placement, at a sampling rate of 2048 Hz. During online recording, the Biosemi system applied a 5th-order Bessel filter (for optimal pulse response), with the -3 dB frequency placed at 1/4 of the (fixed) sample frequency.
The parents filled in the N-CDI at home online. For each word, they were asked to indicate by mouse clicking whether their child “understands but does not produce yet”, “understands and produces”, or “does not understand and does not produce yet”. The raw scores were automatically generated with locally developed software. The N-CDI was filled in either before or after the experiment, and there was an average of eight days (SD = 9 days) between the experiment and the filling of the N-CDI.
EEG processing
The EEG data were analysed offline using the EEGLAB toolbox (version 13.1.1b in Matlab 2015b; Delorme & Makeig, 2004). The raw recordings were filtered between 0.3 and 20 Hz, and down-sampled to 250 Hz. The continuous recordings were re-referenced to the average of all electrodes and were segmented into 700 ms epochs from 100 ms before stimulus onset (baseline) to 600 ms after stimulus onset. Continuously bad channels were identified by visual inspection and spline-interpolated. Twenty-seven participants had channels interpolated, and on average 0.96 (SD = 0.88) channels were interpolated. Trials with an amplitude exceeding ±150 microvolts were removed. The standards immediately after a deviant were excluded from analysis. The remaining artefact-free trials were averaged to obtain the ERPs for each infant. Infants who had more than 25 artefact-free deviant trials were included in the final analysis, and all the infants met this criterion. Individual waveforms were averaged to obtain the grand averaged waveform. In the multiple-speaker condition, the mean (SD) numbers of accepted standard and deviant trials were 271 (34) and 91 (11) respectively, and in the single-speaker condition 265 (45) and 88 (15) respectively.
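The trial selection and averaging steps can be sketched as follows. The actual analysis was carried out in EEGLAB; this stand-alone Python illustration is not that pipeline, handles a single channel only, and uses a hypothetical function name and data layout.

```python
def average_erp(trials, threshold_uv=150.0):
    """Average accepted epochs per trial type for one channel.

    trials: list of (kind, samples) in presentation order, where kind is
    "standard" or "deviant" and samples is a list of amplitudes in microvolts.
    Returns (erps, counts). Hypothetical helper for illustration only.
    """
    kept = {"standard": [], "deviant": []}
    previous_kind = None
    for kind, samples in trials:
        follows_deviant = kind == "standard" and previous_kind == "deviant"
        previous_kind = kind  # track presentation order, even for dropped trials
        if follows_deviant:
            continue  # exclude the standard immediately after a deviant
        if max(abs(v) for v in samples) > threshold_uv:
            continue  # reject epochs exceeding +/-150 microvolts
        kept[kind].append(samples)
    erps = {}
    for kind, epochs in kept.items():
        # Point-by-point mean across the accepted epochs.
        erps[kind] = [sum(col) / len(col) for col in zip(*epochs)] if epochs else None
    counts = {kind: len(epochs) for kind, epochs in kept.items()}
    return erps, counts
```

With the returned counts, the inclusion criterion stated above corresponds to checking `counts["deviant"] > 25` for each infant.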
Statistical analysis
A two-step analysis was adopted. First, to test whether the standard and deviant ERPs differed significantly in any time window, point-by-point t tests were performed for each condition, with all the participants collapsed, on the individual participants’ standard and deviant ERPs at every time point between 200 and 600 ms after stimulus onset, for F3, Fz, and F4 separately. If, for at least one electrode, the standard and deviant ERPs differed significantly for at least six consecutive time points (i.e., 24 ms at the 250 Hz sampling rate), the difference was considered meaningful (Chen et al., 2018; Cheng et al., 2015). Subsequently, an MMR peak was identified on the grand average. Second, for amplitude measures, individual MMR peak latencies were identified in the 100 ms window (50 ms before and after) surrounding the grand average peak, and individual MMR peak amplitudes were calculated as the mean amplitude in the 40 ms window (20 ms before and after) surrounding the corresponding individual peak. The mean amplitudes of the standard and deviant ERPs were calculated for the same time window as for the MMR. As the MMR is most evident at frontal and central locations, the frontal (F3, Fz, F4) and central (C3, Cz, C4) electrodes were included in the analysis. To investigate whether the MMRs were more frontally or centrally distributed, a repeated measures ANOVA with type (standard or deviant), site (frontal or central electrodes), lateralization (left, middle, or right), and condition (multiple- or single-speaker) as within-subject variables was conducted on the individual peak amplitudes at the six electrodes. As the significant interaction between type and site confirmed the frontal distribution of the MMRs (see Results), further analyses were conducted with the MMR amplitudes at F3, Fz, and F4 to investigate the effect of condition.
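The first step, the point-by-point test with a six-consecutive-point criterion, can be sketched as below. Several details are assumptions rather than facts from the study: the t tests are taken to be paired (standard and deviant ERPs come from the same infants), significance is decided by comparing |t| with a critical value supplied by the caller (about 2.01 for df = 52 at a two-tailed alpha of .05, an alpha level the text does not state), and the function names are hypothetical.

```python
import math

def paired_t(a, b):
    """Paired t statistic for two equal-length lists (one value per participant).

    Raises ZeroDivisionError if all pairwise differences are identical.
    """
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((v - mean) ** 2 for v in d) / (n - 1)
    return mean / math.sqrt(var / n)

def significant_runs(deviant, standard, t_crit, min_run=6):
    """Find runs of consecutive time points where |t| exceeds t_crit.

    deviant, standard: lists over time points; each element is a list of
    amplitudes over participants at one electrode.
    Returns inclusive (start, end) index pairs for runs of >= min_run points.
    """
    flags = [abs(paired_t(d, s)) > t_crit for d, s in zip(deviant, standard)]
    runs, start = [], None
    # A trailing False closes any run that reaches the last time point.
    for i, flag in enumerate(flags + [False]):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_run:
                runs.append((start, i - 1))
            start = None
    return runs
```

At the 250 Hz sampling rate each index step is 4 ms, so the default `min_run=6` corresponds to the 24 ms criterion; returned indices are relative to the first time point passed in (200 ms after onset in the analysis described above).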
To investigate the effect of vocabulary size, the participants were median split into a high comprehension group (HC, N = 26) and a low comprehension group (LC, N = 27) based on the total comprehension score of the N-CDI (i.e., the number of words checked by parents as “understands but does not produce yet” or “understands and produces”). Seeing that the p-MMRs were most evident at the frontal sites (see analyses below), p-MMR amplitudes were examined with mixed-effect ANOVAs with lateralization (left F3, middle Fz, or right F4) and condition (multiple- or single-speaker) as the within-subject variables and vocabulary level (HC or LC) as the between-subject variable. The same analysis was also conducted with productive scores, which can be found in the supplementary materials.
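The scoring and grouping rule can be expressed compactly. The sketch below follows the rule given in the Results (scores above the median count as high comprehenders, the rest as low comprehenders); the function names, the response labels as Python strings, and the dictionary layout are illustrative assumptions.

```python
from statistics import median

# Parent responses that count towards the comprehension score.
COMPREHENDS = {"understands but does not produce yet", "understands and produces"}

def comprehension_score(responses):
    """Count the words a child comprehends, given one response label per word."""
    return sum(r in COMPREHENDS for r in responses)

def median_split(scores):
    """Map participant id -> 'HC' (score above the median) or 'LC' (otherwise)."""
    cut = median(scores.values())
    return {pid: ("HC" if s > cut else "LC") for pid, s in scores.items()}
```

With an odd number of participants, the child whose score equals the median falls into the LC group, which matches the unequal group sizes reported (26 HC, 27 LC out of 53).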
Results
All participants
Table 1 lists the time points where the standard and deviant ERPs differed significantly as shown by the point-by-point t tests. Figure 1 plots the standard ERPs, the deviant ERPs and the difference waves in the multiple- and single-speaker conditions averaged across all the participants.
As can be seen from Table 1 and Figure 1, a significant p-MMR was observed for each condition between 300 ms and 400 ms after the stimulus onset based on the point-by-point t tests, and these p-MMRs were most visible at F3. Therefore, grand average p-MMR peaks were identified between 300 ms and 400 ms at F3 for each condition, and the grand average had a peak latency of 336 ms and 356 ms in the multiple- and single-speaker condition, respectively. When individual peak latencies were averaged, the p-MMRs had a mean (SD) peak latency of 332 ms (25 ms) in the multiple-speaker condition and 359 ms (27 ms) in the single-speaker condition, and the latency difference between the two conditions was significant, t(52) = 5.28, p < .001, partial η2 = .35. Table 2 in the supplementary materials lists the mean peak amplitudes at F3, Fz, F4, C3, Cz, and C4 for the multiple- and single-speaker condition. Figure 2 plots the p-MMR topography at the corresponding grand average peak latency in the multiple- and single-speaker condition.
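As a quick arithmetic check on the latency comparison, the reported effect size is consistent with the standard conversion from a t statistic to partial η2 for a single-degree-of-freedom effect, t² / (t² + df). Whether the authors computed it this way is an assumption; the helper name is hypothetical.

```python
def partial_eta_squared_from_t(t, df):
    """Convert a t statistic and its degrees of freedom to partial eta squared."""
    return t * t / (t * t + df)

# Values reported for the latency difference: t(52) = 5.28
effect = partial_eta_squared_from_t(5.28, 52)
print(round(effect, 2))  # 0.35
```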
A repeated measures ANOVA with type (standard or deviant), site (frontal or central electrodes), lateralization (left, middle, or right), and condition (multiple- or single-speaker) as the within-subject variables found significant main effects of site, F(1, 52) = 46.93, p < .001, partial η2 = .47, and lateralization, F(2, 104) = 13.89, p < .001, partial η2 = .21, showing that the ERPs were larger at frontal than at central electrodes, and that they were more positive at the left than at the right electrodes. Neither the main effect of condition, F(1, 52) = 1.48, p = .23, partial η2 = .03, nor that of type, F(1, 52) = 1.85, p = .18, partial η2 = .03, was significant. The interaction between condition and site was significant, F(1, 52) = 13.80, p < .001, partial η2 = .21, and so was the interaction between type and site, F(1, 52) = 20.39, p < .001, partial η2 = .28. Together with Figure 1, it can be seen that the differences between standard and deviant ERPs were most evident at frontal electrodes and much attenuated at the central electrodes, and that overall the ERPs were left lateralized. Therefore, further analysis was conducted with the MMR amplitudes at F3, Fz, and F4. There was a significant main effect of lateralization, F(2, 104) = 5.57, p = .005, partial η2 = .10, while the main effect of condition, F(1, 52) = 0.61, p = .44, partial η2 = .01, and the interaction between the two factors, F(2, 104) = 0.55, p = .58, partial η2 = .01, were not significant. Post-hoc Bonferroni-corrected pairwise comparisons showed that the p-MMR at F3 was significantly larger than that at Fz and F4 (p < .05).
These results showed that in both the single- and the multiple-speaker condition, frontally distributed p-MMRs were elicited, and they had comparable peak amplitudes across the conditions. With regard to latency, the p-MMR had an earlier peak latency in the multiple-speaker than in the single-speaker condition.
Comprehension vocabulary groups
There are in total 702 words listed in the N-CDI. The participants had a mean comprehension raw score of 318 (SD = 126, range 50-571) and a mean production raw score of 133 (SD = 133, range 3-408). A median (270) split was applied to the comprehension raw scores of the N-CDI (i.e., the total number of words checked by parents as “understands but does not produce yet” and “understands and produces”), and those infants who had a score higher than the median were considered high comprehenders (HC, N = 26) and the rest low comprehenders (LC, N = 27). The HC group had a mean comprehension score of 419 (SD = 84) and a mean production score of 188 (SD = 125), while those for the LC group were 219 (SD = 69) and 80 (SD = 63), respectively. The same median split was also applied to the production raw scores. The high producers produced a mean of 221 words (SD = 95), and the low producers produced a mean of 48 words (SD = 30).
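The grouping rule described above (scores strictly higher than the median form the high group, all remaining scores form the low group) can be sketched as follows. The scores here are made-up placeholder values chosen only to give 53 distinct numbers; they are not the participants' data, and the function name is hypothetical.

```python
from statistics import median


def median_split(scores):
    """Split scores into (high, low): high is strictly above the median."""
    cutoff = median(scores)
    high = [s for s in scores if s > cutoff]
    low = [s for s in scores if s <= cutoff]
    return cutoff, high, low


# Placeholder data: 53 distinct, evenly spaced scores (NOT the study's data).
placeholder_scores = list(range(50, 50 + 53 * 10, 10))

cutoff, high_group, low_group = median_split(placeholder_scores)
print(cutoff, len(high_group), len(low_group))
```

With an odd number of participants and no tied scores at the cutoff, the participant whose score equals the median falls into the low group, which yields the unequal group sizes of 26 and 27 reported for the 53 participants.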
Table 2 lists the time windows where the standard and deviant ERPs differed significantly and the mean of individual p-MMR peak latencies for the HC and LC group separately. As can be seen, point-by-point t tests identified p-MMRs for both groups in both conditions. As the difference between standard and deviant ERPs was most frequently observed at F3 (except for HC in the single-speaker condition), grand average p-MMR peak latencies were identified between 300 ms and 400 ms after the stimulus onset at F3. The LC group had a grand average peak latency of 336 ms and 356 ms and the HC group 340 ms and 364 ms in the multiple- and single-speaker condition, respectively. Table 3a and Table 3b in the supplementary materials list the individual amplitudes of the standard and deviant ERPs and of the p-MMR. Figure 3 plots the difference waves of the HC and LC groups in the multiple- and single-speaker condition. The standard and deviant ERPs of the two groups can be found in the supplementary materials. Figure 4 plots the p-MMR topography of the high and low comprehenders at the corresponding grand average peak latencies in the multiple- and single-speaker condition.
For individual p-MMR peak amplitudes (i.e., the mean amplitude in the 40 ms window surrounding individual p-MMR peaks), electrode (F3, Fz, F4) × condition (single- or multiple-speaker) × comprehension group (HC, LC) mixed effect ANOVAs found no significant main effect of condition, F(1, 51) = .58, p = .45, partial η2 = .01, or vocabulary group, F(1, 51) = 1.83, p = .18, partial η2 = .04. The main effect of electrode was significant, F(2, 102) = 5.95, p = .004, partial η2 = .11. Bonferroni corrected post hoc pairwise comparisons showed that overall, F3 had a larger amplitude than Fz and F4. The interaction between electrode and vocabulary level was significant, F(2, 102) = 6.78, p = .002, partial η2 = .12. None of the other interactions was significant. Bonferroni corrected post hoc analysis found that for the LC group, the p-MMR had a larger amplitude at F3 than at Fz and F4, while there was no significant difference between the electrodes for the HC group.
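The peak amplitude measure defined above (the mean amplitude in the 40 ms window surrounding an individual p-MMR peak) can be illustrated with a minimal sketch. Several details are assumptions made for illustration rather than details taken from the study: the peak is taken as the most positive point of the deviant-minus-standard difference wave between 300 ms and 400 ms, the 40 ms window is centred on the peak (20 ms on each side), the waveform is a synthetic bump sampled every 2 ms, and the function name is hypothetical.

```python
import math


def peak_latency_and_amplitude(times_ms, diff_wave,
                               search=(300, 400), half_window_ms=20):
    """Return (peak latency, mean amplitude around the peak).

    The peak is the most positive sample within the search window; the
    amplitude is the mean of all samples within half_window_ms of the peak.
    """
    candidates = [i for i, t in enumerate(times_ms)
                  if search[0] <= t <= search[1]]
    peak_index = max(candidates, key=lambda i: diff_wave[i])
    peak_time = times_ms[peak_index]
    window = [v for t, v in zip(times_ms, diff_wave)
              if abs(t - peak_time) <= half_window_ms]
    return peak_time, sum(window) / len(window)


# Synthetic difference wave (NOT real data): a positive bump centred at 340 ms.
times = list(range(-100, 600, 2))  # assumed 2 ms sampling step
wave = [3.0 * math.exp(-0.5 * ((t - 340) / 40) ** 2) for t in times]

latency, amplitude = peak_latency_and_amplitude(times, wave)
print(latency, round(amplitude, 2))
```

Averaging over a window rather than reading off the single maximum makes the amplitude estimate less sensitive to sample-level noise, which is why the windowed mean is lower than the maximum of the synthetic bump.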
With regard to peak latencies, condition showed a significant main effect, F(1, 51) = 39.51, p < .001, partial η2 = 0.44, and the main effect of vocabulary level was not significant, F(1, 51) = 2.46, p = .12, partial η2 = 0.05. The interaction between the two was not significant, F(1, 51) = 0.24, p = .72, partial η2 = 0.01. Similar to the analysis with all participants, both the HC and LC groups had an earlier peak latency in the multiple- than in the single-speaker condition.
Figure 3 together with the statistical analyses showed that the p-MMR amplitudes were comparable between the high and low comprehenders, and both groups had an earlier p-MMR peak latency in the multiple- than in the single-speaker condition. Nevertheless, the two groups showed different p-MMR scalp distributions: while the p-MMR was symmetrically distributed for the HC group, it was left lateralized for the LC group.
General Discussion
In the current study, we investigated 20-month-old infants’ neural detection of vowel change with and without the presence of speaker variability. One token of each nonword [ɣip] and [ɣIp] from each speaker constituted the multiple-speaker stimuli, and in the single-speaker condition, one token of each nonword from the same speaker was used as stimuli. For both conditions, the [ɣIp] occasionally embedded in the stream of [ɣip] elicited a significant positive mismatch response (p-MMR), with the peak latency being 336 ms (significant window: 292-404 ms) and 356 ms (significant window: 336-404 ms) for the multiple- and single-speaker condition, respectively. The presence of speaker variability did not attenuate the p-MMR, as no significant p-MMR amplitude difference was found across the conditions. To investigate how vocabulary influenced categorization, the infants were median split into high and low comprehenders/producers based on their raw scores on the Dutch version of the CDI (words and sentences). In terms of p-MMR peak amplitude, the high and low groups were comparable for both the single- and multiple-speaker conditions. Group specific patterns, however, were also observed. First, the low group showed a left-lateralized distribution while the high group showed a symmetrical scalp distribution of the p-MMR for both conditions. Second, in the multiple-speaker condition, the high group showed a more sustained p-MMR than the low group.
One crucial finding of the current study is that infants as young as 20 months were already able to neurally discriminate the acoustically similar vowels in the face of speaker variability. The p-MMR in the multiple-speaker condition was not attenuated compared to the single-speaker condition, where the brain only had to detect the acoustical difference. Furthermore, the p-MMR amplitude was not influenced by vocabulary. Previous studies have shown neural discrimination of native vowels in early infancy (Čeponien et al., Reference Čeponien, Lepistö, Alku, Aro and Näätänen2003; Cheour-Luhtanen et al., Reference Cheour-Luhtanen, Alho, Kujala, Sainio, Reinikainen, Renlund, Aaltonen, Eerola and Näätänen1995; Lee et al., Reference Lee, Yen, Yeh, Lin, Cheng, Tzeng and Wu2012; Marklund et al., Reference Marklund, Schwarz and Lacerda2019; Shafer et al., Reference Shafer, Yu and Garrido-Nag2012), yet because these studies used only a single token of each category as stimuli, it remains unclear whether infants were responding to the stimuli acoustically or categorically. By including the multiple-speaker condition, the current study demonstrated that at 20 months, children have already acquired perceptual constancy of similar sounding vowels at the neural level, and that the infant brain was able to disregard speaker information when discriminating the vowels. Thus, the abstract representation of native phonetic categories becomes well established early in life.
At this age, however, the mismatch response still showed a positive polarity, which was consistent with previous studies (Lee et al., Reference Lee, Yen, Yeh, Lin, Cheng, Tzeng and Wu2012; Shafer et al., Reference Shafer, Yu and Datta2010), and a shift to adult-like MMN is expected to occur at a later age. Some previous studies have shown that among infants and toddlers, the p-MMR can co-occur with a negativity at a late time window, yet whether such a late negativity reflects reorientation of attention or is equivalent to adult MMN remains debated (Kushnerenko et al., Reference Kushnerenko, van den Bergh and Winkler2013; Shafer et al., Reference Shafer, Yu and Datta2010; Yu et al., Reference Yu, Tessel, Han, Campanelli, Vidal, Gerometta, Garrido-Nag, Datta and Shafer2019). In the current study, no significant late negativity was observed, except for the low comprehenders in the single-speaker condition. Since we only tested one age group, it is difficult to ascertain whether the late negativity will emerge at a later age, and whether the p-MMR latency and amplitude will change with age (Shafer et al., Reference Shafer, Morr, Kreuzer and Kurtzberg2000, Reference Shafer, Yu and Datta2010). Seeing that the current study made use of different stimuli, SOA, and procedure compared to the previous studies, no conclusion could be drawn with regard to the mechanisms underlying the emerging late negativity. It is important to compare the mismatch responses elicited by different stimuli at the same age as well as those elicited by the same stimuli but at different ages. Another developmental pattern that needs further investigation in the future is the scalp distribution of the p-MMR.
In the current study, as can be seen on the topographic maps, the p-MMRs were largest at lateral sites, while previous studies often reported a frontal distribution of the MMRs (Morr et al., Reference Morr, Shafer, Kreuzer and Kurtzberg2002; Shafer et al., Reference Shafer, Yu and Datta2011; Yu et al., Reference Yu, Tessel, Han, Campanelli, Vidal, Gerometta, Garrido-Nag, Datta and Shafer2019). Other studies found that while the p-MMR was left lateralized, the MMN emerged from frontal central sites (Shafer et al., Reference Shafer, Yu and Datta2010). It might be that for the toddlers tested in the current study, the MMN was emerging, and the overlap between MMN and p-MMR created the lateral distribution. But again, the developmental change of MMR topography needs to be investigated by including more age groups.
As the vowels were presented in nonwords, the successful neural discrimination cannot be attributed to familiarity with the stimuli, nor to semantic knowledge. In addition, as the multiple-speaker condition always preceded the single-speaker condition, the infants had no chance to first establish targets from invariant stimuli in the single-speaker condition and then map the variable tokens to the targets. Speaker variation has been found to facilitate word learning among infants (Rost & McMurray, Reference Rost and McMurray2010), and it has been hypothesized that variation along linguistically irrelevant dimensions helps infants identify the phonologically relevant acoustical dimensions and consequently facilitates sound-meaning pairing. The current study, however, did not teach infants new words but tested what they had already learned by the time of the experiment. Evidently, these similar sounding nonwords had already been well contrasted in the infant brain by the time of the experiment. Therefore, the p-MMR in the multiple-speaker condition likely reflects existing categorical representation of the vowels.
Inconsistent with our hypothesis, the results do not support the enhancement effect of vocabulary on categorical discrimination of the vowels, given that the HC and LC groups’ p-MMRs were comparable in terms of both amplitude and peak latency. Seeing that both the lexical restructuring model (LRM; Metsala & Walley, Reference Metsala, Walley and Ehri1998) and the lexical-distribution models (Feldman, Myers, et al., Reference Feldman, Myers, White, Griffiths and Morgan2013; Swingley, Reference Swingley2009) argue that word knowledge facilitates phonetic learning, as words may provide specific contextual cues for discovering and disambiguating the phonetic categories, we speculate that the mutual influence between word learning and phonetic categorization might be more evident at an earlier age. The infants in the current study understood 318 words on average, and the number of words known by the LC group might have been sufficient to support discriminating the similar sounding vowels. In particular, knowing word forms without meaning might have sufficiently facilitated learning of phonetic categories (Carbajal et al., Reference Carbajal, Peperkamp and Tsuji2021). Whether it is word forms or word meanings that are associated with phonetic categories needs further investigation.
Nevertheless, group specific characteristics were observed. First, the high group showed a more symmetrical scalp distribution of the p-MMR across the two conditions, while for the low group, the p-MMR scalp distribution was left-lateralized. Second, the high group showed a more sustained p-MMR than the low group in the multiple-speaker condition. MMR has been found to last longer for more salient than less salient phonemic contrasts (Cheng et al., Reference Cheng, Wu, Tzeng, Yang, Zhao and Lee2015; Lee et al., Reference Lee, Yen, Yeh, Lin, Cheng, Tzeng and Wu2012), implying that for the same contrast, a longer lasting MMR of the high group might reflect a stronger response to the contrast. Our results seem to suggest that a large vocabulary might be helpful when infants discriminate between variable vowel categories. Interestingly, although different cortical lateralization has been found for vowel versus speaker processing, as well as for within- versus cross-category phoneme discrimination (Maurer et al., Reference Maurer, Bucher, Brem and Brandeis2003; Poeppel et al., Reference Poeppel, Guillemin, Thompson, Fritz, Bavelier and Braun2004; Shestakova et al., Reference Shestakova, Brattico, Soloviev, Klucharev and Huotilainen2004; Sittiprapaporn et al., Reference Sittiprapaporn, Tervaniemi, Chindaduangratn and Kotchabhakdi2005; Xi et al., Reference Xi, Zhang, Shu, Zhang and Li2010), in the current study, for neither group was the MMR scalp distribution different across conditions. Instead, regardless of whether speaker variation was present, the p-MMR differed across vocabulary levels. Therefore, it seems that vocabulary had a general effect, although not necessarily a facilitative one, on speech sound discrimination, independent of the presence of variability.
Several previous studies have shown that speech perception in the first year of life, operationalized mainly as discrimination accuracy of single exemplars of different native phonetic categories, correlated with later vocabulary ability (Garcia-Sierra et al., Reference Garcia-Sierra, Rivera-Gaxiola, Percaccio, Conboy, Romo, Klarman, Ortiz and Kuhl2011; Kuhl et al., Reference Kuhl, Conboy, Coffey-Corina, Padden, Rivera-Gaxiola and Nelson2008; Singh, Reference Singh2019; Tsao et al., Reference Tsao, Liu and Kuhl2004). The current study found that, concurrently, different neural networks might be at play in the categorical discrimination of vowels for toddlers with different vocabulary levels. For now, it is hard to ascertain whether it is age or vocabulary that serves as the main driving force for the co-occurring development in word learning and phonological categorization between 18 and 24 months. Biological maturation, accumulative language input, development of cognitive skills, or a combination of these may all play a role (Werker et al., Reference Werker, Fennell, Corcoran and Stager2002; Werker & Curtin, Reference Werker and Curtin2005), and it is worth the effort for future studies to focus on this period and identify the crucial factors bolstering phonology and vocabulary development.
It should be acknowledged that two alternative explanations of the findings cannot be ruled out at this moment. One possibility is that the toddlers responded to the acoustic distance between the standards and the deviants rather than to vowel categories. The acoustic distance might be smaller among different tokens of the same vowel type than between tokens of different vowel types, and the toddler brain might have made use of such acoustic cues in discrimination. It would be informative if future studies could manipulate the stimuli in a way that the acoustic distance between variable tokens is equivalent both within and across the standard and deviant types. Second, unlike adult MMN, which is widely agreed to indicate discrimination (Bartha-Doering et al., Reference Bartha-Doering, Deuster, Giordano, am Zehnhoff-Dinnesen and Dobel2015; Fu & Monahan, Reference Fu and Monahan2021; Hisagi et al., Reference Hisagi, Shafer, Strange and Sussman2010, Reference Hisagi, Garrido-Nag, Datta and Shafer2015; Trainor et al., Reference Trainor, McFadden, Hodgson, Darragh, Barlow, Matsos and Sonnadara2003), infant p-MMR may simply reflect recovery from refractoriness. To eliminate the potentially different obligatory responses elicited by different sounds, it is important for future studies to include equiprobable presentation of the standard and deviant stimuli or to flip the standard and deviant for comparison. In addition, seeing that the current study tested normalization of inter-speaker variability (one token from each of several speakers), how toddlers normalize intra-speaker variability is unknown. It would be informative for future studies to investigate how vowel categorization may differ when different types of variation are present.
To conclude, children as young as 20 months were able to neurally discriminate similar sounding vowels successfully, regardless of whether speaker variability was incorporated and regardless of whether they had a small or large vocabulary. Nevertheless, the brain seemed to be activated differently as a result of vocabulary, leading to different scalp distributions of the mismatch responses.
Acknowledgements
This study was funded by the Dutch Research Council with the project number 275-89-034 and supported by The National Social Science Fund of China with the project number 17CYY041, and by Science Foundation of Beijing Language and Culture University (the Fundamental Research Funds for the Central Universities, Approval Number 21PT01). I thank Lisanne Geurts and Charlotte Koevoets for their help with data collection. I thank Maartje de Klerk, Frank Wijnen, Likan Zhan, Ruohan Chang, Vargehese Peter and the Babylab in Utrecht Institute of Linguistics for their very insightful discussions. I thank all the parents and babies who took part in the study.
Supplementary material
The supplementary material for this article can be found at http://doi.org/10.1017/S0305000923000351.
Competing interest
The author(s) declare none.