1 Introduction
While there have been many studies in the last 30 years on the acoustic (Evers, Reetz & Lahiri 1998, Jongman, Wayland & Wong 2000, Nowak 2006, Shadle 2006, Cheon & Anderson 2008, Maniwa, Jongman & Wade 2009), perceptual (McGuire 2007, Cheon & Anderson 2008, Li et al. 2011) and articulatory characteristics of sibilants (Narayanan, Alwan & Haker 1995), the large majority of these have focused on the two-way distinction between alveolar /s/ and post-alveolar /ʃ/. Here we are concerned with the comparatively rarer three-way place contrast in sibilants in Polish. Apart from Swedish and Mandarin Chinese, Standard Polish is one of the very few languages that distinguish lexically between dental /s/ (e.g. sali /sali/ ‘room, gen’), retroflex /ʂ/ (e.g. szali /ʂali/ ‘scale, gen’), and alveolopalatal /ɕ/ (e.g. siali /ɕali/ ‘sown’) sibilants (Gussmann 2007, Żygis, Pape & Jesus 2012a).
In recent years, these three sibilants have been analysed physiologically for Polish in Toda, Maeda & Honda (2010), for Mandarin in Proctor et al. (2012), and in both these languages by Hu (2008). These studies have shown that the three sibilants differ articulatorily not only in tongue position, but also in tongue posture. The fricatives are also distinguished from each other by two other tongue shape properties. Firstly, whereas in /ʂ s/ the vertical orientation of the tongue tip is typically upward-facing, it is downward-facing for /ɕ/. Secondly, while the tongue tip tends to be curled back to a greater extent for /ʂ/ than for /s/, the degree to which it is retracted has been shown to be somewhat less in Polish and Mandarin than in Indian languages (Hamann 2002a, b; Hu 2008; Toda et al. 2010): for these reasons, there is a greater resemblance in tongue shape between /s ʂ/ in Polish than in Indian languages.
There is some evidence from the physiological analysis of four Polish L1 speakers in Bukmaier et al. (2014) for greater variability in /ʂ/ than in the other sibilants. At a slow speech rate, /s ʂ/ were clearly differentiated in tongue-tip orientation such that /ʂ/ was a sub-laminal production in which the underside of the tongue tip/blade made contact with the place of articulation. However, at a fast speech rate, these orientation differences were much less in evidence such that /ʂ/ resembled /s/ in being supra- rather than sub-laminal. Hu's (2008) physiological analysis of Mandarin Chinese also pointed to a greater articulatory variability in /ʂ/ than in the other two fricatives.
As far as the acoustics are concerned, many studies in the last 50 years have shown that the place of articulation distinction between English /s ʃ/ can be based to a large extent on the spectral characteristics of the fricative noise (Whalen 1991, Shadle & Mair 1996, Evers et al. 1998, Stevens 1998, Jongman et al. 2000, Shadle 2012): more specifically, the shorter front cavity in /s/ causes the energy in the spectrum to be shifted towards higher frequencies, so that both acoustically and perceptually (Fujisaki & Kunisaki 1977, Mann & Repp 1980), a higher spectral centre of gravity (Forrest et al. 1988) differentiates it from /ʃ/. For the three-way place contrast in Polish, these spectral characteristics in the noise can separate /s/ from the other sibilants (Żygis et al. 2014a, b), but as various studies (Jassem 1995, Żygis & Hamann 2003, Nowak 2006) have shown, the centre of gravity in the noise by itself is generally insufficient for the /ʂ ɕ/ separation.
The issue of whether formant transitions contribute to the acoustic and perceptual distinction of place of articulation within fricatives is still unresolved. Some of the first studies to address this issue (Harris 1958, Heinz & Stevens 1961) showed that formant transitions were not necessary for the distinction between sibilants but that they were for the non-sibilant /f θ/ separation in English. On the other hand, although acoustic studies showed evidence of formant transitions extending well into the fricative noise (Soli 1981), subsequent research suggested that vowel transitions were less important for the perceptual distinction between places of articulation in sibilant and non-sibilant fricatives (LaRiviere, Winitz & Herriman 1975, Jongman 1989). However, most of these studies were based on languages with only two sibilant fricatives. By contrast, a more recent cross-linguistic investigation by Wagner, Ernestus & Cutler (2006) showed that the effectiveness of formant transition cues was language-dependent: more specifically, listeners were shown to rely on formant transitions to a greater extent in languages like Polish, which have fricatives such as /ʂ ɕ/ that are largely undifferentiated in the fricative noise. These results were consistent with those by Nowak (2006) in which L1 Polish listeners identified Polish sibilants from isolated sections of friction noise and in VCV sequences with the transitions into the following vowel removed. Nowak's (2006) results showed that, while fricatives could be reliably identified from the noise section, formant transitions were essential for the separation of /ʂ ɕ/ in VCV sequences. Compatibly, Toda et al. (2010) showed how the quite different tongue shapes for /ʂ ɕ/ contributed to the differences between these sibilants in vowel formant transitions.
Studies of the acquisition of Polish sibilants have shown that children acquire /ʂ/ relatively late and typically after the other sibilants have been acquired (Łukaszewicz 2006, 2007). The articulatory instability in /ʂ/ and the findings from language acquisition might also be related to the diachronic change of the three-way /s ʂ ɕ/ to a two-way distinction as a result of an /s ʂ/ merger in both the Min variety of Mandarin (Duanmu 2006, Chuang & Fon 2010) and in several Polish dialects (Żygis, Pape & Czaplicki 2012b). One of the main motivations for the present study was to investigate the synchronic basis for the diachronic collapse of the /s ʂ/ contrast towards /s/. The more specific aims were to analyse both the physiological and acoustic characteristics of these three fricatives in order to assess whether the identification of /ʂ/ is disadvantaged in comparison with the other two fricatives. In order to do so, we carried out an electromagnetic articulographic (henceforth EMA) study of nine Polish L1 speakers producing these sibilants and assessed the acoustic distinctiveness of the three fricatives from each other in both the noise and transitions.
2 Method
2.1 Data collection and speakers
Acoustic and speech movement data were acquired using electromagnetic articulography in a soundproof booth at the IPS in Munich (AG501, Carstens Medizinelektronik) in order to obtain measurements of the horizontal, vertical, and lateral position of the articulators. For the EMA recordings, two sensors were placed on the tongue (Figure 1): one on the midline 1 cm behind the tip of the tongue (TT) and the other on a level with the molar teeth at the tongue back (TB). Additionally, two sensors were placed on the upper and lower lip, i.e. on the skin just above and below the lips. Four additional sensors were fixed to the maxilla (to the tissue just above the teeth), the nose bridge, as well as to the left and right mastoid bones in order to correct for head movement. For the present study, only the data from the sensor attached to the tongue tip were analysed. The acoustic speech signal was recorded synchronously with the physiological data using a Sennheiser ME66 supercardioid microphone with the bass roll-off filter turned on (−6 dB at 200 Hz), positioned at a distance of approximately one metre in front of the subject. Audio data was recorded with a National Instruments Compact DAQ multichannel data acquisition front-end, with a USB connection to a notebook computer. Synchronization of the audio and speech movement signal was carried out in the post-processing of the data after the recording session (see Hoole & Zierdt 2010 for further details of the post-processing of acoustic and articulatory data).
The subjects in this experiment were nine L1 Standard Polish speaking adults spanning an age range between 19 and 28 years and included four male and five female speakers. Six speakers were born and went to school in dialectal regions with a three-way sibilant contrast (two each from Silesia, Lesser Poland and Greater Poland). The remaining three speakers were born and lived most of their lives (i.e. went to school) in dialect regions in which the alveolar/retroflex contrast is neutralized (two from Mazovia and one from Kashubia). These three subjects were nevertheless included in our analysis because they were judged by an L1-Polish speaker with linguistics training to be speakers of Standard Polish with no perceptible regional colouring. None of the participants had lived outside of Poland for more than two years at the time of recording.
2.2 Speech material and experimental set-up
The participants produced symmetrical CVCV (e.g. /sɛsɛ/) non-words (in which C=/s ʂ ɕ/ and V=/a ɛ ɔ/) as well as Polish disyllabic real words (Table 1) with initial CV sequences (in which C=/s ʂ ɕ/ and V=/a ɛ ɔ/). All target words were embedded in the carrier phrase ‘Ania woła [TARGET WORD] aktualnie’ (literally ‘Ania shouts [TARGET WORD] currently’), where the target word was produced with a nuclear pitch accent. The participants read the sentences aloud as they were automatically presented to them on a computer screen one at a time in randomized order. In cases of mispronunciation or incorrect prosody, the participants were asked to repeat the sentence.
The recording session consisted of ten blocks alternating between slow and fast speech rates. In order to define individual speech rates as well as to adjust the corresponding recording time, participants were asked to read examples of the speech material at a self-selected fast and slow speech rate in a pretest prior to the actual recording. The display incorporated a progress bar linked to the desired speech rate that was defined for each speaker and condition based on the mean durations of the pre-recording and that indicated the time frame for each token. For each speech rate, each of the 22 target words containing nine non-words and 13 real words (Table 1) was repeated ten times in randomized order. Some word initial CV sequences occur more often in the onsets of Polish disyllabic real words e.g. /sa/-, /ɕa/- and /ʂa/-word onsets, as a result of which there were more (near) minimal pairs for these CV sequences (see Table 1, row 1: /sara/, /sama/, /sava/; row 4: /ʂari/, /ʂafa/; row 7: /ɕatka/, /ɕanɔ/). Because of this skewed distribution of CV sequences, the materials for this study included between two (/ɕa/ and /ʂa/ onsets) and three (/sa/ onsets) target words, with other, rarer sequences only being represented with one target word (e.g. /sɛ sɔ ʂɛ ʂɔ ɕɛ ɕɔ/). Table 1 contains the complete distribution of CV sequences.
The experiment contained 3960 (22 target words × 10 repetitions × 2 speech rates × 9 speakers) sentences, of which 3895 sentences were analysed in this study. The data loss of 65 tokens was due to technical problems during the recording session and post-processing. The total number of analysed sibilant–vowel combinations for both real and non-words is given in Table 2.
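The token count can be checked with a few lines of arithmetic. The sketch below uses only the design figures stated in the text; the factor of two speech rates is implied by "10 repetitions per speech rate", and the variable names are illustrative.

```python
# Design figures taken from the text of Section 2.2.
n_target_words = 22   # 9 non-words + 13 real words
n_repetitions = 10    # repetitions of each word per speech rate
n_rates = 2           # slow and fast
n_speakers = 9
n_lost = 65           # tokens lost to technical problems

n_recorded = n_target_words * n_repetitions * n_rates * n_speakers
n_analysed = n_recorded - n_lost
print(n_recorded, n_analysed)
```

Both totals reported in the text follow from this product only when both speech rates are counted.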
2.3 Data analysis
2.3.1 Physiological analysis
The post-processing of the raw physiological data was done semi-automatically in MATLAB (MathWorks, R2012a), including rotation of the data so that they were parallel to the occlusal plane (Hoole & Zierdt 2010).
The articulatory annotation of the three sibilants was based on the vertical movement of the tongue tip (TT) and the TT tangential velocity. Physiological labels included seven different landmarks (Figure 2). Typically, a complete CVC movement cycle was divided into a CV or opening phase, a nucleus, or quasi target phase, and a VC or closing phase. Onsets and offsets of opening and closing gestures were determined by using a 20% threshold criterion of the tangential velocity signal (Hoole & Mooshammer 2002, Hoole et al. 2009). The vowel nucleus was then defined as the interval between CV offset and VC onset.
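A 20% velocity threshold criterion can be illustrated roughly as follows. This is a minimal sketch, not the labelling procedure actually used: it assumes that the gesture onset and offset are the outermost samples around a velocity peak at which the tangential velocity is still at least 20% of that peak. The function names, the 250 Hz sampling rate and the synthetic tongue-tip trajectory are all hypothetical.

```python
import math

def tangential_velocity(x, y, fs):
    """Sensor speed in the x-y plane from central differences (units per second)."""
    vel = []
    for i in range(len(x)):
        lo, hi = max(i - 1, 0), min(i + 1, len(x) - 1)
        dt = (hi - lo) / fs
        vel.append(math.hypot(x[hi] - x[lo], y[hi] - y[lo]) / dt)
    return vel

def gesture_bounds(vel, threshold=0.2):
    """Return (onset, peak, offset) sample indices around the velocity peak.

    Assumption: onset/offset are the outermost samples contiguous with the
    peak whose velocity is at least `threshold` times the peak velocity.
    """
    peak = max(range(len(vel)), key=vel.__getitem__)
    limit = threshold * vel[peak]
    onset = peak
    while onset > 0 and vel[onset - 1] >= limit:
        onset -= 1
    offset = peak
    while offset < len(vel) - 1 and vel[offset + 1] >= limit:
        offset += 1
    return onset, peak, offset

# Synthetic raising movement: the sensor rises smoothly by 10 mm
# between samples 20 and 60 and is stationary elsewhere.
fs = 250.0  # hypothetical sampling rate in Hz
x = [0.0] * 80
y = [0.0 if i < 20 else 10.0 if i > 60
     else 5.0 * (1.0 - math.cos(math.pi * (i - 20) / 40)) for i in range(80)]

vel = tangential_velocity(x, y, fs)
onset, peak, offset = gesture_bounds(vel)
print(onset, peak, offset)
```

A relative threshold of this kind makes the landmarks comparable across fast and slow movements, because the criterion scales with each gesture's own peak velocity.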
Using the landmarks in Figure 2, we extracted data from the vertical and horizontal positions of the tongue tip (TT) and the tongue dorsum (TD). We also analysed the orientation of the TT since this potentially provided information about the retroflex, in which the tongue tip is often known to be curled upwards (Ladefoged 2001; Hamann 2002a, b; Toda et al. 2010; Bukmaier et al. 2014).
2.3.2 Acoustic analysis
The synchronized acoustic data was digitized at 25,600 Hz and automatically segmented and labelled using forced alignment (Munich Automatic Segmentation tool; Schiel 2004). Calculations were made of spectra (256 point discrete Fourier transform with a 40 Hz frequency resolution, 5 ms Blackman window, and a frame shift of 5 ms) and of formant frequencies (F1–F4; pre-emphasis of −0.8, 20 ms Blackman window with a frame shift of 5 ms).
For the acoustic analysis of the fricative noise, spectra were extracted at the temporal midpoint between the acoustic onset and offset of each sibilant. These spectral data were reduced to a set of coefficients using the discrete cosine transformation (DCT) after converting the Hz to the mel scale. For an N-point mel-scaled spectrum, x(n), extending in frequency from n = 0 to N–1 points over the frequency range of 500–3500 mel (414–10,313 Hz), the mth DCT-coefficient Cm (m = 0, 1, 2) was calculated with the following equation:

$$C_m = \frac{2k_m}{N}\sum_{n=0}^{N-1} x(n)\cos\left(\frac{(2n+1)\,m\,\pi}{2N}\right)$$

where $k_m = 1/\sqrt{2}$ for m = 0 and $k_m = 1$ for m ≠ 0.
These three coefficients Cm (m = 0, 1, 2) encode the mean, the slope, and curvature, respectively, of the signal to which the DCT transformation was applied (Harrington 2010). Since preliminary studies of these had shown that the sibilants were optimally distinguished in the fricative noise from C1 and C2 (i.e. from the slope and curvature of the spectrum respectively), all further quantifications of the sibilants were based on these coefficients.
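The reduction of a mel-scaled spectrum to three DCT coefficients can be sketched as follows. Two assumptions are made here and are not confirmed by the text: the mel conversion is taken to be mel = 1000 · log2(1 + f/1000), chosen because it maps 500 and 3500 mel to roughly 414 Hz and 10,313 Hz, the range quoted above; and the DCT is scaled as (2 k_m / N) Σ x(n) cos((2n+1)mπ / 2N) with k_0 = 1/√2. The function names and the toy spectra are illustrative only.

```python
import math

def hz_to_mel(f_hz):
    """Assumed mel conversion: 1000 mel corresponds to 1000 Hz."""
    return 1000.0 * math.log2(1.0 + f_hz / 1000.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 1000.0 * (2.0 ** (mel / 1000.0) - 1.0)

def dct_coefficients(x, n_coef=3):
    """First n_coef DCT coefficients of the sequence x (assumed scaling)."""
    n_points = len(x)
    coefs = []
    for m in range(n_coef):
        k_m = 1.0 / math.sqrt(2.0) if m == 0 else 1.0
        total = sum(
            x[n] * math.cos((2 * n + 1) * m * math.pi / (2 * n_points))
            for n in range(n_points)
        )
        coefs.append(2.0 * k_m / n_points * total)
    return coefs

# Frequency range quoted in the text, converted from mel to Hz.
print(mel_to_hz(500.0), mel_to_hz(3500.0))

# Toy spectra (dB values on an equally spaced mel axis).
flat = [10.0] * 64                  # no slope, no curvature
ramp = [float(n) for n in range(64)]  # rising straight line

c_flat = dct_coefficients(flat)
c_ramp = dct_coefficients(ramp)
print(c_flat)
print(c_ramp)
```

With this cosine basis a flat spectrum yields C1 = C2 = 0, and a straight rising line yields a non-zero C1 (negative under this sign convention) and C2 = 0, which is the sense in which C1 tracks slope and C2 tracks curvature.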
The articulatory and formant data were speaker-normalized using standard normalization (Lobanov 1971). More specifically, where xP.i.T is a raw value of an articulatory or formant parameter P from speaker i at time point T, the corresponding normalized value XP.i.T was given by the following formula:

$$X_{P.i.T} = \frac{x_{P.i.T} - x_{P.i.m}}{x_{P.i.s}}$$

where xP.i.m and xP.i.s are the mean and standard deviation, respectively, of parameter P for speaker i. When normalization was applied to the data in the fricative noise, (xP.i.m, xP.i.s) were calculated from all frames of data between the fricatives' acoustic onset and offset; when normalization was applied to the formant parameters, (xP.i.m, xP.i.s) were calculated from all frames extending between the acoustic vowel onset and offset.
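Lobanov normalization amounts to a per-speaker z-score: each raw value is expressed as a distance from that speaker's mean in units of that speaker's standard deviation. A minimal sketch follows; the function name, the speaker labels and the F2 values are hypothetical, and the use of the sample (n − 1) standard deviation is an assumption.

```python
import statistics

def lobanov(values_by_speaker):
    """Z-score each speaker's values against that speaker's own mean and SD."""
    normalized = {}
    for speaker, values in values_by_speaker.items():
        mean = statistics.fmean(values)
        sd = statistics.stdev(values)  # sample SD (assumption)
        normalized[speaker] = [(v - mean) / sd for v in values]
    return normalized

# Hypothetical F2 values (Hz) for two speakers with different vocal tract sizes.
raw_f2 = {
    "P1": [1450.0, 1620.0, 1980.0, 2100.0, 1700.0],
    "P2": [1700.0, 1900.0, 2350.0, 2500.0, 2050.0],
}
norm_f2 = lobanov(raw_f2)
for speaker, values in norm_f2.items():
    print(speaker, [round(v, 2) for v in values])
```

After normalization every speaker's values have a mean of zero and a standard deviation of one, so that data from different speakers can be pooled.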
We also carried out a Gaussian classification of the acoustic (spectral, formant) data in order to determine the degree of separation between the three fricative places of articulation. Classification was based on quadratic discriminant analysis (Srivastava, Jermyn & Joshi 2007) in which there was a training and a testing stage. During the training stage, each fricative class consisting of a number of observations in a two-dimensional acoustic space was modelled as a bivariate Gaussian distribution; in the testing phase, observations were classified as one of the fricative classes based on the greatest posterior probability. The relationship between training and testing was accomplished using the leave-one-out procedure in which, iteratively for each of the nine speakers in turn, a given speaker's data were classified following training on the data of the other eight speakers. For the fricative noise, the two parameters were C1 and C2 as defined above extracted at the acoustic temporal midpoint of the fricative; for the vowel, the two parameters were F2 and F3 at the acoustic vowel onset. In vowel classifications, training and testing were additionally carried out using this leave-one-out procedure separately in each of the three /a ɛ ɔ/ vowel contexts. The classifications as described above were separately accomplished in the slow and fast rate contexts (thus four classifications: two (slow/fast) based on C1 and C2 and two (slow/fast) on F2 and F3 at the acoustic vowel onset).
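The train/test scheme described above can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the code used in the study: class priors are estimated from the training proportions, covariances use the n − 1 denominator, and the data are synthetic two-dimensional points drawn around hypothetical class centres rather than measured C1/C2 or F2/F3 values.

```python
import math
import random

def fit_gaussian(points):
    """Mean and covariance (sxx, sxy, syy) of a list of (x, y) points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def log_density(point, mean, cov):
    """Log of the bivariate Gaussian density at `point`."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    maha = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return -0.5 * (maha + math.log(det)) - math.log(2.0 * math.pi)

def leave_one_speaker_out(data):
    """data: list of (speaker, label, (p1, p2)). Returns proportion correct per label."""
    speakers = sorted({spk for spk, _, _ in data})
    labels = sorted({lab for _, lab, _ in data})
    hits = {lab: 0 for lab in labels}
    totals = {lab: 0 for lab in labels}
    for held_out in speakers:
        train = [d for d in data if d[0] != held_out]
        test = [d for d in data if d[0] == held_out]
        models = {lab: fit_gaussian([p for _, l, p in train if l == lab])
                  for lab in labels}
        priors = {lab: sum(l == lab for _, l, _ in train) / len(train)
                  for lab in labels}
        for _, true_label, point in test:
            # Greatest posterior probability = greatest log prior + log likelihood.
            guess = max(labels, key=lambda lab: math.log(priors[lab])
                        + log_density(point, *models[lab]))
            totals[true_label] += 1
            hits[true_label] += guess == true_label
    return {lab: hits[lab] / totals[lab] for lab in labels}

# Synthetic data: 9 speakers x 3 classes x 20 tokens around hypothetical centres.
rng = random.Random(0)
centres = {"s": (-2.0, 2.0), "ʂ": (1.0, -0.5), "ɕ": (1.5, 0.5)}
data = [(spk, lab, (rng.gauss(cx, 0.5), rng.gauss(cy, 0.5)))
        for spk in range(1, 10)
        for lab, (cx, cy) in centres.items()
        for _ in range(20)]

scores = leave_one_speaker_out(data)
print(scores)
```

Holding out one speaker at a time means that no speaker contributes to both the training and the test set, so the scores reflect how well the class models generalize to an unseen speaker.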
3 Results
The results are presented below separately for the fricative noise (Section 3.1) and for vowel transitions (Section 3.2). In both cases, the aim was to determine the extent to which there was separation between the three fricative places of articulation and to assess how far these two sets of cues provide complementary information for this purpose.
3.1 Frication
3.1.1 Physiological analysis
The aggregated tongue-tip data in Figure 3 shows a clear separation between the fricatives for each of the nine speakers. For most subjects, /s ʂ/ had the most fronted and retracted positions respectively, with /ɕ/ located along the front–back dimension between the other two sibilants. Additionally, the tongue tip was generally lower for /s/ than for the other two fricatives; and /ʂ/ tended to reach the highest position, perhaps as the tongue tip unfolded from an initially curled position.
Subsequent analyses showed that various combinations of two physiological parameters provided a very clear separation between the three fricative places of articulation. One of the most effective of these was the combination of the horizontal position of the tongue tip and its vertical orientation (Figure 4). Recall that the latter provides information about the sensor's rotation along the front–back axis. Since the tongue tip can be expected to be curled back in /ʂ/, but not in /ɕ/, the sensor which is affixed just behind the tongue tip should be rotated for /ʂ/ about the axis that is perpendicular to the sagittal plane – or at least to a greater extent than it is in /ɕ/. This, as Figure 4 shows, was the case for eight out of nine speakers, in which the rotation was greater for /ʂ/ than for /ɕ/: note in particular that this is the distinguishing feature for two speakers (P5, P8) for whom /ʂ ɕ/ were otherwise undifferentiated as far as the horizontal position of the TT was concerned. Figure 4 also shows that, with the exception of P6, there was almost complete separation between the three fricatives on these two dimensions for the remaining speakers. Thus the general conclusion is that /s ʂ ɕ/ were separated from each other as far as tongue-tip posture is concerned.
3.1.2 Acoustic analysis
We now consider the extent to which the clear physiological separation between the three fricatives was matched acoustically. The ensemble-averaged spectra in Figure 5 show that /s/ was separated from the other two fricatives by greater energy at higher frequencies, but that the ensemble-averaged spectral shapes for /ʂ ɕ/ were quite similar (see Appendix). We tested various combinations of spectral parameters at the fricatives' temporal midpoint including spectral moments (Forrest et al. 1988). The two which were most effective in separating the places of articulation were those that are proportional to the linear slope (C1) and curvature (C2) derived from the discrete cosine transformation, calculated after transforming the frequency axis to the mel scale as described in Section 2.3.2. For C1, if a regression line were drawn through the three spectra, then, as Figure 5 suggests, /s/ would be differentiated from the other two by its rising as opposed to falling slope. For C2, the greater the resemblance of the ensemble-averaged spectrum to a parabolic shape, the greater the values on C2. There is a clear parabolic shape in evidence for the ensemble-averaged /s/ spectrum in Figure 5, and the generally higher amplitude levels over a mid-frequency range for /ɕ/ than for /ʂ/ may provide some basis for their differentiation on this parameter.
For most speakers, Figure 6 shows an overlap of /ʂ ɕ/ in the C1 × C2 space, whereas /s/ was clearly separated from the other two sibilants (except for speaker P6). The data in Figure 6 were consistent with the classifications (see Section 2.3.2 above for details) which showed for the slow rate of speech (Table 3) 96% correct classification for /s/ as opposed to 77% and 63%, respectively, for /ʂ ɕ/. Table 3 shows a high degree of /ʂ ɕ/ confusion for the slow rate of speech with 25% of /ɕ/ being misclassified as /ʂ/ and 23% of /ʂ/ misclassified as /ɕ/. Table 3 also shows that the classification scores at the fast rate of speech showed a broadly similar pattern.
We tested the influence of place of articulation and rate on classification scores. We also tested the influence of whether the sibilant had occurred in a real word or a non-word. For this purpose, we ran a mixed model with the binary response (correctly or incorrectly classified consonant) as the dependent variable, with fixed factors that included place of articulation (three levels: /s ʂ ɕ/), word-type (two levels: real word/non-word), and rate (two levels: slow/fast); and with the speaker (nine levels) and word (22 levels: the separate words and non-words in Table 1) as random factors. We also included all the interaction terms between the fixed factors in the model. We assessed the influence of word-type and rate by comparing two models: one with all the factors included, as outlined above; and one that differed from this by dropping word-type and rate. A comparison of these two models showed no significant differences: thus neither word-type nor rate had any significant influence on classification scores. Predictably, classification scores were significantly influenced by place of articulation (χ² = 67.2, p < .001).
3.2 Coarticulatory effects on adjacent vowels
In the preceding section, we showed that the very clear separation between the three fricatives based on the tongue configuration was not matched by the acoustic analysis of the fricative noise, which showed a substantial /ʂ ɕ/ confusion. Here we apply a similar type of analysis to the onset of the transitions into the vowel.
3.2.1 Physiological analysis
With the exception of speaker P4 (for whom the TB trajectories of dental and alveolopalatal were quite similar), Figure 7 shows that the vertical TB position was higher in vowels following /ɕ/ (indicated by higher vertical TB values), while in vowels following /s ʂ/ the vertical TB position was lower (indicated by lower vertical TB values). These findings suggest that /ɕ/ exerted a strong coarticulatory influence on the following vowel. Figure 7 also shows that, with the exception of speaker P6, /ʂ/ had a more retracted tongue body position compared to /s/. Thus, there is considerable information in the tongue dorsum at the vowel onset and often throughout the vowel for the distinction between the three fricatives.
3.2.2 Acoustic analysis
For all speakers, the F2 transition data in Figure 8 shows higher F2 values following /ɕ/, consistent with the observations of the physiological analysis in Figure 7. Although /ʂ ɕ/ overlapped in F2, they were separated to a certain extent by the lower F3 for /ʂ/.
Figure 9 illustrates further the strong coarticulatory influence of /ɕ/ on the vowels causing marked F2 raising for all vowels and F1 lowering in an /ɛ/ context. Thus, these data provide further evidence that vowels in a /ɕ/ context are strongly palatalized.
The results of the leave-one-out classification based on a two-parameter model of F2 onset and F3 onset show a high classification score of 91% at the slow rate for /ɕ/ with equal confusion between the other two fricatives on these parameters (Table 4). Although the identification rates of /s ʂ/ at the slow rate (82%, 71% respectively) were well above chance level (33%), there was also marked confusion between them (26% of /ʂ/ misclassified as /s/ and 13% of /s/ misclassified as /ʂ/). Table 4 also shows a similar pattern of classification scores for the fast speech condition. A mixed model with the binary response correct/incorrect classification score based on these classifications from the combined F2 and F3 onset and with the same fixed and random factors as deployed earlier (Section 3.1.2) showed no significant effects for either rate or for word-type. Thus once again, the classification scores were unaffected by rate or word-type (whether the word was a real word or a non-word). Predictably, the classification scores were significantly influenced by consonant place of articulation (χ² = 23.6, p < .001).
4 General discussion
The main aim of the present study has been to shed light on the acoustic and articulatory characteristics of the three Polish sibilants /s ʂ ɕ/ and to test whether the greater phonetic instability in /ʂ/ may be the source of the reduction of the three-way contrast to a two-way distinction that has been observed in certain Polish varieties and in Mandarin Chinese. We begin by considering the degree to which the fricatives were separated in the noise and transitions in turn.
Earlier studies have generally reported a very high separation between /s ʂ ɕ/ when listeners are presented with noise sections alone (Nowak 2006). Our physiological data for nine speakers shows quite unequivocally that /s ʂ ɕ/ were all distinguished on the basis of the position and configuration of the tongue. In particular, /s/ was (predictably) shown to have a very forward tongue-tip constriction, and it was most retracted for /ʂ/: this result is consistent with a physiological analysis of fricatives in Mandarin Chinese by Hu (2008) and Proctor et al. (2012), who showed a more retracted position for /ʂ/ than for /s ɕ/. The tongue-tip retraction in our data came about because the tip was (as is typical for retroflex consonants) curled back towards the hard palate. This posture was also the main characteristic that differentiated it from /ɕ/; that is, /ɕ ʂ/ differed according to the rotation of the tongue tip about the axis that is perpendicular to the sagittal plane. Just these two parameterizations of the tongue tip (horizontal position, rotation) were sufficient for an almost complete separation between /s ʂ ɕ/. Thus the very high perceptual distinction between these fricatives based on noise found by Nowak (2006) is likely to be related to their marked physiological differences in the tongue position and orientation found in our study.
Our acoustic analysis of the fricative noise was consistent with that of Nowak (2006) and others (e.g. Jassem 1995) in showing a very clear separation between /s/ and the other two categories based on the greater concentration of energy at higher frequencies. According to Halle & Stevens (1997), theoretical considerations of vocal tract modelling suggest that energy typically found in the region associated with the second formant frequency should be lower for /ʂ/ than for /ɕ/; thus, /ɕ/ has a much narrower palatal constriction that suppresses back cavity resonances leading to an energy increase in the spectral region close to F2. In general, such a difference should result in a slightly greater weighting of spectral energy towards the lower frequency values for /ʂ/ than for /ɕ/. This is exactly what is evident from the ensemble-averaged spectra in Figure 5 above, which show a spectral peak in the vicinity of 2 kHz (i.e. in the region of F2) for /ʂ/ which is absent for /ɕ/. These observed differences are consistent with the findings by Li, Edwards & Beckman (2007), who found that energy in this F2 region of the noise spectrum effectively distinguished between /ʂ ɕ/ in Mandarin Chinese. Beyond these differences, and consistently with Nowak (2006), our study shows very similar spectral shapes for /ʂ ɕ/: that is, /ʂ ɕ/ differed principally in that a similar spectral shape occurred at slightly lower frequencies for the retroflex. Compatibly, Żygis & Hamann (2003) showed that a lower spectral centre of gravity of the noise separated /ʂ/ from /ɕ/ for a female speaker, although not in their male speaker.
In the semi-open classification test in which we trained and tested the three fricative categorizations based on a DCT parameterization, although their classification rates were well above chance, around 25% of /ʂ ɕ/ were nevertheless confused with each other. This result suggests that, in spite of the very clear physiological distinction, the acoustics of the fricative noise alone are unlikely to provide sufficient information in more casual, spontaneous speech for their separation.
Numerous studies in the last 50 years have shown that formant transitions provide contributory information to fricatives’ place of articulation distinctions. This was shown to be especially so for the non-sibilants /f θ/ in English (Harris 1958). However, other studies have shown that formant transitions into the following vowel can also be important for the /s ʃ/ separation (Delattre, Liberman & Cooper 1962, Soli 1981, Whalen 1991, Lisker 2001, Gordon, Barthmaier & Sands 2002, Wagner et al. 2006, Li et al. 2007; see also Wagner et al. 2006 for a comprehensive review). Drawing on an analysis of Shona fricatives, Bladon, Clark & Mickey (1987) were among the first to suggest that formant transitions may be critical in languages with a three-way place of articulation contrast in sibilants. The results from our study show that the second formant transition provided especially salient information for identifying /ɕ/. In agreement with Nowak (2006) and Sawicka (1995), our results also show that /ɕ/ exerted a strong coarticulatory influence on adjacent vowels causing them to be palatalized: in particular, our physiological data showed a raised tongue-dorsum position at vowel onset extending well into the vowel for /ɕ/ for all speakers and contexts (Figure 7 above) and a concomitant raised F2 throughout the first half of the vowel once again in all speakers and contexts. This finding is also consistent with a perceptual study by Lisker (2001) who showed that English listeners were able to separate /ʂ ɕ/ quite reliably only on the basis of acoustic information in the vowel.
A further new finding from our study is that it is not just F2 but also F3 that may contribute to this distinction. The acoustic theory of speech production predicts that retroflex consonants should be associated with F3 lowering (Fant 1960), and the results from our study show that F3 was lower for /ʂ/ than for the other two fricatives. F3 lowering for /ʂ/ was also found in the acoustic analysis of the Toda language by Gordon et al. (2002). The main result from our semi-open categorizations based on a two-dimensional space of F2 and F3 at vowel onset was that classification scores were well above chance and that almost 90% of /ɕ/ could be identified from this information in the vowel. While the classification scores for /ʂ/ were high at just over 70%, the same data also show that there was substantial /s ʂ/ confusion: 12% of /s/ were misclassified as /ʂ/ and 26% of /ʂ/ as /s/. However, this confusion would presumably be resolved in combination with the fricative noise, which according to our analyses enabled an almost 95% separation of /s/ from the other two fricative categories. Overall then, the general finding from this study is that the fricative noise provides positive information for the separation of /s/ from /ʂ ɕ/ and that transitions distinguish /ɕ/ from /s ʂ/; therefore, the successful identification of /ʂ/ from acoustic data must depend both on information in the noise (to separate it from /s/) and on information in the vowel (to separate it from /ɕ/). Our study shows that /s/ can be distinguished from the other two fricative categories with reference to information in the noise alone (if the energy in the spectrum of the noise is concentrated in the upper part of the spectrum) and /ɕ/ can be separated from the other two fricative categories using information in the vowel (if F2 at vowel onset is high).
On the other hand, /ʂ/ requires two sets of cues for its identification to separate it from the other two fricative categories: one in the noise (the energy must be concentrated in the lower part of the spectrum to distinguish it from /s/) and one in the vowel (F2 and F3 must be low to distinguish it from /ɕ/).
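The cue combination described above can be summarized as a toy decision rule. This is only an illustrative sketch: the function name, the category labels, and the two pre-thresholded boolean cues are assumptions introduced here, and no threshold values are implied by our analyses.

```python
def identify_sibilant(noise_energy_high: bool, f2_onset_high: bool) -> str:
    """Toy decision rule over two hypothetical, pre-thresholded cues.

    noise_energy_high: noise energy concentrated in the upper spectrum.
    f2_onset_high:     F2 at vowel onset is high.
    """
    if noise_energy_high:
        # The noise cue on its own points to the dental /s/.
        return "dental"
    if f2_onset_high:
        # The vowel-onset cue points to the alveolopalatal.
        return "alveolopalatal"
    # The retroflex is reached only when BOTH cues are low.
    return "retroflex"


# The order of the first two checks is an arbitrary choice for the
# (conflicting) case in which both cues are high.
for cues in [(True, False), (False, True), (False, False)]:
    print(cues, identify_sibilant(*cues))
```

The point of the sketch is structural: the dental and the alveolopalatal can each be reached from a single positive cue, whereas the retroflex is only reached once both cues have been consulted.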
Our study also showed, in contrast to an earlier analysis in Bukmaier et al. (2014), that rate had no effect either on the fricative noise or on the vowel transitions. This was not because the speakers did not vary speaking rate: for every one of the speakers, the duration of both the fricative noise and the vowel was shorter at the fast than at the slow tempo. We currently have no explanation for the divergent findings between the present study and that of Bukmaier et al. (2014) but can tentatively conclude that rate effects are unlikely to be a synchronic factor involved in diachronic /ʂ/ attrition.
Finally, we consider the issue of whether the retroflex consonant is likely to be the most unstable of the three categories from both a synchronic and a diachronic perspective. The instability of the retroflex has been suggested, independently of this study, by both Duanmu (2006) and Nowak (2006), who point to the likelihood of the collapse of a three-way to a two-way contrast in many varieties of Mandarin and Polish, typically because of a merger of the dental and retroflex consonants. Similarly, the Taiwan variety of Mandarin lacks the three-way sibilant contrast found in standard Mandarin because the retroflex is frequently substituted by the dental fricative (Chuang & Fon 2010) under the influence of Min (which lacks retroflex consonants). Chuang & Fon's study also showed that under prosodic prominence speakers typically enhanced only one of the two /s ʂ/ fricatives, rather than both, and in most cases the enhancement was in /s/. As Ladefoged & Bhaskararao (1983) point out, it is the complexity of gestures involved in the production of retroflex consonants which may explain not only the type of diachronic changes noted above, but also why they are typologically rare, occurring only in languages with large coronal inventories (i.e. there is no known language that has retroflex consonants as the only coronal). To this we would add that it is perhaps not just the articulatory complexity but also the non-linear relationship to acoustics that may make /ʂ/ unstable: that is, whereas in our study retroflex consonants were unambiguously separated from the other two categories on the basis of tongue position, they remained highly confusable both with /ɕ/ in the fricative noise and with /s/ in the vowel.
Thus /ʂ/ may be an example of what in Lindblom's (1998) model is considered to be a high-cost articulation: one involving complex articulatory maneuvers that nevertheless achieve only a limited degree of acoustic or perceptual salience in relation to the other fricative categories with which it contrasts.
Studies and analyses of child language acquisition also point to the relative instability of /ʂ/. Some studies of Polish have suggested that /ʂ/ is only acquired after /s ɕ/ (Łukaszewicz 2006) and/or that the contrast between dental and retroflex places of articulation emerges quite late (Łobacz 1996). Moreover, Nittrouer & Studdert-Kennedy (1987) and Nittrouer (1992, 2002) provide evidence that children rely much more than adults on dynamic rather than static information for phonetic categorization: for example, young children make far greater use of vowel transitions than of the noise for fricative categorization. With increasing age, this relationship changes so that they progressively take greater advantage of the information that is available in the fricative noise. In terms of the present Polish data, such a model predicts that the /s ʂ/ distinction would be most vulnerable and prone to confusion: this is both because there is, as the present study shows, insufficient information in the vowel for a clear separation of /s ʂ/ and because children would be, according to Nittrouer's model, less able to take advantage of the critical cues for the separation of /ʂ/ from other fricatives in the comparatively much more static noise section. Further empirical analyses of these fricatives need to be conducted in order to test whether the diachronic instability of /ʂ/ has its origins in the greater confusion of the production and perception of /s ʂ/ by children that is predicted by the results of the present study.
Acknowledgements
This research was supported by ERC grant number 295573 ‘Sound change and the acquisition of speech’. We are grateful to three anonymous JIPA reviewers for their comments on an earlier version of this paper.
Appendix. Calculating spectra using the multitaper methodology
A reviewer suggested that ensemble-averaging requires special conditions following a multitaper methodology (Jesus & Shadle 2002; Shadle 2006, 2010). We tested whether this was so by calculating spectra using a multitaper approach in Matlab (Percival & Walden 1993). We used the default settings with a value of four for the time-bandwidth product (which in turn determines the number of tapers used); the multitaper analysis also used Thomson's (1982) adaptive nonlinear method for combining the individual spectral estimates. As Figure A1 shows, we obtained almost identical results using our approach (Figure 5 in the main text above) and the multitaper method. Our results are therefore consistent with those in Reidy (2015) showing that, while the multitaper approach may be beneficial for estimating peaks and troughs in the spectrum, its use makes very little difference to parameterizations such as spectral moments (or the types of DCT coefficients we have used here) that are based on the sum of amplitude estimates.
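The insensitivity of sum-based parameters to the choice of spectral estimator can be illustrated with a small simulation. The sketch below is not the Matlab analysis reported here: it assumes synthetic high-frequency-weighted noise, a hypothetical sampling rate and frame length, sine tapers instead of the Slepian tapers, and a plain unweighted average of the tapered spectra instead of Thomson's adaptive weighting. It compares the first spectral moment of an ensemble-averaged single-taper (Hamming) spectrum with that of an ensemble-averaged seven-taper spectrum, seven being the 2NW − 1 tapers that a time-bandwidth product NW of four implies.

```python
import cmath
import math
import random

FS = 20000.0   # hypothetical sampling rate in Hz
N = 128        # hypothetical frame length in samples
N_FRAMES = 16  # frames entering the ensemble average
N_TAPERS = 7   # 2 * NW - 1 for a time-bandwidth product NW = 4

# Precomputed complex exponentials for a direct DFT of length N.
TWIDDLE = [cmath.exp(-2j * math.pi * m / N) for m in range(N)]


def power_spectrum(frame, taper):
    """Power at DFT bins 0..N/2 of a tapered frame (direct DFT)."""
    x = [s * w for s, w in zip(frame, taper)]
    return [
        abs(sum(v * TWIDDLE[(k * i) % N] for i, v in enumerate(x))) ** 2
        for k in range(N // 2 + 1)
    ]


def mean_spectrum(spectra):
    """Average a list of spectra bin by bin."""
    return [sum(column) / len(spectra) for column in zip(*spectra)]


def centroid(power):
    """First spectral moment: power-weighted mean frequency in Hz."""
    freqs = [k * FS / N for k in range(len(power))]
    return sum(f * p for f, p in zip(freqs, power)) / sum(power)


hamming = [0.54 - 0.46 * math.cos(2 * math.pi * i / (N - 1)) for i in range(N)]
# Sine tapers: an assumed stand-in for the Slepian (DPSS) tapers.
sine_tapers = [
    [math.sqrt(2 / (N + 1)) * math.sin(math.pi * (k + 1) * (i + 1) / (N + 1))
     for i in range(N)]
    for k in range(N_TAPERS)
]

rng = random.Random(0)
single, multi = [], []
for _ in range(N_FRAMES):
    white = [rng.gauss(0.0, 1.0) for _ in range(N + 1)]
    # First-difference filtering tilts the noise energy towards high frequencies.
    frame = [white[i + 1] - 0.9 * white[i] for i in range(N)]
    single.append(power_spectrum(frame, hamming))
    multi.append(mean_spectrum([power_spectrum(frame, t) for t in sine_tapers]))

centroid_single = centroid(mean_spectrum(single))
centroid_multi = centroid(mean_spectrum(multi))
print(round(centroid_single), round(centroid_multi))
```

Because the first moment is a ratio of sums over all frequency bins, the bin-to-bin smoothing that distinguishes the two estimators largely cancels out, so the two printed values should lie close together relative to the 10 kHz analysis bandwidth assumed here.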