1 Introduction
Udmurt, a Uralic language spoken in Russia, is commonly described as having fixed stress: word stress regularly targets the final syllable of a word, in all word classes (Lytkin & Tepliashina 1962; Winkler 2001, 2011).¹ This is illustrated in (1).
At the same time, there are several classes of morphosyntactically conditioned exceptions to stress-finality in Udmurt. These include, e.g., imperative verbs, which regularly have initial stress (Lytkin & Tepliashina 1962; Csúcs 1990; Winkler 2001, 2011), as shown in (2). Indicative and imperative verbs frequently form minimal pairs that differ only in stress placement, as illustrated by (1b) and (2).
In this paper, we report on two studies aimed at investigating the acoustic correlates of stress in Udmurt: vowel duration, intensity, fundamental frequency (f0), and vowel height (the properties of the first formant, F1). The first study targets the realization of nominals (mainly nouns and adjectives), as illustrated by (1a). The second study compares the stress properties of minimal pairs, as illustrated by (1b) and (2). Following the methodological suggestions in Roettger & Gordon (2017), we also control for the information-structural context and elicit the experimental items both when focused (F) and non-focused (non-F).
The paper is structured as follows. Section 2 introduces the relevant aspects of Udmurt phonology (2.1) and the stress system (2.2), summarizes the existing work on the phonetics of stress in Udmurt (2.3), and provides a brief overview of the existing work on the acoustic marking of stress and focus in a variety of languages (2.4). Section 3 provides the information on the methods: the stimuli (3.1), experimental procedure and participants (3.2), data processing (3.3), and analysis (3.4). Section 4 reports on the results of the two studies and is organized by acoustic measures: duration (4.1), intensity (4.2), fundamental frequency (f0) (4.3), and vowel height (F1) (4.4). Section 5 contains the discussion of the results, providing a summary of the main findings (5.1), information about interspeaker variation (5.2), and a preliminary Autosegmental-Metrical interpretation of the f0 findings (5.3). Section 6 concludes.
2 Previous work
2.1 Relevant aspects of Udmurt phonology
Udmurt has a seven-vowel system. The positions of the vowels in the vowel space (based on Vakhrushev & Denisov 1992: 26–27; Winkler 2011: 18) and transliterations are provided in Table 1. Since Udmurt uses the Cyrillic script, the Cyrillic orthographic symbols and their equivalents in the Finno-Ugric transcription (FUT), also called the Uralic Phonetic Alphabet (UPA) (see Setälä 1901), which is standardly used for the transliteration of Udmurt in the field of Finno-Ugristics, are provided in parentheses.
2.2 Final and non-final stress in Udmurt
Stress in Udmurt is commonly described as targeting the final syllable of a word. Final stress is not conditioned by morphological structure: with the addition of inflectional suffixes to the stem, stress shifts to the rightmost one (Winkler 2001: 10; Vakhrushev & Denisov 1992: 64). Final stress is also regularly found in borrowings: e.g., /kɲiˈɡa/ < Russ. /ˈkɲiɡa/ ‘book’ – though this may not hold in cases of intensive bilingualism (Winkler 2011: 11) and may depend on the type of borrowing (Vakhrushev & Denisov 1992: 66). Even if a Russian borrowing retains its stress pattern, the inflected forms adopt the Udmurt pattern: the rightmost inflectional affix carries stress (Winkler 2011: 22).
At the same time, there are some morphosyntactically conditioned exceptions to stress-finality: e.g., imperative verbs regularly have initial stress, as was shown in (2) above. Stress does not create important phonological contrasts, though: it only differentiates the members of minimal pairs formed by 3sg indicative verbs and 2sg or 2pl imperative verbs, depending on conjugation class (Tarakanov 1959: 175).
Udmurt verbs form two conjugation classes: Conjugation I and II, also called ɨ-verbs and a-verbs, respectively, based on the final vowel of the stem (which is visible in the infinitival form: e.g., /budɨ-nɨ/ ‘grow-inf’ and /vala-nɨ/ ‘understand-inf’). In the ɨ-verbs, minimal pairs are formed by present-tense 3sg indicatives and 2pl imperatives; in the a-verbs, minimal pairs are formed by present-tense 3sg indicatives and 2sg imperatives. This is shown in (3), with the minimal pairs boldfaced.²
Similarly, negated verbs, which are preceded by a negative auxiliary in Udmurt (see Edygarova 2015), are also stressed on the initial syllable: /ɘm ˈt͡ʃaʃete/ ‘we didn’t make noise’ (Lytkin & Tepliashina 1962: 47; Winkler 2011: 22). Outside of the realm of verbs, initial stress is found, e.g., in reduplicated adjectives, which carry a single initial stress: /ˈɡord-ɡord/ ‘very red’ (lit. ‘red-red’) (Lytkin & Tepliashina 1962: 47; Winkler 2011: 22).
Additionally, stress placement in certain words and/or word classes is described as varying between the initial and final syllables. To the best of our knowledge, these descriptions have not been investigated instrumentally. It is unclear what conditions the variability – it has been described as dependent on ‘utterance type’ (Lytkin & Tepliashina 1962: 48; Csúcs 1990: 29) or ‘emotional context of an utterance and/or logical emphasis’ (Alatyrev 1983). These cases include, e.g., pronouns formed with /vaɲ-/ ‘all, every-’, /koc-/ ‘every-/any-’, /kud-/ ‘which’, /so-/ ‘that’, /ta-/ ‘this’, /ma-/ ‘what’, /no-/ ‘no-’, /oɡ-/ ‘approximately’ (Lytkin & Tepliashina 1962: 48); a few illustrative examples are provided in (4).
Some other instances of reported variable stress placement include:
• certain adverbials (/ˈt͡ɕaʎak/ ∼ /t͡ɕaˈʎak/ ‘quickly’; /ˈjalan/ ∼ /jaˈlan/ ‘always’) (Lytkin & Tepliashina 1962: 48),
• wh-words (Vakhrushev & Denisov 1992: 66),
• adjectives derived with the suffix /-pɨr/: /ˈɡordpɨr/ ∼ /ɡordˈpɨr/ ‘reddish’ (Winkler 2011: 23),
• prohibitive verbs, in which stress may target either the negative particle or the first syllable of the lexical verb: /ˈen vera/ ∼ /en ˈvera/ ‘don’t say!’ (Lytkin & Tepliashina 1962: 47; Vakhrushev & Denisov 1992: 66; Winkler 2011: 23).
Since the phonological nature of this reported variability is unclear at present, we do not discuss these cases further. The interested reader is directed to the sources cited for more information.
Individual Udmurt dialects present further classes of exceptions to the strict stress-finality. In some varieties spoken in Northern Udmurtia, like the Middle Cheptsa dialect (Karpova 2005) and Beserman Udmurt (Tepliashina 1970), as well as in the Kukmor dialect, which belongs to the Southern Peripheral dialects (Kelmakov 1998: 74–75), verbs with plural agreement markers can be stressed on the penult or the ultima: /tuˈbomɅ/ ∼ /tuboˈmɅ/ ‘we will climb’.³ A similar pattern of non-final stress is found in agreeing converbs, inflecting postpositions, inflecting pronouns and possessed nouns, which take on the same agreement markers: /turnaˈkudɅ/ ‘when you are/were mowing’, /bɘrˈɕamɅ/ ‘behind/after us’, /vicˈnamɅ/ ‘the five of us’, /bakt͡ɕaˈjamɅ/ ‘in our garden’ (Georgieva 2017). Our elicitation materials did not include any items that could be subject to variable stress placement.
2.3 Previous studies on Udmurt stress
Some early instrumental studies investigating the nature of Udmurt stress are available, but their conclusions are quite limited in scope. Lytkin & Tepliashina (1962) conclude, based on a handful of experimental tokens produced by one native speaker, that stressed (i.e., final) syllables are about 1.5 times longer than unstressed ones – though this is only the case in words uttered in isolation; in running speech the difference between the duration of stressed and unstressed vowels is reported as much smaller (Lytkin & Tepliashina 1962: 22–24, 49). The authors also note that greater intensity and f0 may be used as a secondary means of marking stress. In contrast, Baitchura (1973) observes, based on data from four native speakers (number of experimental tokens not reported), that initial syllables are marked by greater intensity and f0, while final ones are 1.5–2 times longer than the initial ones. Baitchura (1973) interprets these findings as evidence for initial stress.
Finally, Vakhrushev & Denisov (1992), building on Denisov (1980), use di- and trisyllabic words as well as minimal pairs of 3sg indicative and 2sg/pl imperative verbs; the stimuli were tested with two native speakers. Their results show that the duration of stressed syllables is 1.6 times greater than that of unstressed syllables in disyllables and 1.7 times greater than that of unstressed syllables in trisyllables. The authors also compare the duration results for stimuli uttered in isolation with those used in connected speech and conclude that the duration of the stressed (final) syllables is greatest in words uttered in isolation and phrase finally. This points to a strong effect of final lengthening, which the authors themselves acknowledge (Vakhrushev & Denisov 1992: 74). In minimal pairs, the stressed syllables, both non-final and final, are shown to have greater duration than their unstressed counterparts: i.e., the stressed initial syllables of imperatives have greater duration than the unstressed initial syllables of indicatives, and the stressed final syllables of indicatives have greater duration than the unstressed final syllables of imperatives. With respect to intensity, Vakhrushev & Denisov (1992: 77) conclude that it does not consistently cue stress, though, generally, stressed syllables in the minimal pairs have greater intensity than their unstressed counterparts. With respect to f0 contours, Vakhrushev & Denisov (1992: 79) show that, in words uttered in isolation, the mean f0 of the second syllable is lower than that of the first syllable (which is also attributable to declarative intonation). In minimal pairs, the f0 results are more variable, but seem to point to a tendency for stressed syllables, both initial and final, to be associated with lower f0 values.
Overall, while the quantitative results in Vakhrushev & Denisov (1992) are not reported in detail, some of the general trends are clear. Their study served as an inspiration for our work.
2.4 Acoustic marking of stress and focus
The acoustic cues that have been mentioned in the existing studies of Udmurt stress, summarized in the previous section, are some of the cues canonically associated with the expression of stress. A non-exhaustive list of these cues, which are commonly discussed in the literature on the topic, includes duration of the stressed vowel/syllable, intensity (either overall or frequency-sensitive, also known as spectral tilt), formant frequency, as well as higher or lower f0 values on the stressed vowel/syllable. A detailed overview of the relative importance of these cues in a number of languages, based on a meticulous survey of the existing studies, is provided in Gordon & Roettger (2017). The overview shows that stressed vowels/syllables (or, occasionally, syllable codas/onsets) tend to have greater duration and/or greater intensity as compared to unstressed counterparts. With respect to formant frequency/vowel quality, stressed vowels are often more peripheral/lower in the vowel space than unstressed ones. Finally, higher or lower f0 values on the stressed vowel/syllable, in languages without lexical tone-based distinctions, are typically due to alignment with intonational f0 targets (high or low). The realization of focus/emphasis commonly relies on the cues that come from the same set; the cues for stress and focus in a given language may overlap, fully or partially (Vogel et al. 2016).
The unstressed counterparts that the properties of the stressed syllables/vowels are compared to may come from the same lexical item (i.e., precede or follow the stressed one in the same word – being in a so-called syntagmatic relationship). For instance, in the English noun /ˈpɝmɪt/ the properties of the stressed first syllable/vowel may be compared to those of the unstressed second one. Alternatively or additionally, in languages that allow for variable stress placement, a stressed syllable/vowel may be compared to an unstressed counterpart in the same position in a different word (a so-called paradigmatic relationship). For example, the realization of the stressed first syllable/vowel in the noun /ˈpɝmɪt/ may be compared to that in the unstressed initial syllable of the verb /pɚˈmɪt/.
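The two comparison types can be made concrete with a minimal sketch. The duration values below are invented purely for illustration (they are not measurements from any study), and the function and key names are hypothetical.

```python
# Hypothetical vowel durations in ms (invented for illustration, not measured data).
# Each word stores (initial-syllable vowel, final-syllable vowel) durations
# and the index of the stressed syllable (0 = initial, 1 = final).
words = {
    "permit_noun": {"durations": (95.0, 70.0), "stressed": 0},
    "permit_verb": {"durations": (60.0, 110.0), "stressed": 1},
}


def syntagmatic_ratio(word):
    """Stressed vs. unstressed vowel duration within the same word."""
    durations = word["durations"]
    stressed = durations[word["stressed"]]
    unstressed = durations[1 - word["stressed"]]
    return stressed / unstressed


def paradigmatic_ratio(word_a, word_b, position):
    """Same syllable position compared across two words (a over b)."""
    return word_a["durations"][position] / word_b["durations"][position]


noun, verb = words["permit_noun"], words["permit_verb"]
# Stressed initial vs. unstressed final syllable of the same word.
print(syntagmatic_ratio(noun))
# Stressed initial syllable (noun) vs. unstressed initial syllable (verb).
print(paradigmatic_ratio(noun, verb, 0))
```

A ratio above 1 in either comparison would indicate that the stressed vowel is longer than its unstressed counterpart; the same scheme applies to intensity, f0, or F1.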
With this background in mind, our hypotheses are the following. First, we expect stress in Udmurt to be cued by one or more of duration, intensity, vowel quality, and f0, with the relevant comparisons being syntagmatic or paradigmatic in nature. Second, we expect focus to be expressed by one or more of the cues from the same set.
3 Methods
3.1 Stimuli
Our investigation consisted of two production studies. The first one targeted Udmurt nominals, and the second one investigated the stress properties of minimal pairs formed by indicative and imperative verbs. The test words were collected from a dictionary (Kirillova 2008) and checked with a native speaker who did not participate in the study. The choice of the test words, as described below, was determined by syllable structure and vowel height properties. Since the test words were selected from a dictionary of standard Udmurt, none of the stimuli were dialectal in a strict sense – i.e., used only in a particular dialect. Standard Udmurt sometimes codifies more than one lexeme/variant (often coming from different dialects) as standard, and speakers of Udmurt are usually familiar with the different lexemes in these cases, especially if, like our participants, they have studied standard Udmurt. Nevertheless, the speakers were instructed to skip a test word if they felt that they could not pronounce it in a natural way.
The materials for the first study comprised 109 Udmurt nouns, adjectives, and postpositions (which correspond to a nominal base inflected with a case suffix). All stimuli consisted of CV syllables and were controlled for syllable count (di- and trisyllabic) and vowel height (low, mid, and high; all vowels in a given word are of the same height). Both voiced and voiceless onsets were allowed, in order not to restrict the size of the dataset (for the purposes of f0 analysis, the first 20 ms of the vowel were discarded; Xu 2013). Because morphological structure is not mentioned in previous works as relevant for the purposes of stress assignment, both mono- and polymorphemic test words were used (excluding any morphology that may influence stress assignment, as described in Section 2.2). The breakdown of the dataset by syllable count and vowel height is provided in Table 2. The smaller number of ‘low’ stimuli is due to the fact that there is only one low vowel in Udmurt, /a/, as compared to three mid vowels (/e/, /ɘ/, and /o/) and three high vowels (/i/, /ɨ/, and /u/), which limits the number of possible stimuli with low vowels. The full list of stimuli used in the first study is provided in Appendix B.
All stimuli were embedded in carrier sentences as direct quotes; stress was not marked in any of the words. To control for phrasal prosodic environment, two sets of carrier phrases were constructed, following the set of recommendations in Roettger & Gordon (2017). In the first set, the test word was under narrow (contrastive) focus (henceforth referred to as ‘F’). This was ensured by explicitly contrasting the test word with another (following) word, of the same syllable count and structure, which was part of the carrier sentence. This is illustrated in (5) for the test word /baka/ ‘frog’, contrasted with the word /daɡa/ ‘horseshoe’. Only the first of the two words was analyzed.
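The exclusion of the first 20 ms of each vowel from the f0 analysis can be sketched as follows. This is only an illustration under stated assumptions: the function name, the data layout (a list of time/f0 pairs in ms and Hz), the sample values, and the treatment of a sample lying exactly on the 20 ms boundary are all hypothetical and do not reproduce the script actually used.

```python
ONSET_TRIM_MS = 20.0  # value stated in the text


def trim_vowel_onset(samples, vowel_start_ms, trim_ms=ONSET_TRIM_MS):
    """Drop (time_ms, f0_hz) samples within the first trim_ms of the vowel.

    Assumption: a sample exactly at the cutoff is kept.
    """
    cutoff = vowel_start_ms + trim_ms
    return [(t, f0) for (t, f0) in samples if t >= cutoff]


# Hypothetical f0 track (values invented for illustration).
track = [(100.0, 210.0), (110.0, 205.0), (120.0, 200.0), (130.0, 198.0)]
print(trim_vowel_onset(track, vowel_start_ms=100.0))
```

Working in milliseconds rather than seconds keeps the cutoff arithmetic exact for the values used here.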
In the second set of carrier sentences, the test word was explicitly out of focus (henceforth referred to as ‘non-F’). This was ensured by placing contrastive focus on another constituent (an adverb), which was part of the carrier phrase. Two subtypes of this kind of carrier phrase were used, containing different pairs of adverbs, to make the test sentences less repetitive. They are provided in (6).
One-hundred-and-nine test words by two phrasal conditions (F and non-F) yielded 218 test sentences. The test sentences were presented to participants in a randomized order. Three different randomized orders were created in order to control for effects of newness/familiarity of stimuli. Each participant was assigned to one of the randomizations.
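The construction of the test-sentence list and of the three presentation orders can be sketched as below. The placeholder word identifiers, the function name, and the seeds are assumptions made for illustration; the text does not specify how the randomized orders were generated.

```python
import random

N_TEST_WORDS = 109
CONDITIONS = ("F", "non-F")

# Placeholder identifiers stand in for the actual test words.
test_sentences = [
    (f"word_{i:03d}", condition)
    for i in range(1, N_TEST_WORDS + 1)
    for condition in CONDITIONS
]


def make_randomizations(items, n_orders=3, base_seed=0):
    """Return n_orders independently shuffled copies of items."""
    orders = []
    for k in range(n_orders):
        rng = random.Random(base_seed + k)  # arbitrary seeds, for reproducibility only
        order = list(items)
        rng.shuffle(order)
        orders.append(order)
    return orders


orders = make_randomizations(test_sentences)
print(len(test_sentences), len(orders))  # 218 3
```

Each order contains the same 218 sentences, so any participant sees every word in both phrasal conditions regardless of the randomization assigned.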
The second study consisted of 43 minimal pairs formed by indicative and imperative verbs – i.e., 86 verb forms in total. Like the nominals in the first study, the verbs in the second study were di- and trisyllabic, consisted of CV syllables and were controlled for vowel height (low, mid, high). For morphological reasons, though, the final syllables, corresponding to prs.3sg/imp.2 markers, could only contain mid or low vowels, as was illustrated in (3) in Section 2.2. This means that in the high vowels category (i.e., the test words in which all vowels were supposed to be high), only the root vowel(s) were high, and the final syllable contained a mid vowel, /e/. Accordingly, we label this type of stimuli ‘high+mid’; for the purposes of analysis, the final mid vowels of the verbs in the high+mid category were grouped together with the other mid vowels. The breakdown of the dataset by syllable count and vowel height is provided in Table 3. The full list of stimuli used in the second study is provided in Appendix C.
Like in the first study, the phrasal prosodic context in which the test words appeared was controlled with the help of carrier sentences. In the focused (F) context, a test verb was explicitly contrasted with another verb of the same type (i.e., indicative or imperative) and same syllabic structure, as shown in (7). In the non-focused (non-F) context, with two subtypes, an explicit contrast was established between other elements of the carrier sentence (adverbs), as illustrated in (8). Similarly to the first study, the test words were used as direct quotes within the carrier sentences and stress was not marked on any of the words. Given that the second study targeted minimal pairs, the test materials indicated whether the speakers should produce an indicative or an imperative verb. If the verb was meant to be used as an imperative, it was accompanied by an exclamation mark within the direct quote; the indicative verbs were left unmarked. The participants were informed that the exclamation marks identify imperative verbs but are not meant to elicit exclamative intonation.
The 43 minimal pairs, equaling 86 verbs, multiplied by two phrasal contexts, produced 172 test sentences. Like in the first study, the stimuli were randomized; three different randomizations were used. Each participant was assigned to one randomization.
3.2 Procedure and participants
During the recording sessions for both studies, the test sentences in standard Udmurt orthography were presented to the participants on a computer screen, one sentence at a time. The participants were instructed to familiarize themselves with the sentence and then pronounce it using natural intonation. Each test sentence was uttered once by a participant. If the participant was not happy with the way they pronounced the sentence, they were allowed to re-do it; in such cases, all responses except the final one were discarded. Before proceeding to the test sentences in each study, the participants were required to complete a short training phase, consisting of four simple Udmurt sentences of various structure that they were asked to pronounce, in order to get accustomed to the experimental setting.
The studies were conducted in June 2020 in Budapest, Hungary. The recordings were carried out in a quiet room, using a Zoom H4n recorder and a close-range head-worn Shure SM10A microphone. Six native speakers of Udmurt (Sp1–Sp6) took part in the first study; five of the same six native speakers, except Sp3, also took part in the second study. The speakers received a small remuneration for their participation in the experiment (a gift card). The speakers were all female; age range: 22−39, mean age: 29.5 years. All were studying/working in Budapest, Hungary, at the time of the recording (the duration of residency in Hungary ranged from 1 month to 8 years and 10 months, mean: 4.625 years). All participants were Udmurt-dominant Udmurt-Russian bilingual speakers.
Four of the speakers were born and raised in central Udmurtia (Sp1, Sp2, Sp5, Sp6), one in northern Udmurtia (Sp4), and one in central-southern Udmurtia as well as in Izhevsk, the capital of Udmurtia (Sp3). All speakers lived in Izhevsk as adolescents, before relocating to Hungary, and have studied the standard variety of Udmurt in school and/or at the university. Given their background, we assumed that the participants’ speech would show both features characteristic of the respective (sub)dialects, as well as those of standard Udmurt; as far as we can tell, this is indeed the case. This is in line with recent sociolinguistic studies: Edygarova (2014) describes the colloquial language spoken among Udmurts from different dialect groups, primarily in an informal urban setting, as a so-called ‘cross-local vernacular variety of Udmurt’, which is a mix of local dialects with the standard variety and Russian code-switching. Furthermore, Edygarova (2014) argues that standard Udmurt is not a native language for Udmurt speakers, but rather an acquired literary style, primarily mastered through explicit linguistic training. Because of this complex sociolinguistic context, we instructed the speakers to pronounce the test sentences in the way that is most natural for them, as our intention was to study their native varieties as spoken by young urban-based speakers who also frequently use the cross-local Udmurt vernacular.
It is important to note that the (cross)dialectal background of our consultants does not differ from the standard language with respect to the vowel inventory. According to Kelmakov (1998: 47, 60–61), the dialects spoken in Udmurtia (Northern, Central, Southern), as well as the standard language, have a seven-vowel system (as in Table 1) – as opposed to vowel systems that include up to ten vowels, which are characteristic of Udmurt-speaking communities outside Udmurtia proper. The main point of variation among some of the dialects spoken within Udmurtia proper is the use of /Ʌ/ instead of /ɨ/ (see also footnote 3). As far as we can tell, this does not apply to the speech of our participants, which is confirmed by the formant distribution plots in Figure 9 and Figure 11. Accordingly, we expect no pronounced qualitative differences between the relevant aspects of the varieties spoken by our participants.
We also carefully controlled for the differences related to stress between Udmurt dialects, making sure not to include any test material where stress location may vary (see Section 2). Potential differences in phrasal prosody among Udmurt dialects have not been studied; it is a question for further research to determine whether the interspeaker variation that we notice in our data (Section 5.2) is to be explained as dialectal or idiolectal in nature. Due to the limited number of speakers in our study, we refrain from making any claims to this effect.
The recording sessions for the first study lasted between 16 and 47 minutes per participant, and between 12 and 25 minutes for the second study; there was a 30-minute break between the two studies. In total, 1,308 test sentences were recorded during the first study (218 test sentences * six participants), and 860 for the second study (172 test sentences * five participants).
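The sentence totals follow directly from the design figures given in Sections 3.1 and 3.2; the short check below merely restates that arithmetic (variable names are illustrative).

```python
# Counts taken from the text of Sections 3.1 and 3.2.
CONTEXTS = 2  # F and non-F

study1_sentences = 109 * CONTEXTS      # 109 nominal test words
study2_sentences = 43 * 2 * CONTEXTS   # 43 minimal pairs = 86 verb forms

study1_recorded = study1_sentences * 6  # six participants
study2_recorded = study2_sentences * 5  # five participants

print(study1_sentences, study1_recorded)  # 218 1308
print(study2_sentences, study2_recorded)  # 172 860
```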
3.3 Data processing
The audio files were manually annotated in Praat (2021) by trained research assistants, based on the segmentation criteria in Machač & Skarnitzl (2009), and checked by the authors. Disfluent responses (due to pauses, errors, false starts, throat clearing, etc.) were eliminated: forty-two in the first study, and fourteen in the second study.
While listening to the recordings, we identified the potential for a prosodic ambiguity in the carrier sentences in which the test words carried narrow focus – i.e., those like (5) and (7). Because negation in the second part of these sentences is expressed with a negative auxiliary, the sentences can be understood either as contrasting the test words in the two parts of the sentence or contrasting the verb in the first part with the negative auxiliary in the second part. That is, in examples like (5) and (7), either the two test words carry narrow (contrastive) focus (‘I said the word “frog”, and not the word “horseshoe”.’), or the two test words are interpreted as contrastive topics, and the verbs are narrowly (contrastively) focused (‘As for the word “frog”, I said it, but the word “horseshoe”, I didn’t.’).⁴ Because there is no way to construct sentences of this type in Udmurt other than with a negative auxiliary in the second part of the carrier sentence, the ambiguity is unavoidable. Accordingly, we eliminated the responses in which the verbs carried the main accent and were contrasted with each other. In total, we eliminated 279 ‘verb-focus’ responses in the first study and 91 ‘verb-focus’ responses in the second study. Because the ‘verb-focus’ confound only applied to the ‘F’ condition, the number of ‘F’ responses ended up being lower than the number of ‘non-F’ ones, in both studies. The ‘verb-focus’ reading was especially favored by some of the participants: speakers Sp3 and Sp6 produced all of their ‘F’ responses with focus on the verb, which led to the elimination of these responses.
Additionally, a native speaker of Udmurt who did not take part in the study listened to the recordings of the second study and eliminated the responses that were not produced on target (e.g., an indicative verb erroneously produced instead of an imperative one and vice versa). The responses eliminated for this reason totalled 22. The final counts of responses for both studies, broken down by focus type, syllable count, vowel height, and, in the second study, verb type, are provided in Table 4 and Table 5, respectively.
Abbreviations: disyll – disyllabic; n – total number; trisyll – trisyllabic
3.4 Analysis
3.4.1 Measurements
The ProsodyPro Praat script (Xu 2013) was used to collect the acoustic parameters of the annotated segments (vowel duration, intensity, average f0 per vowel, f0 at ten fixed points per vowel, and F1 and F2 values).
In order to ensure comparability between the data from the two studies, as well as between di- and trisyllabic test words, only the acoustic parameters on the initial and final syllables were analyzed (i.e., middle syllables of trisyllables were discarded). This does not necessarily mean that no stress cues are realized on the middle syllable of trisyllables. A preliminary exploration of the middle-syllable data points to some potentially relevant tendencies. In indicatives, the middle (i.e., pre-tonic) syllable may exhibit a degree of stress-related lengthening. In imperatives, there is wide variation in f0 on the second (i.e., post-tonic) syllable; this is not surprising, given that the f0 contour may be meaningful on the pre- and/or post-tonic syllables as well as the stressed syllable. As far as we can tell, though, any stress-related effects on the unstressed middle syllable in trisyllables are supplementary to the cues expressed on the stressed syllables themselves. For reasons of space, we leave a dedicated discussion of stress cues realized on syllables other than the stressed ones outside the scope of the current paper.
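The restriction to initial and final syllables amounts to a simple selection step, sketched below. The record layout, the function name, and the sample values are hypothetical and serve only to illustrate that the middle vowel of a trisyllable is dropped while disyllables are left intact.

```python
def initial_and_final(vowels):
    """Keep only the first and last vowel of a di- or trisyllabic word."""
    if len(vowels) not in (2, 3):
        raise ValueError("expected a di- or trisyllabic word")
    return {"initial": vowels[0], "final": vowels[-1]}


# Hypothetical per-vowel records (labels and durations invented for illustration).
disyllable = [{"vowel": "a", "dur_ms": 80.0}, {"vowel": "a", "dur_ms": 120.0}]
trisyllable = [
    {"vowel": "u", "dur_ms": 60.0},
    {"vowel": "i", "dur_ms": 65.0},
    {"vowel": "u", "dur_ms": 100.0},
]

print(initial_and_final(disyllable))
print(initial_and_final(trisyllable))  # the middle vowel is dropped
```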
3.4.2 Statistical analysis
The statistical analysis was carried out in R (R Core Team 2020), using the packages lme4 (Bates et al. 2015), lmerTest (Kuznetsova et al. 2017), and emmeans (Lenth 2022). Each of the acoustic measures (duration, intensity, f0, and F1 and F2 values) was analyzed using linear mixed effects models, fit with the lmer() function, with the acoustic measure as the dependent variable. Following the guidelines in Gries (2021), for each acoustic parameter, the most complex model was fit first, with the following fixed effects and their interactions: in the first study, focus type (with levels F and non-F), vowel height (with levels high, mid, and low), and syllable no. (with levels initial and final); in the second study, focus type (with levels F and non-F), vowel height (with levels high, mid, and low), syllable no. (with levels initial and final), and verb type (with levels indicative and imperative). The starting models also included random effects of word, nested in no. of syllables, and speaker; random slopes for syllable no. and focus type were also included. The starting models for the first and second studies are provided in (9a) and (9b), respectively.
The starting models did not converge for any of the acoustic measures. Next, the random effect structure was simplified, with the effects that accounted for the least amount of variance dropped first and the resulting models compared via the function anova(). After that, the fixed effect structure was simplified, via the function drop1(). Eventually, for each acoustic measure, the most complex model that converged without numerical problems and with all predictors being significant was selected and evaluated (reported in the individual subsections in Section 4). P-values were obtained with the lmerTest package. If the interactions between the fixed effects proved significant for a particular acoustic measure, further pairwise comparisons were carried out using the emmeans package.
4 Results
For ease of comparison of individual acoustic measures across the two studies, this section is divided into subsections based on acoustic measures, further subdivided into the results of the two studies.
4.1 Duration
4.1.1 First study (nominals)
Table 6 provides the mean vowel duration values in the test words of the first study. As these results show, final (stressed) syllables typically have greater duration than initial syllables, in non-F and especially in F contexts, for all vowel heights and in all syllable counts.
For the statistical analysis, the duration values were log-transformed (the raw data had a long right tail, corresponding to the outliers in Figure 1). The model that fit the data best included vowel height and syllable no. as fixed effects, speaker and word as random effects, and random slopes for syllable no. for both speaker and word (more complex random slopes led to non-convergence of the model); the model is summarized in (10). The interaction of fixed effects did not improve the model fit, suggesting that the effect of vowel height does not vary by syllable position. The remaining potential fixed effect, focus type, did not significantly affect the duration values, which means that vowel duration does not vary significantly depending on focus context. Notably, among the random effects, no. of syllables turned out to be redundant in the presence of word. A likelihood ratio test showed that the model is highly significant ($\chi^{2}(3) = 82.26$, $p < 0.001$), with a conditional $R^{2}$ of 0.788.
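The p-values attached to the likelihood ratio statistics in this section can be recomputed by hand. The following Python sketch is not part of the original R analysis; it uses a closed-form expression for the upper tail of the chi-square distribution that holds for integer degrees of freedom, and the function name `chi2_sf` is introduced here purely for illustration.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability P(X > x) for a chi-square variable with integer df."""
    if x < 0 or df < 1:
        raise ValueError("x must be non-negative and df a positive integer")
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) times a finite sum of (x/2)^k / k!.
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: start from the df = 1 case and add one term per extra pair of df.
    total = math.erfc(math.sqrt(half))
    for j in range(1, (df - 1) // 2 + 1):
        total += math.exp(-half) * half ** (j - 0.5) / math.gamma(j + 0.5)
    return total


# Statistic reported for the duration model of the first study.
p_value = chi2_sf(82.26, 3)
print(p_value < 0.001)
```

The same helper can be applied to the other reported statistics by substituting the corresponding value and degrees of freedom.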
Abbreviations: disyll – disyllabic; ms – milliseconds; n – total number; SD – standard deviation; trisyll – trisyllabic
The duration data, organized according to the fixed effects that proved to be significant (vowel height and syllable no.), is visualized in Figure 1. The output of the model is provided in Table 7. The results show that syllable number (which also corresponds to stress in the first study) has a significant effect on vowel duration, and so does vowel height. The absence of a significant interaction between the fixed effects suggests that syllable number has a comparable effect on duration at all vowel heights.
4.1.2 Second study (verbs)
The mean vowel duration values for initial and final syllables in verbs, broken down by verb type, focus type, syllable count, and vowel height are provided in Table 8. The grayed-out cells indicate that there were no final high vowels attested. The mid vowels that were used in the final syllables in the high-vowel contexts instead are pooled with the other final mid vowels. As the results show, both initial and final syllables, when stressed, are typically longer than their unstressed counterparts. The duration of vowels in focused verbs is greater than that in their non-focused counterparts – especially in indicatives.
Abbreviations: df – degrees of freedom; SE – standard error
As in the first study, the duration values were log-transformed for the statistical analysis. The model that fit the data best was more complex than in the first study: it included all four fixed effects and two interactions: syllable no. * vowel height + verb type * focus type, suggesting that (i) all fixed effects (or the interactions included) have a significant effect on duration, (ii) the effect of syllable number on duration varies by vowel height, and (iii) the effect of focus type on duration varies by verb type. The random effects included speaker and word, but a random slope for syllable no. could only be used with the random effect speaker; the model is summarized in (11). A likelihood ratio test showed that the model is highly significant ($\chi^{2}(5) = 96.839$, $p < 0.001$), with a conditional $R^{2}$ of 0.610.
The duration data, broken down according to three significant fixed effects (vowel height, syllable no., and focus type), and presented separately based on verb type, is provided in panels (a) and (b) of Figure 2. As panel (b) of Figure 2 shows, initial stress in low-vowel imperatives leads to initial vowels being longer than final ones, reversing the pattern shown for low-vowel indicatives in panel (a) – in the syntagmatic fashion described in Section 2.4. For mid vowels, though, initial stress in imperatives makes the initial vowels longer than their unstressed counterparts in indicatives, though not longer than final mid vowels in imperatives; this exemplifies paradigmatic signaling of stress.
The output of the model is provided in Table 9. It shows that syllable number alone does not significantly affect duration, while vowel quality, verb type, and focus type do (the latter to a lesser degree, though the effect is still significant). Additionally, there are significant interaction effects of syllable number and vowel height, and verb type and focus type.
Interactions of fixed effects indicate that their effect on duration is non-uniform. To start with the effect of syllable number, row [2] in Table 9 shows that there is no significant durational difference between initial and final high vowels (where the values for the latter are estimated by the model). A pairwise comparison with the emmeans() function shows that this is also the case for low vowels (Estimate = –0.0541, SE = 0.155, df = 4.24, t = –0.349, p = 0.744) and mid vowels (Estimate = –0.0980, SE = 0.155, df = 4.20, t = –0.633, p = 0.560). Next, row [5] in Table 9 shows that there is a significant difference between vowel durations in imperative and indicative verbs in the non-F condition. A pairwise comparison shows that this is the case in the F condition, too (Estimate = 0.105, SE = 0.0278, df = 1414, t = 3.774, p < 0.001***). Finally, row [6] in Table 9 shows that, within the imperative verbs, the effect of focus on duration is significant. A pairwise comparison shows that this is not the case in indicative verbs, though (Estimate = –0.0237, SE = 0.0268, df = 1414, t = –0.882, p = 0.370). In other words, focus is marked by duration in imperatives but not in indicatives.
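Two quick calculations can help in reading the pairwise comparisons above. The Python sketch below (again, not the R code used for the analysis) recomputes each t ratio as the estimate divided by its standard error and back-transforms the log-scale estimate into a ratio of durations. It assumes that the log transformation was the natural logarithm, which the text does not state, and the helper name `describe` is hypothetical.

```python
import math
from typing import Tuple


def describe(estimate: float, se: float) -> Tuple[float, float]:
    """Return (t ratio, multiplicative effect) for a log-scale estimate."""
    t_ratio = estimate / se
    # Assumption: durations were transformed with the natural logarithm,
    # so exp(estimate) can be read as a ratio of durations.
    ratio = math.exp(estimate)
    return t_ratio, ratio


# Estimates and standard errors as reported in the text.
comparisons = {
    "low vowels, initial vs. final": (-0.0541, 0.155),
    "mid vowels, initial vs. final": (-0.0980, 0.155),
    "verb type in the F condition": (0.105, 0.0278),
    "focus in indicatives": (-0.0237, 0.0268),
}

for label, (estimate, se) in comparisons.items():
    t_ratio, ratio = describe(estimate, se)
    print(f"{label}: t = {t_ratio:.3f}, duration ratio = {ratio:.3f}")
```

Under the natural-log assumption, an estimate of 0.105 corresponds to a ratio of about 1.11, that is, a difference of roughly 11% in duration between the two verb types; the direction depends on how the contrast was coded, which is not specified here. The recomputed t ratios may differ from the reported ones in the third decimal, since the published estimates and standard errors are rounded.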
Abbreviations: df – degrees of freedom; F – focused; non-F – non-focused; SE – standard error
To sum up, the first study has shown that vowel height and syllable number, but not focus type, affect vowel duration, which means that vowel duration is a cue for stress but not for focus. The results of the second study are more complex, demonstrating that vowel height, syllable number, verb type and focus type all affect duration. A more in-depth look shows that duration is a cue for stress, but cues focus in imperatives only. We also observe considerable inter-speaker variation with respect to using duration as a cue for stress and/or focus; more on this in Section 5.
4.2 Intensity
4.2.1 First study (nominals)
The mean vowel intensity values obtained in the first study are summarized in Table 10. As these results show, final (stressed) and initial (unstressed) syllables have comparable intensity values. High vowels consistently have higher intensity in final syllables, in both focus types and syllable counts. The same cannot be said about mid or low vowels: they consistently have lower intensity values in the final syllables.
Abbreviations: dB – decibels; disyll – disyllabic; n – total number; SD – standard deviation; trisyll – trisyllabic
A mixed-effects model that provided the best fit included vowel height and syllable no. as interacting fixed effects, speaker and word as random effects, and by-speaker and by-word random slopes for syllable no. The interaction of fixed effects improved the model fit significantly, suggesting that the effect of vowel height on intensity varies by syllable position. The model is summarized in (12). Focus type did not significantly affect the intensity values, which suggests that intensity does not mark focus. According to a likelihood ratio test, the model is highly significant ($\chi^{2}(5) = 98.61$, $p < 0.001$), with a conditional $R^{2}$ of 0.592.
The intensity results, broken down by vowel height and syllable no. (the only significant fixed effects) are visualized in Figure 3. As it demonstrates, despite the significant results, stress does not systematically correspond to higher or lower intensity values in different vowel heights.
The output of the model is provided in Table 11. According to it, syllable number (which corresponds to stress in the first study) has a significant effect on vowel intensity. So does vowel height, and the interaction of syllable number and vowel height suggests that the effect of syllable number/stress on intensity varies by vowel height. Row [2] in Table 11 shows this effect for high vowels. An emmeans() calculation of the missing pairwise comparisons showed that this effect also holds for mid vowels (Estimate = –0.810, SE = 0.311, df = 25.6, t = –2.606, p < 0.05*) but not for low vowels (Estimate = –0.351, SE = 0.381, df = 49.1, t = –0.923, p = 0.36).
4.2.2 Second study (verbs)
Mean vowel intensity values for initial and final syllables in verbs, broken down by verb type, focus type, syllable count, and vowel height are provided in Table 12. As before, the grayed-out cells indicate the cells for which no vowels were attested. The mid vowels that were used in the final syllables in the high-vowel contexts are pooled with the other final mid vowels. Similarly to the picture for low and mid vowels in the first study, the intensity values in the final syllables are typically lower than in the initial syllables, across contexts. The difference between intensity values in the initial and final syllables is more pronounced in the imperatives. This is consistent with the overall tendency for intensity to fall throughout a word/prosodic constituent. In indicatives, this tendency is mitigated somewhat by the fact that final stress brings up the intensity values on the final vowel, leading to more levelled intensity values between the two syllables. In contrast, in imperatives, this tendency is more pronounced, because initial stress gives an extra intensity boost to the initial vowel. This picture is also consistent with paradigmatic cuing of stress.
A mixed-effects model that fit the data best included vowel height, syllable no. and verb type as fixed effects, speaker and word as random effects, and random slopes for syllable no. in both random effects. Possible interactions of fixed effects did not improve the model fit. The model is summarized in (13). As in the first study, focus type did not significantly affect the intensity values, which suggests that intensity does not mark focus. According to a likelihood ratio test, the model is highly significant ($\chi^{2}(4) = 120.7$, $p < 0.001$), with a conditional $R^{2}$ of 0.670.
The intensity results, broken down by the significant fixed effects (vowel height and syllable no.) and visualized separately for the two verb types (indicatives and imperatives) are presented in the two panels of Figure 4.
The output of the model is provided in Table 13. As it demonstrates, syllable number has an effect on intensity (smaller than the other factors, but still significant), and so do vowel height and verb type (both highly significant). The lack of interactions between the fixed effects suggests that they affect intensity in a uniform way.
To sum up the intensity results, we have seen that intensity consistently marks stress in both studies but does not mark focus.
4.3 Fundamental frequency (f0)
4.3.1 First study (nominals)
The mean f0 values per vowel, collected from the test words in the first study, are summarized in Table 14. Additionally, as an illustration, Figure 5 demonstrates average f0 contours per vowel (the figure is divided by syllable number rather than by focus type for an easier comparison of the effect of focus). As these results show, average f0 values are similar across contexts, with a slight fall from the initial to the final syllable being common. As Figure 5 shows, focus is often marked by a slight rise toward the end of the final vowel – but it is too subtle to be reflected in the mean values.
Abbreviations: disyll – disyllabic; f0 – fundamental frequency; Hz – Hertz; n – total number; trisyll – trisyllabic; SD – standard deviation
A mixed-effects model that provided the best fit for the data included vowel height and syllable no. as interacting fixed effects, speaker and word as random effects, and by-speaker and by-word random slopes for syllable no. The interaction of fixed effects significantly improved the model fit. The model is summarized in (14). Focus type did not significantly affect the mean f0 values (though, as Figure 5 demonstrates, the effect of focus may be reflected in the final rise, which is not captured by the model). A likelihood ratio test shows that the model is highly significant ($\chi^{2}(5) = 60.296$, $p < 0.001$), with a conditional $R^{2}$ of 0.588.
The mean f0 values per vowel, organized according to the significant fixed effects (vowel height and syllable no.), are shown in Figure 6.
The output of the model is provided in Table 15. It shows that syllable number by itself does not affect f0, while vowel height does. Additionally, there is a significant effect of the interaction between syllable number and vowel height, though only for initial low vowels (row [5]) and not for initial mid vowels (row [6]). The interaction between the fixed effects also allows for looking into whether there is a difference in f0 values between initial and final syllables for vowel heights other than high (row [2]). An emmeans() calculation of the missing pairwise comparisons shows that there is no significant difference in f0 values between initial and final syllables either for mid vowels (Estimate = –0.184, SE = 8.70, df = 7.58, t = –0.021, p = 0.984) or low vowels (Estimate = –6.824, SE = 8.81, df = 8.02, t = –0.774, p = 0.461), which is consistent with the result for high vowels.
4.3.2 Second study (verbs)
The mean f0 results per vowel that were obtained in the second study are provided in Table 16. Figure 7 additionally presents the averaged f0 contours over the vowels in all experimental contexts. As the results show, in both the focused and non-focused conditions, imperative verbs typically have higher f0 values than their indicative counterparts. Interestingly, this holds both for initial syllables and final syllables. Within each verb type, focused verbs also typically have higher overall f0 values than their unfocused counterparts. The drop in f0 between the initial and final syllables is steeper in the imperatives than in the indicatives.
A mixed-effects model that fit the data best included vowel height, verb type, and focus type as fixed effects, and speaker and word as random effects. No random slopes were included (adding them to the model led to non-convergence). Interestingly, syllable no. did not turn out to have a significant effect on f0, in contrast with the other acoustic measures discussed so far. Including interactions of the fixed effects did not improve the model fit. The model is summarized in (15). A likelihood ratio test showed that the model is highly significant ($\chi^{2}(4) = 138.24$, $p < 0.001$), with a conditional $R^{2}$ of 0.410.
The distribution of the mean f0 values, organized by the significant fixed effects vowel height and focus type, and shown separately for the two verb types, is illustrated in the two panels of Figure 8. Note that because syllable no. was not a significant factor, initial and final vowels are lumped together in Figure 8.
The output of the model is provided in Table 17. As it demonstrates, each of vowel height, verb type and focus type has a highly significant effect on f0. Lack of interactions between the fixed effects suggests that they affect f0 in a uniform way.
Let us sum up the f0 results. The first study shows that, in a set of data with uniformly final stress, f0 is used to cue syllable number/stress but not focus. The second study shows that, in a dataset that contains stress-based minimal pairs, f0 is used to cue both the verb type – imperative (i.e., with initial stress) versus indicative (i.e., with final stress) – and presence versus absence of focus. Interestingly, f0 is not used to differentiate syllable position (initial versus final), which suggests that a given verb type exhibits characteristic f0 values that differentiate it from verbs of the opposite type on both the final and initial syllables. As was the case with duration, we also observe wide inter-speaker variation in the f0 contour utilized; more on this in Section 5.
Abbreviations: disyll – disyllabic; f0 – fundamental frequency; Hz – Hertz; n – total number; SD – standard deviation; trisyll – trisyllabic
4.4 Vowel height (F1)
4.4.1 First study (nominals)
The mean F1 values per vowel in the first study are summarized in Table 18. Additionally, as an illustration, Figure 9 demonstrates both F1 and F2 parameters of the vowels. As these results show, the F1 values are typically higher in the final syllables than in the initial syllables and tend to be lower in the non-F condition as compared to the F condition.
A mixed-effects model that fit the data best included syllable no., vowel height and focus type as fixed effects, with vowel height and focus type interacting, speaker and word as random effects, and random slopes for syllable no. in both random effects. The interaction of two of the three fixed effects significantly improved the model fit. The model is summarized in (16). A likelihood ratio test shows that the model is highly significant ($\chi^{2}(6) = 357.96$, $p < 0.001$), with a conditional $R^{2}$ of 0.906.
Abbreviations: disyll – disyllabic; n – total number; F1 – first formant; Hz – Hertz; SD – standard deviation; trisyll – trisyllabic
The F1 results, organized according to the significant fixed factors (focus type, syllable no., and vowel height), are visualized in Figure 10. They demonstrate the same overall tendencies as those shown in Table 18: both stress and focus are associated with higher F1 values.
The output of the model is provided in Table 19. It shows that the F1 values are affected by syllable number and focus type (in addition to vowel quality, which is directly tied to differences in F1). The interaction between vowel height and focus type also allows for looking into whether the significant difference in F1 between the two focus contexts holds for vowels of all heights. Row [3] in Table 19 shows that it does for high vowels, and an emmeans() calculation of the missing pairwise comparisons shows that the same is true for low vowels (Estimate = –27.07, SE = 4.89, df = 1756, t = –5.532, p < 0.001***) but not for mid vowels (Estimate = 2.54, SE = 3.77, df = 1761, t = 0.674, p = 0.5).
4.4.2 Second study (verbs)
Table 20 shows the mean F1 values for the vowels in the second study. As with the F1 results in the first study, Figure 11 also provides the F1 by F2 distribution for the vowels in the second study, divided by verb type and focus type. Similarly to the first study, there is a tendency for stressed syllables (initial in imperatives, final in indicatives) to have higher F1 values, and for F1 values in the F condition to be higher than in the non-F condition.
Abbreviations: disyll – disyllabic; F1 – first formant; Hz – Hertz; n – total number; SD – standard deviation; trisyll – trisyllabic
A mixed-effects model that provided the best fit for the data consisted of vowel height and verb type as fixed effects, and speaker and word as random effects. Including random slopes led to non-convergence of the model. Including an interaction of the fixed effects did not improve the model fit. Interestingly, neither focus type nor syllable no. turned out to be significant factors. The model is summarized in (17). A likelihood ratio test showed that the model is highly significant ($\chi^{2}(3) = 1286.1$, $p < 0.001$), with a conditional $R^{2}$ of 0.878.
Figure 12 visualizes the distribution of the F1 data, shown separately for the two verb types, and organized according to the only remaining significant fixed factor, vowel height.
The output of the model is provided in Table 21. As it shows, there is a highly significant effect of vowel height, which is expected, given the intrinsic connection between F1 and vowel height. It also shows that there is a systematic difference between the two verb types, but not between the two syllable positions. This is probably due to morphological reasons, though: the set of stressed vowels in indicatives (/a, e/) corresponds to the unstressed vowels in imperatives, and vice versa. Because not all vowels are represented in each set, a difference is detected between the verb types as a whole but not between individual syllable positions. The lack of interaction between the fixed effects does not allow for looking into whether this is true of all vowel heights.
To sum up the F1 results, in the first study, F1 was shown to systematically differ for vowels of different height, but also syllable number and focus type. In the second study, F1 was not involved in focus marking, and instead only differed for vowels of different height and for different verb types.
5 Discussion
5.1 General
To recap, the goal of the two studies reported here was to investigate the acoustic expression of stress and focus in Udmurt, using a predetermined inventory of acoustic cues (duration, intensity, f0, F1), in the context of fixed and contrastive stress. Our results show that different acoustic cues may be involved in marking both stress and focus. The most systematic behavior among the cues that we surveyed is exhibited by intensity: it was shown to consistently mark stress, in both studies, but was not involved in marking focus. The behavior of duration is more complex: in the first study, it marked stress but not focus; in the second study, it differentiated both verb types and syllable numbers, as well as focus types. Similarly to duration, f0 in the first study cued stress but not focus, while in the second study it was shown to be a significant predictor for both focus and verb type. Finally, F1 in the first study cued both stress and focus, but only different verb types in the second study. The results of both studies are summarized in Table 22. Overall, our results show that all four acoustic cues systematically participate in stress marking, while focus is expressed by fewer cues, which also differ from study to study.
Abbreviations: f0 – fundamental frequency; F1 – first formant
While studies that investigate stress cues while also controlling for the focus structure of the utterance, as recommended by Roettger & Gordon (2017), are still relatively few, our results can be compared to some of those obtained for other languages. Suomi et al. (2001) show that, in Finnish, (contrastive) focus is marked both by f0 and duration, while the position of stress is not marked by f0 (the role of duration as a cue for stress is not discussed in detail). Similar results, with duration cuing stress and f0 being reserved for intonational prominence, were obtained for Georgian (Borise 2023). Finally, in a study targeting four languages (Hungarian, Turkish, Greek and Spanish), Vogel et al. (2016) highlight the cross-linguistic variability in stress- and focus-marking. Among other results, they show that, in Spanish and Greek, f0 is the main cue for word stress, while duration and intensity, respectively, acted as important cues for focus in the two languages – in contrast with the results for Finnish and Georgian. For Hungarian, a language with contrastive vowel length, they show that duration is not reliably used to cue stress or focus – both are expressed mainly with f0. Further work within this methodology should help uncover more reliable cross-linguistic and language-specific tendencies.
5.2 Interspeaker variation
As noted in Sections 4.1 (duration) and 4.3 (f0), we found that individual participants used the acoustic cues that we investigated differently, and also, in some cases, used them differently between the two studies. The small sample size does not allow us to identify these differences as merely idiolectal or representative of a particular variety of Udmurt, but we hope that highlighting them here can be instructive for future work on the prosodic phonology of Udmurt. For example, Table 23 shows that Sp1 uses a much greater increase in duration to mark stress than all other speakers, and does so consistently between the two studies, whereas, e.g., Sp4 does not consistently use duration in the first study, and Sp5 does not in the second study.
Similarly, Table 24 shows that there is also considerable variation in the use of f0. For instance, Sp2 and Sp3 do not vary f0 between the stressed and unstressed syllables in the first study, and Sp1 and Sp4 do not use f0 to make any contrasts (indicatives vs. imperatives, F vs. non-F, on corresponding syllables) in either study, with all differences being below 10 Hz. Table 24 also presents evidence for qualitative differences in the use of f0 between the speakers. It shows that in the first study, Sp1 and Sp4 used falling f0 contours, while Sp5 and Sp6 used rising ones. In the second study, we see that Sp2 uses a rising contour on the imperatives while not varying f0 on the indicatives, Sp5 continues to use a rising contour in most contexts in the second study, and Sp6 uses a rising contour with the indicatives and a falling one with the imperatives (i.e., aligns the stressed syllable with higher f0).
Abbreviations: F – focused; Non-F – non-focused; Sp – speaker
Some of the attested dimensions of variation are illustrated in Figure 13. Panel (a), an indicative verb produced by Sp4, demonstrates the increased duration of the stressed (final) vowel, and a falling f0 contour; in panel (b), the effect of duration is even more apparent on the stressed (initial) vowel of an imperative verb, produced by the same speaker. The stressed syllable is also aligned with a low tone; there may be a leading high tone on the preceding pronoun. The overall magnitude of f0 movement is quite small. Panels (c) and (d) provide the realizations of an indicative and imperative, respectively, by Sp5. Here, the stressed vowels are aligned with a high tonal target, with the rise on the stressed vowel and the peak reached on the following syllable. There is little evidence for greater duration of the stressed vowel. The magnitude of f0 movement is also much larger, with the rise covering more than 100 Hz. The f0 scale is kept constant in panels (a–d) to allow for an easier cross-speaker comparison.
The availability of this variation with respect to acoustic cues used to mark stress and focus raises non-trivial questions about the processing and perception of stress and the nature of the phonetics-phonology interface. It aligns with the available neurolinguistic evidence suggesting that speakers expect varying individual acoustic cues to be utilized in marking stress in a single language (Honbolygó & Csépe 2011). It also provides support for the view that phonetic evidence may not provide straightforward one-dimensional physical corroboration for phonological concepts like stress (Keating 1996).
5.3 Autosegmental-Metrical interpretation of the f0 contours
No Autosegmental-Metrical account of intonation in Udmurt has been developed so far, which means that we can only offer a tentative interpretation of the attested f0 contours associated with stress in terms of individual tonal targets. Due to the scarcity of evidence, we refrain from addressing other issues of Udmurt intonational phonology at this time (e.g., boundary tones, phrasing patterns, etc.).
As Figure 7 shows for f0 movements in the second study, the initial stressed syllable in imperatives is associated with a rise in f0 that is mostly confined to the stressed syllable, with the peak reached towards its end, and a gradual fall throughout the rest of the word. This is likely due to the availability of the H* pitch accent in Udmurt. Final stress, as shown in both Figure 5 for the first study and Figure 7 for the second study, is associated with a drop or drop and rise in f0. This suggests a pitch accent with an L component, like L* or L*+H. As panels (a) and (b) in Figure 13 show, there may also be evidence for a high leading tone accompanying the low pitch accent, H+L*. As panels (c) and (d) of Figure 13 show, the pitch accent may also be realized as a steep rise on the stressed vowel, with the peak reached on the post-tonic syllable, preceded by a stretch of lower f0 values. It may be analyzable as H*, or L*+H. Whether the emerging inventory of H*, L*, H+L* and L*+H pitch accents is substantiated for Udmurt should be explored in future research.
6 Conclusion
Our results show that all four acoustic parameters surveyed in the paper – duration, intensity, f0 and F1 – participate in stress marking in Udmurt. The results for focus marking vary by study and demonstrate that all cues except for intensity may be involved in focus marking. As expected, we also found that vowel height leads to significant differences in all acoustic cues, but, somewhat surprisingly, we found that number of syllables was not a significant factor in the presence of the random effect of word. The wide interspeaker variation demonstrates that the averaged results may present a somewhat simplified picture, while individual speakers may rely more heavily on a subset of the acoustic cues to mark stress and/or focus. Finally, we offer a tentative Autosegmental-Metrical interpretation of our f0 results; a full account of Udmurt intonation awaits further research.
Acknowledgments
We thank the Udmurt native speaker consultants who participated in our study – тау бадӟым! – as well as our research assistants: Gergő Turi, Bernadett Dam, and Péter Hatvani. For feedback at various stages of this project, we are grateful to Erika Ásztalos, Marcel den Dikken, Balázs Surányi, and the audiences at the Speech Research/Beszédkutatás conference 2020, the workshop on the languages of Volga-Kama Sprachbund 5, GLOW 44, TAI 1, and the workshop on lexical and fixed word stress at SLE 55. We also gratefully acknowledge Stefan Th. Gries and Katalin Mády’s advice on statistical analysis, and insight from Joseph A. Stanley (via his blog) and Jacob T. Blaskovits on data visualization. All remaining errors are ours.
Funding information
This research was supported by grants NKFIH KKP 129921 and NKFIH K 135958 of the National Research, Development, and Innovation Office of Hungary, and has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 101109402.
Appendix A Abbreviations
The glosses follow the Leipzig Glossing Rules (http://www.eva.mpg.de/lingua/resources/glossing-rules.php), with some additions:
1—first person
2—second person
3—third person
acc—accusative
adjz—adjectivizer
freq—frequentative
fut—future
imp—imperative
neg—negation
pl—plural
prs—present
pst—past
ptcp—participle
sg—singular
vbz—verbalizer
Appendix B Materials used in the first study
All items used in the first study are provided below, accompanied by English glosses. In morphologically segmentable items, morphemes are marked with a hyphen. The notation ‘(–)’ indicates that a word is not fully transparent morphologically. This is the case for postpositions containing the illative case /–e/, and nouns formed with non-productive nominalizers /–ʎi/ and /–ri/; in these instances, only a lexical translation is provided. Stress is indicated here for presentational purposes, but it was not marked in the experimental materials (see Section 3.1).
Appendix C Materials used in the second study
All items used in the second study are provided below, accompanied by English glosses. In morphologically segmentable items, morphemes are marked with a hyphen. The relevant suffixes are verbalizers /–(j)a/ and /–om/, the frequentative marker /–l/, as well as the present tense third person singular marker /–e/ and the second person plural imperative marker /–e/ (both found in Conjugation I verbs). The notation ‘(–)’ indicates that the combination of the stem and the verbalizer is not fully transparent morphologically; in these cases, only the translation of the verb is given. We list both the indicative and the imperative verbs that form minimal pairs, together with their glosses. Stress is marked in the table for presentational purposes; it was not indicated in the experimental materials (see Section 3.1).