Introduction
Unlike many written languages, where words are separated by spaces, spoken communication is delivered in continuous utterances with only occasional pauses and no clear demarcation of words (Cole & Jakimik, 1980). Yet adults are usually able to segment speech effortlessly, without even realising that no such markings exist. They are assisted in part by their well-developed lexicons, which they use to identify familiar words in the speech stream. Children, on the other hand, are born with no lexicon to consult, yet by the age of six months they are already capable of segmenting the speech stream into words and phrasal units (Jusczyk, 1999).
The question of how children are able to learn to segment speech and bootstrap their lexicons is the word segmentation problem. In the 1980s and 1990s there was a renewed interest in examining the statistical properties of language, and in particular how these may impact the understanding of language acquisition and comprehension (Christiansen et al., 1998). Psycholinguistic studies from this time found that children use statistical properties of language to help solve the word segmentation problem. Such properties include lexical stress (Cutler & Carter, 1987; Cutler & Mehler, 1993; Jusczyk, Cutler, et al., 1993), phonotactics (Jusczyk, Friederici, et al., 1993; Mattys et al., 1999; Mattys & Jusczyk, 2001), predictability statistics (Saffran, Aslin, et al., 1996a; Saffran, Newport, et al., 1996; Thiessen & Saffran, 2003), allophonic differences (Jusczyk, Hohne, et al., 1999), coarticulation (Johnson & Jusczyk, 2001), vowel harmony (Suomi et al., 1997) and prosody (Cooper & Paccia-Cooper, 1980; Gleitman et al., 1988).
Interest in the segmentation problem, combined with the evidence provided by these psycholinguistic studies, has led to the design of a variety of computational models for an abstract version of the task. In the established paradigm, utterances are represented symbolically as strings of phones or phonemes without word boundaries, and models have the task of finding these boundaries without supervision. Besides offering insight into the segmentation problem, such models have also developed into successful algorithms for segmenting written text in languages where word boundaries are not marked (Feng et al., 2004; Sproat & Shih, 1990).
In this study, we compare two approaches taken by such computational models: the boundary-finding approach and the language modelling approach. The boundary-finding approach considers statistical information present around each inter-phoneme position to make local boundary decisions, often operating phoneme-by-phoneme. Language modelling methods operate utterance-by-utterance, calculating the most-likely segmentation of each, based on lexical recognition. We re-implement the top-performing models for these two approaches, a language modelling approach known as PHOCUS (Blanchard et al., 2010; Venkataraman, 2001) and a boundary-finding model known as MULTICUE (Çöltekin, 2017; Çöltekin & Nerbonne, 2014), both of which achieve similar scores on the child-directed utterances in the English BR corpus, the de facto standard for evaluating segmentation models (Brent, 1999). Word segmentation models are typically trained on speech corpora directed at children aged less than two years because early indicators of infants’ word segmentation abilities are well-attested in the first year (Bergelson & Swingley, 2012; Johnson & Jusczyk, 2001; Saffran, Aslin, et al., 1996a) and most children are regularly producing multi-word utterances by age two. Therefore, child-directed utterances in the first two years are held to be both crucial and sufficient for children learning to segment words.
Through the comparison of these two approaches, we observe that the boundary-finding methods can combine information from multiple sub-lexical cues, but cannot make decisions based on the placement of other boundaries. We also find that the language modelling methods can make decisions based on the placement of other boundaries, but cannot combine information from multiple sub-lexical cues.
The aim of our research is to investigate whether giving a statistical model access to both lexical and sub-lexical cues improves its ability to segment words across a range of languages. By considering cues that are accessible to children and allowing the model to utilise whichever cues it deems valuable, an increased ability to segment words would suggest that both types of cue provide complementary information useful for word segmentation. This finding would contribute to our understanding of language acquisition by highlighting the extensive knowledge that statistical methods can acquire from the linguistic signal alone, leading to further inquiry into the additional linguistic knowledge that can be learned jointly with or subsequent to word segmentation.
In this study, we develop the DYnamic programming MULTIple-cue (DYMULTI) framework for modelling word segmentation. This framework combines the strengths of both the boundary-finding and language modelling approaches and allows for the consideration of sub-lexical and lexical cues, achieving higher F1-scores on the BR corpus than any previous model that uses the same constraints. We also undertake novel cross-lingual evaluation of these models, finding that our model outperforms PHOCUS and MULTICUE on 15 of 26 languages, confirming its validity as a computational model for infant word segmentation. In doing so, we also find that there may be previous research bias towards performance on English corpora. The contributions of our paper are as follows:
• We give a thorough review of the word segmentation problem and the previous psycholinguistic and computational modelling studies that have investigated it. We introduce the DYMULTI framework for segmentation, which achieves the highest F1-scores to date.
• We perform a thorough and robust benchmarking of segmentation models, comparing the PHOCUS, MULTICUE and DYMULTI frameworks. This includes an investigation into the effect of utterance order, a comparison of learning rates across models, and cross-lingual evaluation across 26 languages.
• We release our implementations of PHOCUS, MULTICUE and DYMULTI as an open-sourced repository for reproducibility and future research.¹
Background
In this section, we give the psycholinguistic background to the word segmentation problem. We then discuss the computational models that have been designed to explore it, detailing the boundary-finding and language modelling approaches.
Cues for segmentation
Despite the lack of consistent acoustic gaps between spoken words, adults are able to segment the speech stream into linguistically significant units and therefore access their meaning, a process called segmentation. Early models of speech processing declared segmentation to be a by-product of lexical identification (Cole & Jakimik, 1980; Marslen-Wilson & Welsh, 1978), later described as “serendipitous” or “interactionist” segmentation models (Cairns et al., 1997; Cutler, 1996). These models identify words in the speech stream by matching them against the listener’s lexicon, either processing the utterance in a strictly temporal order, as in the COHORT model of Marslen-Wilson and Welsh (1978), or by using the activation of competing lexical items to cut up the input, as in the TRACE model of McClelland and Elman (1986). These models can make use of sub-lexical cues, such as adults’ sensitivity to phonotactic information, to make judgements about possible words (Greenberg & Jenkins, 1966), but are primarily driven by the lexicon.
Another view of speech processing is that segmentation occurs purely on the basis of information in the speech signal, without making use of any lexical influences. Cutler (1996) calls these “explicit” segmentation models, and multiple studies have found that adults can segment using purely low-level information. Saffran, Newport, et al. (1996), for example, found that within 20 minutes of exposure to an artificial language, adults are able to use phonotactic information to tell non-words apart from words in a speech stream. Such studies do not refute interactionist accounts, as these can still incorporate low-level information, but they do provide evidence that adult segmentation is not fully driven by lexical recognition.
When it comes to infants, there is evidence that lexical recognition is used to solve the segmentation problem, supporting the interactionist view. Six-month-olds learn new words from utterances containing familiar names (Bortfeld et al., 2005). French eight-month-olds use function words such as des and mes for segmentation (Shi & Lepage, 2008) and infants at this stage can even make semantic associations with nouns (Bergelson & Swingley, 2012). It is clear that infants are able to recall familiar sound patterns and use them weeks later for segmentation (Jusczyk & Hohne, 1997).
The problem with a model of speech segmentation that only considers lexical recognition lies in explaining how these familiar words are acquired in the first place; infants cannot have any innate assumptions about rhythmic and phonological regularities as these vary between languages (Cutler & Carter, 1987). One hypothesis is that these proto-lexicons are initially populated with single words spoken in isolation (Suomi, 1993). Indeed, in English Parentese (the particular register and style used by caregivers when talking to children), about one-tenth of utterances consist of isolated words (Brent & Siskind, 2001). The issue with this hypothesis is that there is no universal heuristic for identifying single-word utterances and many words will never occur in isolation. Brent and Siskind (2001) claim that if entire multisyllabic utterances are initially added to the lexicon, lexical recognition alone could be sufficient for bootstrapping the lexicon. This claim is supported by a more recent study that found that the proto-lexicon of eleven-month-old French-learning infants contains both words and non-words (Ngon et al., 2013).
On the other hand, there is substantial empirical evidence to suggest that infants use a wide variety of sub-lexical cues to solve the initial segmentation problem. Many of these are based on the simple principle that predictability within lexical units is high, and predictability between lexical units is low (Harris, 1955). It did not become clear that infants are able to use this principle for segmentation until the influential studies of Saffran, Aslin, et al. (1996a, 1996b) and Saffran, Newport, et al. (1996). Following their study in adults, they found that infants as young as eight months calculate the transitional probabilities of adjacent syllables A and B, defined as
$$ \mathrm{TP}\left(A\to B\right)=\frac{\Pr (AB)}{\Pr (A)} $$

where $ \Pr (AB) $ is the estimated probability of the syllable pair (calculated as the relative frequency) and $ \Pr (A) $ is the estimated probability of the syllable $ A $ , and use these to place word boundaries when the transitional probability is low (Aslin et al., 1998; Saffran, Aslin, et al., 1996a, 1996b).
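For concreteness, this statistic can be computed directly from counts. The sketch below (our own Python illustration, not code from the original studies) assumes utterances arrive as lists of syllables; under relative-frequency estimates, the ratio $ \Pr (AB)/\Pr (A) $ reduces to a simple count ratio. A learner would then hypothesise a boundary wherever this value dips relative to its neighbours.

```python
from collections import Counter

def transitional_probabilities(utterances):
    """Relative-frequency estimate of TP(A -> B) = Pr(AB) / Pr(A),
    where each utterance is a list of syllable strings."""
    pair_counts, syllable_counts = Counter(), Counter()
    for syllables in utterances:
        syllable_counts.update(syllables)
        pair_counts.update(zip(syllables, syllables[1:]))
    # Under relative frequencies, Pr(AB) / Pr(A) reduces to count(AB) / count(A).
    return {(a, b): n / syllable_counts[a] for (a, b), n in pair_counts.items()}
```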
These probabilities are also gathered at lower levels. At the phoneme level, for instance, differences in probabilities between within-word and across-word consonant clusters are used to segment novel phrases such as fang tine, as the pair of phones [ŋt] does not occur within English words (Mattys & Jusczyk, 2001). At the lowest level, seven-and-a-half-month-old infants use their knowledge of allophonic variations to segment utterances, such as the variants of /t/ and /r/ that distinguish nitrate and night rate (Jusczyk, Hohne, et al., 1999).
Infants also seem to be sensitive to prosodic cues: those as young as 7.5 months learn to use the predictable strong-weak stress pattern in English (as in BAby) for segmentation (Cutler & Mehler, 1993; Jusczyk, Cutler, et al., 1993; Jusczyk, Houston, et al., 1999). While statistical cues may precede stress cues in their use (Thiessen & Saffran, 2003), stress and coarticulation cues are weighed more heavily by infants once adopted (Johnson & Jusczyk, 2001). Stress alone is unlikely to be a universal cue for segmentation, as it is unclear whether all languages even provide reliable prosodic cues (Saffran, Newport, et al., 1996). Indeed, it has generally been accepted that no single cue is solely responsible for solving the segmentation problem and that a complete model for explicit segmentation must consider information from multiple cues (Blanchard et al., 2010; Christiansen et al., 1998; Çöltekin & Nerbonne, 2014; Jusczyk, 1999).
Taking these accounts together, it is unclear whether initial segmentation in infants is purely explicit, or whether a combination of lexical and sub-lexical information is used. There are many overlapping and competing cues in these studies, so it is difficult to justify one view over the other. For example, segmentation around familiar words could be a result of phonotactic regularity rather than lexical recognition. This motivates the development of computational models, which allow such hypotheses to be tested in isolation and thereby bring us closer to solving the word segmentation problem. In particular, the DYMULTI framework developed in this study lets us test whether sub-lexical and lexical cues are alternative or complementary explanations for speech segmentation.
Segmentation models
Computational models for studying the segmentation problem are often designed to study one of two questions: (a) how statistical information can be used to segment speech, and (b) what computational problem is being solved.
These are often discussed using terminology from Marr’s computational theory of vision (Marr, Reference Marr1982): the first question operates at Marr’s algorithmic level, focusing on the algorithm, and the second operates at Marr’s computational level, focusing on the problem being solved.
Algorithmic-level studies are concerned with the implementation of algorithms that incorporate cognitively plausible mechanisms for the segmentation problem. These models propose efficient algorithms that follow three constraints:
1. They must start with no knowledge of the target language.
2. They must learn unsupervised.
3. They must operate incrementally.
The first constraint follows from the fact that all languages have different phonotactic constraints and vocabularies, yet children can learn any of them. The second is established because children are not always explicitly given the boundaries between words, so computational models should not be given them either. The third follows from the fact that we process speech as it is heard, not in batches sometime later.
Numerous models have been proposed based on these constraints, taking a wide variety of approaches. Two broad categories stand out: boundary-finding methods and language modelling methods. These are somewhat related to the explicit and interactionist views of speech processing respectively, although top-performing models make use of both lexical and sub-lexical cues. Investigating these two approaches is the focus of this study. Note, however, that these models tend to operate on phonemic transcripts and so do assume some knowledge of the target language, not quite meeting the first constraint. This is still preferable to using orthographic transcriptions, however, as phonemic forms are at least representations of the sound signal.
By contrast, computational-level studies are concerned with defining the goal of segmentation and the logic of the strategy used to meet that goal. As the focus of these studies is not the algorithm, the models developed need not meet the three constraints. An example is the probabilistic model of Goldwater et al. (2009), who find that the assumption that words are statistically independent units leads to under-segmentation by an ideal learner. As a computational-level study, their algorithm does not have to be cognitively plausible. It operates in batches over the corpus, using a hierarchical Dirichlet Process bi-gram model estimated using a Gibbs sampling algorithm. Besides the batch processing violating the third constraint for an algorithmic-level model, the computation time is also over 2000 times longer than that of most algorithmic-level models when presented with the same amount of data (Fleck, 2008). In this study, we do not work directly on these computational-level models, although algorithmic-level models often offer insight at the computational level.
Boundary-finding methods for segmentation
Boundary-finding methods for segmentation relate to the explicit view of speech processing, that segmentation is driven by local information at each inter-phoneme position rather than lexical recognition. Models that use these methods follow directly from experimental studies. For example, Saksida et al. (2017) follow the findings of Saffran, Aslin, et al. (1996a), showing that children segment utterances at low points of transitional probability. Their unsupervised algorithm places boundaries between a syllable pair when the transitional probability of the syllable pair is lower than that of the two neighbouring pairs. Using syllables as the basic unit of segmentation is widely debated (Çöltekin, 2011); it also yields a high-performing baseline, since the vast majority of child-directed English words are monosyllabic (Gambell & Yang, 2006).
Earlier studies made use of connectionist models for infant segmentation, as was the trend for investigating many cognitive phenomena at the time (Christiansen et al., 1998; Elman, 1990). As cognitively-plausible models for segmentation need to be unsupervised, these models could not be trained to predict word boundaries directly. Instead, they were often trained on an alternative task. Elman (1990) trained a recurrent neural network to predict phonemes, finding that relatively high error in prediction could indicate word boundaries. Cairns et al. (1994) found that peaks in the error score could also indicate word boundaries. Finally, Christiansen et al. (1998) developed a recurrent neural network to predict utterance boundaries, phonemes and lexical stress information in an utterance, finding that the prediction of an utterance boundary was a good indicator of a word boundary. This model allowed them to test these different cues together and in isolation, finding that the best performance was achieved when all three cues were combined.
Inspired by this model, Çöltekin and Nerbonne (2014) developed an explicit model for segmentation, arguing that it is difficult to interpret what connectionist models learn. Their model uses statistical information at each inter-phoneme position, as with transitional probability models, but extends this by introducing a cue-combination method to combine statistical information from multiple sources, achieving far better performance than the connectionist models. This is the boundary-finding approach that we re-implement in this study.
Language modelling methods for segmentation
Language modelling methods are based on the interactionist view of speech processing, that segmentation and lexical recognition occur serendipitously, driven by lexical knowledge. These models typically build word $ n $ -gram models and use statistical criteria to define the best segmentation of an utterance, bootstrapping a lexicon that is then used for further segmentation.
Brent (1999) and Venkataraman (2001) both developed probabilistic language models and used dynamic programming to infer the best segmentation. In Venkataraman’s model, the probability of a segmented utterance is given as the joint probability of all words in that utterance. The Viterbi algorithm is then used to find the segmentation that maximises an estimate of this probability. Word probabilities are approximated by $ n $ -grams, with a back-off procedure to lower-order $ n $ -grams. As the model processes more utterances, these probability estimates are refined and more words are added to the lexicon, further improving the model. These models produced state-of-the-art results unmatched by boundary-finding methods until later work (Çöltekin, 2017; Çöltekin & Nerbonne, 2014; Fleck, 2008).
It is worth noting that none of the successful language modelling methods rely only on lexical recognition. A model that only matches utterances with previously-seen utterances will fail, as “short, frequently occurring utterances are likely to be segmented within larger word-level chunks resulting in an over-segmentation of words into their segmental phonology” (Monaghan & Christiansen, 2010). For instance, if no has been added to the lexicon, then note could later be segmented as no and te, followed by increasingly smaller segmentations. As such, these models often gather sub-lexical statistics or make sub-lexical assumptions to prevent over-segmentation. The PUDDLE model, for example, uses word-initial and word-final phoneme clusters derived from its lexicon to restrict segmentation (Monaghan & Christiansen, 2010). The model of Venkataraman (2001) incorporates phoneme-level statistics to estimate the probability of unseen words. Finally, the model of Blanchard et al. (2010) extends Venkataraman’s model with an additional constraint that all segmented words must contain a syllabic nucleus. It is these latter two models that we re-implement and compare in this study.
Segmenting from raw speech
In this article, we focus on the abstract version of the word segmentation task, with utterances consisting of sequences of discrete, symbolic phonemes. Although this has remained an established paradigm for the study of word segmentation, in recent years the speech research community has made great advances in the area of zero-resource speech processing. These studies aim to develop unsupervised methods that learn from raw speech audio only, pioneered in recent years by the Zero Resource Speech Challenge (ZRC) series (Dunbar et al., 2022).
One of the four tasks presented by the ZRC series is Spoken Term Discovery, the text-less counterpart to word segmentation. The general approach proposed by ZRC is to first match speech fragments consisting of the same sequence of phonemes (the matching sub-task), then build a lexicon of word types (the lexicon discovery sub-task) and finally use these to find word boundaries (the word segmentation sub-task). “Match-first” systems focus first on the matching problem, placing boundaries at the edges of discovered segments (Räsänen & Blandón, 2020). “Segmentation-first” systems prioritise the discovery of boundaries – for instance, by jointly optimising segmentation and building clustered word embeddings using Bayesian modelling (Kamper et al., 2017) or by using self-expressive autoencoders to build a segmentation from matched learned acoustic units (Bhati et al., 2020). The most recent approaches do not even attempt to build a lexicon of types, either using a Bayesian approach directly on learned tokens (Algayres et al., 2022) or by using peaks in surprisal across sequences of learned units (Kamper, 2023), similarly to traditional text-based boundary-finding methods for segmentation.
In this study, we were motivated by the availability of phonemic transcripts for 26 languages provided by Caines et al. (2019) to carry out novel cross-lingual analysis of word segmentation models. As the original audio recordings have not been made available for the majority of these transcriptions, we were unable to consider the models developed for the ZRC series. However, many of the models presented in this study could operate on raw audio; their only requirement is that the units are discrete, so the input could easily be replaced with features derived from speech frame units. We discuss these ideas further at the end of the article.
Summary
Experimental psycholinguistic studies provide evidence that infants use sub-lexical statistical and speech cues for solving the segmentation problem, supporting the explicit view of speech processing. Other studies find that infants make use of lexical knowledge, supporting the interactionist view. To study the problem in a controlled manner, computational models have been designed to solve an abstract version of the problem, where continuous speech is represented as a series of symbolic phonemes. These models either explore what problem is being solved or present cognitively plausible algorithms for solving the problem. To be cognitively plausible, these algorithms must segment incrementally, start with no knowledge of the target language and learn unsupervised. Boundary-finding algorithms correspond to the explicit view of speech processing, and language modelling algorithms correspond to the interactionist view.
Implementation of Segmentation Models
In this section, we present our re-implementation of the state-of-the-art models for the boundary-finding approach of Çöltekin and Nerbonne (2014) and the language modelling approach of Venkataraman (2001) and its extension presented by Blanchard et al. (2010). We discuss the benefits and drawbacks of these two approaches and produce a new model that combines their strengths.
Çöltekin and Nerbonne’s multiple-cue boundary-finding model
The model presented by Çöltekin & Nerbonne (2014) iterates through utterances phoneme-by-phoneme, placing boundaries by combining votes from a set of indicators based on a variety of cues. It is explicit in nature, although it does use statistical cues derived from the lexicon. We refer to this model as MULTICUE.
Cue combination algorithm
The core strength of MULTICUE lies in its cue combination algorithm, which allows for the consideration of an arbitrary number of psychologically-motivated boundary indicators. All of the cues are language-independent and the task of the algorithm is to determine how to use them without any supervision. As no single cue is solely responsible for the placement of word boundaries, this allows for a more comprehensive model for explicit segmentation.
Each boundary indicator labels every inter-phoneme position as either ‘boundary’ or ‘word internal’. The model then makes a final decision based on a variation of the weighted majority voting algorithm (Littlestone & Warmuth, 1994). In Çöltekin (2017), the following condition for deciding on the ‘boundary’ label is given:
$$ \sum \limits_{i=1}^K{w}_i{1}_i>\frac{K}{2} $$

where $ K $ is the number of boundary indicators, $ {w}_i $ is the weight and $ {1}_i $ gives the boundary decision for indicator $ i $ , equal to 1 for ‘boundary’ and 0 for ‘word-internal’. This is a conservative threshold, relying heavily on informed indicators, and it requires the weights to average above 0.5. For instance, if all $ K $ boundary indicators were assigned a weight of 0.5, then even a unanimous vote for a boundary would only reach $ \sum \limits_i^K0.5=\frac{K}{2} $ , so the model could never place a boundary.
Instead, our implementation places a boundary if the weighted vote for the ‘boundary’ label is greater than the weighted vote for the ‘word-internal’ label:

$$ \sum \limits_{i=1}^K{w}_i{1}_i>\sum \limits_{i=1}^K{w}_i\left(1-{1}_i\right) $$

Noticing that the two sides sum to $ {\sum}_i^K{w}_i $ , this condition can be rewritten so that a boundary is placed whenever the normalised weighted vote exceeds $ 0.5 $ :

$$ \frac{\sum \limits_{i=1}^K{w}_i{1}_i}{\sum \limits_{i=1}^K{w}_i}>\frac{1}{2}\kern2em (1) $$
We discussed this with Çöltekin and he agreed that this boundary decision formulation is better, representing a more general case where we make no assumptions about the weights. He also noted that this makes the model more robust to bad indicators, but may favour recall over precision — recall indicating the retrieval of true boundaries and precision indicating how accurate a model’s predicted boundaries are.
The majority-vote algorithm is a common and effective method for combining multiple classifiers (Narasimhamurthy, 2005). In this case, the weighted majority-vote is used so that votes from boundary indicators that make fewer errors have larger weights. As the model must be unsupervised, the ground-truth boundary locations cannot be used to update the weights. Instead, an error happens when an individual cue disagrees with the majority vote. At each inter-phoneme position, the incremental algorithm gathers votes from each indicator $ i $ , decides whether the position is a ‘boundary’ or is ‘word-internal’, and then increments the error count $ {e}_i $ for each indicator that disagreed with this decision. Finally, the weight $ {w}_i $ of each indicator is updated:
$$ {w}_i=1-\frac{2{e}_i}{N} $$

where $ N $ is the total number of inter-phoneme positions seen, producing weights in $ \left[-1,1\right] $ .
Çöltekin & Nerbonne (2014) state that this update rule “sets the weight of a vote that is half the time wrong to zero, eliminating incompetent voters” and that with this model, the success of boundary decisions depends on the precision of individual boundary indicators. In reality, this score is related to the accuracy of these indicators. As there are fewer true ‘boundary’ labels than ‘word-internal’ labels, an indicator that never places a boundary will achieve higher accuracy than an indicator that always places a boundary, so setting weights to zero when the accuracy is 0.5 is misleading. In our implementation, we use the following:

$$ {w}_i=1-\frac{e_i}{N}\kern2em (2) $$

These weights are in the range $ \left[0,1\right] $ and are exactly equal to the accuracies of each indicator with respect to the final votes.
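The voting loop and the unsupervised weight update can be summarised in a short sketch (ours, with indicator internals abstracted away; the actual MULTICUE implementation differs in its bookkeeping):

```python
class MajorityVoter:
    """Weighted majority voting over boundary indicators."""

    def __init__(self, n_indicators):
        self.errors = [0] * n_indicators   # e_i: disagreements with the decision
        self.positions = 0                 # N: inter-phoneme positions seen
        self.weights = [1.0] * n_indicators

    def decide(self, votes):
        """votes[i] is 1 ('boundary') or 0 ('word-internal') for indicator i."""
        weighted = sum(w * v for w, v in zip(self.weights, votes))
        boundary = weighted / sum(self.weights) > 0.5   # normalised weighted vote
        # Unsupervised update: an error is any disagreement with the decision.
        self.positions += 1
        for i, vote in enumerate(votes):
            if vote != int(boundary):
                self.errors[i] += 1
        self.weights = [1 - e / self.positions for e in self.errors]  # w_i = 1 - e_i/N
        return boundary
```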
Cues and boundary indicators
Using the majority-voting framework, any number of indicators can be considered. Çöltekin and Nerbonne (2014) describe a series of indicators derived from four sets of cues: predictability statistics, utterance boundaries, lexical stress and the lexicon, all deriving from psycholinguistic studies.
All of the indicators calculate a certain measure based on these cues. To propose boundaries, they use a partial-peak strategy. This is based on the peak strategy of transitional probability models where a boundary would be suggested if the transitional probability at an inter-phoneme position was lower than the transitional probabilities on either side of that boundary. Each cue is split into two indicators, splitting this peak in half. An example is given in Figure 1, where the first indicator proposes a boundary after a decrease in transitional probability and the other proposes a boundary before an increase in transitional probability. The model can then learn weights associated with each indicator, using the weighted majority-vote algorithm.
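As an illustration, the two half-peak indicators for a given cue can be computed with simple neighbour comparisons. This sketch (ours) takes the cue values at each inter-phoneme position and returns the two vote sequences; their conjunction recovers the full peak criterion:

```python
def half_peak_votes(values):
    """Split the peak strategy over per-position cue values (e.g., transitional
    probabilities) into two indicators: boundary after a decrease in the cue,
    and boundary before an increase."""
    m = len(values)
    after_decrease = [j > 0 and values[j] < values[j - 1] for j in range(m)]
    before_increase = [j < m - 1 and values[j] < values[j + 1] for j in range(m)]
    return after_decrease, before_increase
```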
Çöltekin & Nerbonne (2014) also include indicators that calculate statistics over a larger context of three phonemes, capturing higher-order regularities, as well as indicators that calculate reverse measures, following the study of Pelucchi et al. (2009) that found that children can also use reverse transitional probabilities for segmentation. Çöltekin (2017) later used MULTICUE to explore various predictability cues in isolation. He found that the best performance was achieved when including indicators with a context size of one, two, three and four phonemes, and also found that successor variety was a better predictability measure than transitional probability. Successor variety is the predictability measure originally described by Harris (1955) as being a good measure for predicting morpheme boundaries. For a set $ A $ of phonemes in the input language, the successor variety for a substring of phonemes $ l $ is given by:
$$ \mathrm{SV}(l)=\sum \limits_{a\in A}D( la) $$

where

$$ D( la)=\begin{cases}1 & \text{if the string } la \text{ occurs in the input}\\ 0 & \text{otherwise}\end{cases} $$
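In practice, successor variety can be gathered incrementally by recording which phonemes are attested after each context string. The sketch below (our illustration; the data structure is not necessarily the one used by Çöltekin) caps contexts at four symbols, mirroring the bounded context sizes described above:

```python
from collections import defaultdict

def successor_variety(utterances, max_context=4):
    """SV(l): the number of distinct phonemes attested directly after the
    phoneme string l, computed for contexts of up to max_context symbols."""
    successors = defaultdict(set)
    for u in utterances:                    # u is a string of phoneme symbols
        for end in range(1, len(u)):
            for size in range(1, max_context + 1):
                if end - size >= 0:
                    successors[u[end - size:end]].add(u[end])
    return {context: len(nexts) for context, nexts in successors.items()}
```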
We reimplemented all cues used by Çöltekin & Nerbonne (2014) as well as the predictability cues used by Çöltekin (2017), exposing the resulting models through a simple command-line interface; we henceforth refer to them as MULTICUE-14 and MULTICUE-17 respectively. These models only vary in which indicators are included: MULTICUE-14 has 44 stress, predictability, lexicon and utterance-boundary indicators and MULTICUE-17 has 16 predictability indicators. For our reported results of MULTICUE-17, we use the successor variety predictability cue, as this was the measure that Çöltekin (2017) found to give the best performance. See Çöltekin & Nerbonne (2014) and Çöltekin (2017) for detailed descriptions of these cues.
An updated set of cues
Based on the success of using a variety of cues (Çöltekin & Nerbonne, 2014) and the success of using higher-order $ n $ -grams and the successor variety cue (Çöltekin, 2017), we propose a new set of indicators that combine these ideas. This set consists of the successor variety cue of MULTICUE-17 and the lexicon and utterance boundary cues of MULTICUE-14. Indicators are created for $ n $ -gram values from 1 to 4. The stress cue is not included, following the finding of Çöltekin & Nerbonne (2014) that it decreases performance and also because the cross-lingual corpora that we use for evaluation do not provide stress alignment information.
We refer to the MULTICUE model using this new set of cues as MULTICUE-23. A summary of MULTICUE-14, MULTICUE-17 and the new MULTICUE-23 model is given in Table 1.
Venkataraman’s language modelling algorithm
Venkataraman’s model follows a language-modelling approach to segmentation. As a lexicon is developed and phonemic distributional statistics are learned, utterances are decoded using the Viterbi algorithm to find the maximum-likelihood segmentation. This is an interactionist approach as it is driven by lexical recognition rather than boundary placement.
Language model
A standard language model is used to calculate the likelihood of a segmentation. Given a segmentation $ \mathbf{W}={w}_1,\dots, {w}_n $ composed of $ n $ individual words $ {w}_i\in \mathbf{L} $ in a lexicon $ \mathbf{L} $ , the most likely segmentation $ \hat{W} $ is
$$ \hat{\mathbf{W}}=\underset{\mathbf{W}}{\arg \max}\prod \limits_{i=1}^nP\left({w}_i|{w}_1,\dots, {w}_{i-1}\right) $$

To prevent underflow errors in computation, an equivalent calculation is made using log-likelihoods:

$$ \hat{\mathbf{W}}=\underset{\mathbf{W}}{\arg \max}\sum \limits_{i=1}^n\log P\left({w}_i|{w}_1,\dots, {w}_{i-1}\right) $$
A common approximation when implementing language models is the $ n $ -gram approximation, collapsing these conditional probabilities to consider at most $ n-1 $ preceding words. Venkataraman (2001) makes a three-gram approximation, estimating $ P\left({w}_i|{w}_{i-2},{w}_{i-1}\right) $ with relative frequencies and using a back-off procedure to estimate the probability of unseen $ n $ -grams with lower order $ n $ -grams (Katz, 1987). He uses a back-off technique given by Witten & Bell (1991), using phoneme 1-grams to estimate unseen words.
Venkataraman (2001) implemented 1-gram, 2-gram and 3-gram models, finding a trade-off between precision and recall, with 1-grams giving the best performance overall. This is surprising, as $ n $ -gram contexts typically improve performance in such systems. Venkataraman claimed that this is because the 2-gram and 3-gram models are more conservative: as longer $ n $ -grams are more infrequent, whole utterances are often inserted into their lexicons. As such, we only implement the 1-gram model.
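A simplified sketch of this 1-gram model is given below (ours; Venkataraman's actual Witten-Bell back-off mass is not reproduced, with add-one smoothing of phoneme frequencies standing in for it):

```python
import math
from collections import Counter

class UnigramLM:
    """1-gram word model with a phoneme-level estimate for unseen words."""

    def __init__(self):
        self.word_counts = Counter()
        self.phoneme_counts = Counter()

    def log_prob(self, word):
        if self.word_counts[word] > 0:     # seen word: relative frequency
            return math.log(self.word_counts[word] / sum(self.word_counts.values()))
        # Unseen word: product of (smoothed) phoneme relative frequencies.
        total = sum(self.phoneme_counts.values())
        vocab = len(self.phoneme_counts) + 1
        return sum(math.log((self.phoneme_counts[p] + 1) / (total + vocab))
                   for p in word)

    def observe(self, word):
        self.word_counts[word] += 1
        self.phoneme_counts.update(word)
```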
Viterbi search
The language model defines the likelihood of a segmentation, but a search procedure is required to find the most likely segmentation. Exhaustive search is computationally intractable as there are $ {2}^{n-2} $ possible segmentations for an utterance of $ n $ phonemes, so this would be an unreasonable model for human segmentation. Instead, Venkataraman uses Viterbi search (Viterbi, 1967) to decode each utterance, a dynamic programming algorithm that only explores $ {\left(n-2\right)}^2 $ segmentations.
The algorithm begins with an empty lexicon and no knowledge of phoneme frequencies, building these incrementally as each utterance is processed. The process is unsupervised, as no word boundaries are ever provided to the model. As such, all three constraints for algorithmic-level segmentation are satisfied.
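The search itself is a short dynamic programme. This sketch (ours) scores candidate words with any log-probability function, such as the UnigramLM sketch above, and recovers the best segmentation by backtracing; the final loop shows the incremental, unsupervised cycle of segmenting and then updating the model:

```python
def viterbi_segment(utterance, log_prob):
    """Highest-scoring segmentation of an utterance under a word-scoring
    function. best[j] is the best score of any segmentation of utterance[:j]."""
    n = len(utterance)
    best = [0.0] + [float("-inf")] * n
    back = [0] * (n + 1)
    for j in range(1, n + 1):          # end of a candidate word
        for i in range(j):             # start of that word
            score = best[i] + log_prob(utterance[i:j])
            if score > best[j]:
                best[j], back[j] = score, i
    words, j = [], n
    while j > 0:                        # recover the words by backtracing
        words.append(utterance[back[j]:j])
        j = back[j]
    return words[::-1]

lm = UnigramLM()
for utterance in ["yuwanttusiD6bUk"]:   # corpus of unsegmented utterances
    for word in viterbi_segment(utterance, lm.log_prob):
        lm.observe(word)
```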
Blanchard’s extended algorithm
Blanchard et al. (2010) extend Venkataraman’s 1-gram model to produce PHOCUS, for PHonotactic CUe Segmenter. They introduce two phonotactic cues: one language-specific and one language-universal. The first extends the unseen word estimate to use conditional probabilities of phoneme $ n $ -grams, rather than the phoneme 1-grams of Venkataraman. This cue is language-specific as phonotactic constraints (permissible phoneme combinations) vary between languages, so the phoneme $ n $ -gram probabilities must be learned. The models that keep track of phoneme $ n $ -grams are referred to as PHOCUS- $ n $ , with PHOCUS-1 being equivalent to Venkataraman’s model. For simplicity, we do not consider higher-order phoneme $ n $ -grams here.
The second cue is the universal constraint that words must have at least one syllabic nucleus. Syllabic nuclei in English consist of all vowels and some consonant sounds (such as the [l̩, m̩, ɹ̩, n̩] sounds in awful [ɔfl̩], rhythm [ɹɪðm̩], butter [bʌtɹ̩] and even [ivn̩]). There is much debate about the validity of syllables as a perceptual unit (Mehler et al., 1981; Räsänen et al., 2018; Ziegler & Goswami, 2005), but Blanchard et al. claim that this constraint is plausibly a prior that does not need to be learned, as it can be explained without making assumptions about the perceptual status of syllables: instead, this assumption only depends on sonority (for vowels) or manner of articulation (nasals, liquids) and the fact that every word requires at least one of these. To implement this constraint, probabilities of words that do not have a syllabic nucleus are set to 0. Adding this constraint to PHOCUS-1 gives PHOCUS-1S.
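The constraint is cheap to implement once a set of syllabic symbols is fixed. In the sketch below (ours), the inventory is illustrative only and would need to match the transcription scheme in use:

```python
# Illustrative inventory; the real set depends on the transcription scheme.
SYLLABIC = {"a", "e", "i", "o", "u", "l̩", "m̩", "ɹ̩", "n̩"}

def has_syllabic_nucleus(word):
    """word is a sequence of phoneme symbols; every word needs a syllabic sound."""
    return any(symbol in SYLLABIC for symbol in word)

def constrained_log_prob(word, log_prob):
    """Zero probability (negative infinity in log space) for nucleus-less words."""
    return log_prob(word) if has_syllabic_nucleus(word) else float("-inf")
```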
Full algorithm
PHOCUS-1S iteratively processes each utterance using the Viterbi algorithm to find the segmentation that maximises the product of estimated word probabilities. After segmenting each utterance, the proto-lexicon and phoneme counts are updated, improving the language model.
An example of this loop is given in Figure 2, where possible segmentations of the utterance andadoggy are considered. For the first possible segmentation, and adoggy, the probability of the word and is given by its relative frequency in the proto-lexicon. The word adoggy has not been seen before, so its probability is calculated using the relative frequencies of each of its symbols (graphemes in this example, phonemes in our experiments). For the second possible segmentation given in the example, andado gg y, the word gg contains no syllabic nucleus, so has a probability of 0, resulting in a probability of 0 for the whole utterance. Assuming 0.02 is the highest score out of all segmentations considered by the Viterbi algorithm, and adoggy would be selected as the best segmentation for this utterance.
Model summary
The PHOCUS-1 model of Venkataraman (2001), extended by Blanchard et al. (2010) to produce PHOCUS-1S, uses a language model to define the probability of a segmentation based on seen-word frequency and phoneme frequency for unseen words. In PHOCUS-1S, probabilities of words not containing a syllabic nucleus are set to 0. A Viterbi algorithm finds the most-likely segmentation of an utterance. We implemented both PHOCUS-1 and PHOCUS-1S for comparison with the MULTICUE models. A summary of both models is given in Table 1.
DYMULTI: A combined segmentation model
MULTICUE is in principle a boundary-finding model for segmentation, but MULTICUE-14 does use indicators based on the lexicon, so it could be considered an interactionist model. PHOCUS is also an interactionist model, using a language model for calculating the probability of segmenting an utterance, but it does use sub-lexical information for estimating the probability of unseen words. Therefore, both models involve a complicated interaction of lexical and sub-lexical information, which is consistent with studies showing that infants use both sources of information for solving the segmentation problem. There are, however, drawbacks to both approaches.
One of the key benefits of MULTICUE is that it can combine an arbitrary number of sub-lexical boundary indicators. This is a good model for explicit segmentation, as experimental studies have shown that infants are sensitive to a wide variety of cues. PHOCUS, on the other hand, cannot consider an arbitrary number of sub-lexical indicators. This is not just a drawback of PHOCUS, but of any language modelling approach to segmentation. To add a new indicator, the entire language model would need to be redefined and this would be very difficult to do without making prior assumptions about the cues.
The strength of PHOCUS lies in the Viterbi search process. The segmentation of an utterance is decided at the lexical level, based on the scores assigned to each word in the segmentation. This means that it is easy to incorporate lexical-level constraints, such as the syllabic nucleus constraint of Blanchard et al. (2010). Such a constraint cannot be easily incorporated into the MULTICUE model, or indeed into any boundary-finding approach to segmentation, as boundary-finding models place boundaries independently of each other using only the local context around that boundary. Hence, the decision cannot depend on the placement of previous or future boundaries.
In this section, we present a new framework for segmentation models that combines the two approaches. It collects scores for each inter-phoneme position from multiple indicators using the weighted majority-vote algorithm of MULTICUE, then uses a modification of the Viterbi algorithm from PHOCUS to choose the best segmentation, rather than just placing boundaries greedily. This combined model allows for the consideration of multiple sub-lexical and lexical cues, addressing the drawbacks of both the boundary-finding and language-modelling approaches to segmentation. We name this framework DYMULTI for DYnamic programming MULTIple-cue model.
Using weighted boundary votes with the Viterbi algorithm
In DYMULTI, the Viterbi algorithm finds the best segmentation according to boundary scores rather than word scores. These boundary scores are adapted from the weighted majority-vote algorithm of the MULTICUE model, adjusting equation (1) to give a real-valued score instead of a binary decision:
$$ \mathrm{score}(j)=\frac{2\sum \limits_{i=1}^K{w}_i{1}_{ij}}{\sum \limits_{i=1}^K{w}_i}-1\kern2em (3) $$

where $ {w}_i $ is the weight for indicator $ i $ , as given by equation (2), and $ {1}_{ij} $ is the vote of indicator $ i $ at inter-phoneme position $ j $ . These scores lie between -1 and 1, with scores over 0 indicating a boundary and scores close to 1 or -1 indicating strong agreement between indicators.
We then adapt the Viterbi algorithm to maximise the sum of these boundary scores, rather than minimise the sum of negative log word probabilities. The word score function now simply returns score(j), the score given by the weighted majority-vote algorithm between phoneme $ j-1 $ and $ j $ . At the utterance boundaries, score(j) always returns 1.
Without any other changes, this algorithm simply places boundaries at every position where the score is greater than 0, as this maximises the sum over the utterance. As scores over 0 indicate where MULTICUE would have placed a boundary according to equation (1), this means that DYMULTI will act exactly like MULTICUE if the same indicators are provided. The difference with this new framework is that lexical-level processes can be introduced by adjusting the word score function, as described in the next two sections.
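In code, the change from MULTICUE amounts to replacing the binary decision with the rescaled score of equation (3) and treating that score as the word score at a candidate word's left edge; the viterbi_segment sketch above then carries over, with word log-probabilities replaced by these positional scores (a sketch, with variable names of our choosing):

```python
def boundary_score(votes, weights):
    """Equation (3): the normalised weighted vote rescaled to [-1, 1],
    so that any score above 0 marks a MULTICUE-style boundary."""
    return 2 * sum(w * v for w, v in zip(weights, votes)) / sum(weights) - 1

def left_edge_score(i, scores):
    """Score of a candidate word starting at position i: the boundary score
    at its left edge, with scores[0] fixed at 1.0 (the utterance boundary)."""
    return scores[i]
```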
Introducing the require-syllabic-sound lexical constraint
The first lexical-level process we introduce to DYMULTI is the syllabic nucleus constraint of Blanchard et al. (2010). We adjust the word score function so that if the word has no syllabic nucleus, the function returns -100. This number is chosen to be far smaller than any positive sum could account for, similarly to the large negative log probability used to simulate a probability of 0 in our implementation of PHOCUS-1S.
An example of the DYMULTI framework segmenting the utterance andadoggy is given in Figure 3. First, the boundary scores for the utterance are calculated using equation (3). These scores are then used to calculate the score of each segmentation, using the Viterbi algorithm. For the segmentation and adoggy, the boundary scores to the left of the two words are 1.0 and 0.7, so the score for the segmentation is 1.7. As with PHOCUS-1S, the syllabic nucleus constraint prevents andado gg y from being a valid segmentation, giving a score of -100 to the word gg.
Introducing a lexical recognition model
Using the Viterbi algorithm, other lexical processes can also be introduced to DYMULTI. Here, we propose one such process: a rudimentary lexical recognition process to favour previously-seen words. This mirrors the lexical recognition driving many of the language modelling methods for segmentation (Blanchard et al., 2010; Brent, 1999; Monaghan & Christiansen, 2010; Venkataraman, 2001).
This lexical recognition introduces a single parameter, $ \alpha $ , to DYMULTI. To favour previously-seen words, this process simply adds $ \alpha $ to the score of a word $ w $ if $ w\in L $ , where $ L $ is the proto-lexicon populated with words in previous segmentations. Reasonable values of $ \alpha $ lie in $ \left[0,1\right] $ , where $ \alpha =0 $ is equivalent to not using the lexical recognition process. Setting $ \alpha =1 $ has the effect of always trying to place boundaries around previously-seen words, as it will always return scores above 0 since $ \mathrm{score}(s)\in \left[-1,1\right] $ . Note that due to the syllabic nucleus constraint, these boundaries will not necessarily be placed, but there will still be a very strong bias towards them. Intermediate values of $ \alpha $ result in a balance between the lexical recognition process and the boundary-finding process.
The full word score function with both lexical processes is given in Figure 4.
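Our reading of that word score function, expressed as a sketch (reusing has_syllabic_nucleus from the earlier PHOCUS sketch):

```python
def word_score(word, left_boundary_score, lexicon, alpha):
    """DYMULTI word score: a hard -100 penalty for words with no syllabic
    nucleus; otherwise the boundary score at the word's left edge, plus an
    alpha bonus if the word is already in the proto-lexicon."""
    if not has_syllabic_nucleus(word):
        return -100.0
    return left_boundary_score + (alpha if word in lexicon else 0.0)
```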
Summary
The new DYMULTI framework addresses the drawbacks of the language modelling and boundary-finding approaches to segmentation. The model uses the weighted majority-vote algorithm of MULTICUE to produce scores that are then used to select the best segmentation using the Viterbi algorithm of PHOCUS. Using this dynamic programming algorithm, the model is able to incorporate lexical processes. We describe two such processes: the syllabic nucleus constraint from PHOCUS-1S (Blanchard et al., 2010) and a rudimentary lexical recognition process that takes a single parameter to adjust the weighting given to previously-seen words. The full model efficiently combines multiple sub-lexical and lexical cues for segmentation, without the drawbacks of previous models.
A summary of the DYMULTI models used in this study is given in Table 1, highlighting how the concepts behind MULTICUE and PHOCUS have been combined. We implement DYMULTI with the cues from MULTICUE-14, MULTICUE-17 and our new MULTICUE-23 model.
Data and Evaluation
In this section, we discuss the procedure used to evaluate PHOCUS, MULTICUE and DYMULTI. This includes the data used, the baseline segmentation model and the evaluation metrics.
Corpora
To evaluate computational models for speech segmentation, it is customary to use transcriptions of real child-directed speech as input data. The first corpus we use in this study is the BR corpus, the de facto standard for evaluating computational models for segmentation. It was originally collected by Bernstein Ratner (1987) by recording the conversations between nine mothers and their children. It makes up part of the English section of CHILDES, a large database that contains orthographic transcriptions of speech between carers and children of a variety of ages in a multitude of languages (MacWhinney & Snow, 1985).
The BR corpus was later hand-processed by Brent and Cartwright (1996) to produce a phonemic transcription, keeping only child-directed utterances and removing onomatopoeia and interjections. They removed all word boundaries, keeping only utterance boundaries, for a total of 95,809 phonemes, 33,387 words and 9,790 utterances. The transcription system used is not standard, often combining diphthongs, r-coloured vowels and syllabic consonants into a single symbol. As there are only 50 symbols used, there is an average of 2.9 phonemes per word. Lexical stress was later added semi-automatically by Çöltekin & Nerbonne (2014) according to stress patterns in the MRC psycholinguistic database. Examples of utterances from the corpus can be seen in Table 2. Segmentation models have the task of correctly placing word boundaries given these input utterances, without any supervision. For example, if the input is yuwanttusiD6bUk then the correct output is yu want tu si D6 bUk (you want to see the book). We also note that the corpus represents a tiny fraction of the total input available to children, which has been estimated to be between 2M and 7M words per year (Gilkerson et al., 2017).
In this study, we also evaluate models cross-lingually, using phonemic transcriptions of child-directed speech from 26 different languages created by Caines et al. (2019). Their dataset consists of 132 monolingual CHILDES corpora, each containing 10,000 child-directed utterances aimed at children two years or younger. They processed these corpora using the eSpeak Next Generation (NG) speech synthesizer text analysis module² or the segments grapheme-to-phoneme transformer³ to produce phonemic transcriptions. We direct readers to their study for a full description of the corpora. Of these 132 transcripts, we select one for each language,⁴ for a total of 26. This was done to facilitate comparison, as otherwise the corpora for some languages would be much larger than others; there are 28,000 total utterances for North American English but only 10,000 utterances for Basque.
These transcriptions use the International Phonetic Alphabet which contains more symbolic phonemes than the alphabet used for the BR corpus. In these transcriptions, there is an average of $ 3.7\pm 0.7 $ phonemes per word due to the more fine-grained phonetic detail. This also varies between languages. For instance, the Turkish transcript has an average of 5.4 phonemes per word but the Cantonese transcript only has an average of 2.6 phonemes per word, reminding us that the notion of a “word” is not equal across languages.
It also must be stated that the BR corpus and these cross-lingual corpora represent an idealisation of the natural scenario. There is likely to have been a degree of error in the transcription stage. Additionally, the phonemic representations of the transcriptions are idealised productions based on dictionary pronunciation; orthographic words are phonemically transcribed in the same way each time they occur, regardless of context. Finally, the largest simplifying assumption made by working with phonemic transcripts is that infants at this age are able to segment speech into phonemes, requiring them to both group phone realisations as phonemes and to have access to phone boundaries, neither of which is a simple task. We discuss the validity of this assumption and the implications for future work in child language acquisition at the end of the paper.
Evaluation metrics
We report each model’s performance using the standard measures of precision, recall and F1-score:

$$ P=\frac{TP}{TP+ FP},\kern2em R=\frac{TP}{TP+ FN},\kern2em {F}_1=\frac{2 PR}{P+R} $$

where TP is the number of true positives identified by the model, FP is the number of false positives (items identified by the model that are incorrect with respect to the gold standard) and FN is the number of false negatives (items missed by the model). The F1-score is the harmonic mean of precision and recall, providing a single balanced measure. As is conventional, we report F1-scores as percentages.
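These three measures follow directly from the counts; a minimal helper:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 (as a percentage) from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall, 100 * 2 * precision * recall / (precision + recall)
```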
Studies of computational segmentation report these measures in three different ways (Brent, 1999; Çöltekin & Nerbonne, 2014): BOUNDARY, TOKEN and TYPE.
Boundary scores
TP, FP and FN are calculated according to the boundaries placed. For instance, TP is the number of correctly identified boundaries. This gives BP, BR and BF for the BOUNDARY PRECISION, BOUNDARY RECALL and BOUNDARY F1-SCORE. Note that utterance boundaries are not included in these calculations, as these are assumed to be trivial to place.
Token/word scores
These are stricter measures that indicate how well word tokens have been identified in the speech stream. As such, true positives are counted only if both boundaries of a word are found without an intervening boundary between them. These scores are necessarily lower than the boundary scores. This gives WP, WR and WF for the word precision, word recall and word F1-score. Note that these include utterance-initial and utterance-final words.
Type/lexicon scores
These are similar to word scores, but true positives are marked over word types rather than word tokens, so are not skewed by the frequency of each type. This is done by comparing the final lexicon learned by the model to the expected lexicon, the true set of word types in the corpus. If the model is better at segmenting high-frequency words, the lexicon scores will be lower than the word scores. The lexicon scores are LP, LR and LF for the lexicon precision, lexicon recall and lexicon F1-score.
Finally, we report two error measures that give insight into how the models may fail, corresponding to the two types of boundary error a model can make. First, a model may miss a boundary, causing under-segmentation. Second, it may place a boundary where there should not be one, causing over-segmentation. As raw error counts vary with the size of the corpus, and as there are many more word-internal positions than boundaries, normalised measures are used for under-segmentation ($ {E}_u $) and over-segmentation ($ {E}_o $):

$ {E}_u=\frac{FN}{TP+FN},\kern2em {E}_o=\frac{FP}{FP+TN} $
where TP, FP and FN are the quantities used for the boundary measures and TN is the number of true negatives (word-internal positions correctly left unmarked). Intuitively, $ {E}_u $ gives the fraction of true boundaries marked as word-internal and $ {E}_o $ gives the fraction of word-internal positions incorrectly marked as boundaries.
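To illustrate the boundary, word and error measures together, the following minimal sketch computes them from gold and predicted segmentations. It is our own illustration rather than the evaluation code used in the study, and assumes utterances are strings of phoneme symbols with spaces marking word boundaries:

```python
def boundary_vector(utterance):
    """Inter-phoneme indices carrying a word boundary, excluding the
    trivial utterance-initial and utterance-final positions."""
    boundaries, position = set(), 0
    for word in utterance.split()[:-1]:  # last word ends at the utterance edge
        position += len(word)
        boundaries.add(position)
    return boundaries

def word_spans(utterance):
    """Word tokens as (start, end) phoneme spans; includes edge words."""
    spans, start = [], 0
    for word in utterance.split():
        spans.append((start, start + len(word)))
        start += len(word)
    return spans

def evaluate(gold_utterances, predicted_utterances):
    btp = bfp = bfn = btn = wtp = wfp = wfn = 0
    for gold, pred in zip(gold_utterances, predicted_utterances):
        g, p = boundary_vector(gold), boundary_vector(pred)
        n_positions = len(gold.replace(" ", "")) - 1  # candidate positions
        btp += len(g & p)
        bfp += len(p - g)
        bfn += len(g - p)
        btn += n_positions - len(g | p)
        # A word token is a true positive only if both of its boundaries are
        # placed with no boundary in between, i.e., the spans match exactly.
        gw, pw = set(word_spans(gold)), set(word_spans(pred))
        wtp += len(gw & pw)
        wfp += len(pw - gw)
        wfn += len(gw - pw)

    def prf(tp, fp, fn):
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        return precision, recall, f1

    bp, br, bf = prf(btp, bfp, bfn)
    wp, wr, wf = prf(wtp, wfp, wfn)
    e_u = bfn / (btp + bfn) if btp + bfn else 0.0  # under-segmentation
    e_o = bfp / (bfp + btn) if bfp + btn else 0.0  # over-segmentation
    return dict(BP=bp, BR=br, BF=bf, WP=wp, WR=wr, WF=wf, Eu=e_u, Eo=e_o)
```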
Evaluation procedure
In the machine learning literature, models are typically evaluated by training them on one section of the corpus then testing them on another. Computational models for segmentation, however, are unsupervised. Thus, we follow the procedure outlined by previous studies, training our models on a single run of the whole corpus and reporting the average scores across this training period. Although models improve as they learn, they cannot correct past mistakes, resulting in lower average scores than if a test-train split were used.
As a baseline model, we implemented BASELINE, which assigns boundaries at random but with the correct probability (the true proportion of word boundaries in the corpus). BASELINE is therefore better informed than a truly random classifier, since this probability is difficult to estimate without supervision. This has been the customary baseline for evaluating segmentation models since Brent & Cartwright (Reference Brent and Cartwright1996).
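For concreteness, a minimal sketch of such a baseline follows; the function names are ours, and the oracle boundary probability is estimated from the gold corpus as described above:

```python
import random

def true_boundary_probability(gold_utterances):
    """Fraction of inter-phoneme positions that are word boundaries."""
    boundaries = sum(len(u.split()) - 1 for u in gold_utterances)
    positions = sum(len(u.replace(" ", "")) - 1 for u in gold_utterances)
    return boundaries / positions

def baseline_segment(unsegmented, boundary_prob, rng=random.Random(0)):
    """Insert a boundary at each inter-phoneme position with fixed
    probability; seeded for reproducibility."""
    out = []
    for i, phoneme in enumerate(unsegmented):
        out.append(phoneme)
        if i < len(unsegmented) - 1 and rng.random() < boundary_prob:
            out.append(" ")
    return "".join(out)
```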
Most previous studies report their results on a single run of the unshuffled corpus, in order to best represent the input that children receive, but some report results averaged over multiple shuffles, as is standard practice in the empirical evaluation of machine learning systems. To be as thorough and as consistent with previous work as possible, we report both. All models reported here are deterministic, so only one run is needed per shuffle. Comparing unshuffled to shuffled results has the additional benefit of isolating the effect of utterance order, allowing us to identify whether parents unknowingly bias the ordering of utterances spoken to their children in ways that increase learnability.
Results
In this section, we first evaluate our re-implementations of MULTICUE and PHOCUS, comparing older sets of cues to the new set used in MULTICUE-23. We then evaluate DYMULTI against PHOCUS and MULTICUE, comparing average performance, learning rates across the BR corpus and the effect of different values of the lexical recognition parameter $ \alpha $. We then compare DYMULTI to previous studies to place its performance in context. Finally, we perform cross-lingual evaluation, comparing DYMULTI, PHOCUS and MULTICUE on 26 different languages. This marks the first time that state-of-the-art segmentation models have been compared on so many languages (Caines et al. (Reference Caines, Altmann-Richer and Buttery2019) compared baseline models with one state-of-the-art model).
In the analysis below, significance is tested using a paired t-test with $ \alpha =0.001 $, using samples collected by running each model over 10 shuffles of the input data. Unless otherwise stated, we set DYMULTI’s lexical recognition parameter $ \alpha $ to 0.
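The test itself is straightforward to reproduce; the sketch below pairs per-shuffle scores from two models and applies SciPy's paired t-test. The score lists are illustrative placeholders, not results from this study:

```python
from scipy.stats import ttest_rel

# Hypothetical WF scores for two models on the same 10 shuffles,
# paired by shuffle so the comparison controls for input order.
model_a_wf = [78.1, 77.9, 78.4, 78.0, 78.3, 77.8, 78.2, 78.5, 77.7, 78.1]
model_b_wf = [80.9, 80.5, 81.2, 80.8, 81.0, 80.4, 80.7, 81.1, 80.6, 80.9]

t_statistic, p_value = ttest_rel(model_a_wf, model_b_wf)
if p_value < 0.001:
    print(f"significant difference: t={t_statistic:.1f}, p={p_value:.2g}")
```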
Reimplementation of PHOCUS and MULTICUE models
MULTICUE models
Table 3 gives the results for MULTICUE, run with different sets of indicators. Our implementations of MULTICUE-14 and MULTICUE-17 achieve similar scores to Çöltekin & Nerbonne (Reference Çöltekin and Nerbonne2014) and Çöltekin (Reference Çöltekin2017), with slightly higher error rates than the reported results but far exceeding the baseline. These differing error rates are likely due to fine-grained implementation differences, such as how probability estimates are calculated and how utterances are internally represented.
Note. BP, BR and BF stand for boundary precision, recall and F1-score. W and L scores are similar for the word and lexicon measures. $ {E}_u $ and $ {E}_o $ give under-segmentation and over-segmentation. The highest scores and lowest error rates (not including referenced results) are given in bold. Italicised lines give the scores reported for each model in their corresponding studies. MULTICUE-14/S is the MULTICUE-14 model without the stress cue. The models are only run once on the unshuffled corpus to facilitate direct comparison with the corresponding reported scores.
a As reported by Çöltekin & Nerbonne (Reference Çöltekin and Nerbonne2014).
b As reported by Çöltekin (Reference Çöltekin2017).
Table 3 also compares running MULTICUE-14 without the stress cue, as Çöltekin & Nerbonne (Reference Çöltekin and Nerbonne2014) found the stress cue to decrease performance. The model achieves slightly lower scores than the published results in this case, with a higher under-segmentation error rate. As such, we can confirm the finding of Çöltekin & Nerbonne (Reference Çöltekin and Nerbonne2014), that including the stress cue leads to worse overall performance. Also included are the results of MULTICUE-23, whose set of indicators combines the strengths of the MULTICUE-14 and MULTICUE-17 models. This set of indicators clearly leads to substantial improvements, with MULTICUE-23 achieving better F1-scores than MULTICUE-14 and MULTICUE-17.
PHOCUS models
The performance of the two PHOCUS models is given in Table 4. Reference rows are not available because Venkataraman (Reference Venkataraman2001) and Blanchard et al. (Reference Blanchard, Heinz and Golinkoff2010) use different evaluation schemes. Venkataraman reports only WP, WR and LP for PHOCUS-1, averaged over 100 shuffles of the corpus, giving 67.7, 70.2 and 52.9 respectively. Averaging over 10 shuffles, we get $ 69.2\pm 2.9 $, $ 67.2\pm 2.5 $ and $ 47.4\pm 1.2 $ respectively. Deriving a WF score from his WP and WR scores gives 68.9, which is close to our WF score of $ 68.2\pm 2.6 $.
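For reference, the derived WF score is simply the harmonic mean of the reported WP and WR:

$ \mathrm{WF}=\frac{2\cdot \mathrm{WP}\cdot \mathrm{WR}}{\mathrm{WP}+\mathrm{WR}}=\frac{2\times 67.7\times 70.2}{67.7+70.2}\approx 68.9 $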
Note. BP, BR and BF stand for boundary precision, recall and F1-score. W and L scores are similar for the word and lexicon measures. $ {E}_u $ and $ {E}_o $ give under-segmentation and over-segmentation. The highest scores and lowest error rates are given in bold.
a Scores averaged over ten shuffles of the BR corpus.
Blanchard et al. only give the WF score of PHOCUS-1S, reporting WF = 80, but do not include the first 1000 utterances in this calculation. Replicating this, we achieve a very close WF score of 80.8. Overall, our implementation of these models seems to perform similarly to the original studies.
Comparing the scores achieved by our implementations of the two models, it is clear that the syllabic nucleus constraint introduced in PHOCUS-1S leads to a significant increase in all F1-scores, motivating a boundary-finding model that can also make use of this constraint. Significance tests are given in Table 5. The over-segmentation error rate is halved from 6.1 to 3.0, a significant reduction ($ t=-12.4 $, $ p=5.8\times {10}^{-7} $), indicating that this constraint prevents boundaries from being placed at word-internal positions where they would otherwise produce words without syllabic nuclei.
Note. BF, WF and LF stand for boundary, word and lexicon F1-scores. Each pairwise comparison considers the same set of indicators (e.g., DYMULTI-14 uses the same set of indicators as MULTICUE-14). We set $ \alpha =0 $ for the DYMULTI models. Each model is run on ten shuffles of the BR corpus and scores are paired for each shuffle. All scores are significant at the $ p<0.001 $ level except for the final test.
Table 4 also reveals that F1-scores decrease when the input corpus is shuffled. This indicates that the specific ordering of utterances in the BR corpus is useful for segmentation. As the ordering of utterances comes from real child-directed speech, this suggests that parents may positively bias the ordering of utterances spoken to their children to assist with segmentation, such as pairing new word types with previously-uttered word types.
Performance of DYMULTI
Comparing DYMULTI to MULTICUE when using the same set of indicators
Table 6 gives the full scores comparing MULTICUE to DYMULTI when considering just the syllabic nucleus constraint (i.e., with $ \alpha =0 $). Every F1-score is significantly improved under the DYMULTI model, with significance tests given in Table 5. DYMULTI-23 achieved the best F1-scores: 89.5, 81.8 and 51.9 for BF, WF and LF respectively. Generally, the DYMULTI model significantly decreases the over-segmentation error rate of the corresponding MULTICUE models. This confirms that the syllabic nucleus constraint alone is a useful addition, correctly preventing the model from placing erroneous boundaries. The increase in WF and LF scores shows that the Viterbi algorithm, considering this constraint, finds the correct words to segment, leading to a more accurate lexicon.
Note. BP, BR and BF stand for boundary precision, recall and F1-score. W and L scores are similar for the word and lexicon measures. $ {E}_u $ and $ {E}_o $ give under-segmentation and over-segmentation. Each pairwise comparison considers the same set of indicators (e.g., DYMULTI-14 uses the same set of indicators as MULTICUE-14). We set $ \alpha =0 $ for the DYMULTI models. Each model is run on ten shuffles of the BR corpus, averaging scores, with the highest scores and lowest error rates in bold.
Learning rates of segmentation models
The scores in Table 6 were calculated by taking the average performance of each model over the whole BR corpus, including the many initial mistakes made as the models gather statistical information. While this gives an indication of how well each model learns to segment after beginning with no knowledge of the target language, it can be more informative to see how the performance of each model progresses across the corpus.
Figure 5 gives the WF and LF learning rates for a selection of the models described in this study. These models initially perform very poorly but improve quickly over the first 1000–2000 utterances, after which scores change by no more than 10 points. This is expected, as the models begin with very poor representations of the target language and so make poor boundary decisions. As such, the average scores over the whole corpus are not representative of the final performance of each model. For example, over the whole corpus MULTICUE-14 has an average WF score of $ 75.4\pm 0.5 $ and MULTICUE-23 has an average WF score of $ 78.2\pm 1.0 $, but there is no significant difference between their WF scores on the final block of 200 utterances. It seems that MULTICUE-14 initially learns very slowly, likely due to the large number and variety of indicators that must be learned.
From these learning rates, we also see the consistent benefit of the syllabic nucleus constraint. At every stage in the learning process, PHOCUS-1S achieves higher F1-scores than PHOCUS-1 and DYMULTI-23 achieves higher F1-scores than MULTICUE-23. Indeed, DYMULTI-23 achieves the highest F1-scores out of any model at almost every stage of the learning process, achieving WF and LF scores of $ 86.3\pm 1.7 $ and $ 82.9\pm 2.0 $ respectively on the final 200 utterances of the BR corpus. This confirms the validity of this model and the benefits of combining the boundary-finding and language modelling approaches to segmentation.
Lexical recognition process
Figure 6 compares MULTICUE to DYMULTI, considering three values for the lexical recognition parameter $ \alpha $ . In all cases but one, the DYMULTI model performs significantly better than the MULTICUE model when both models use the same set of indicators (see Table 5 for the significance test results). It seems the LF scores are more sensitive to this change in model than the WF scores. The LF score for MULTICUE-14, for instance, jumps by more than 20 points when DYMULTI is used with $ \alpha =0.5 $ . This suggests that the lexical processes of DYMULTI help capture infrequent words.
The learning rates for these models, however, tell a different story. Figure 7 compares the learning rates of these models, revealing that the relatively higher average WF and LF scores over the corpus for DYMULTI with $ \alpha =0.5 $ and $ \alpha =1 $ given in Figure 6 are due to a very steep initial learning rate. Indeed, the final WF and LF scores for these two values of $ \alpha $ are lower than those of DYMULTI-23 with $ \alpha =0 $, and even lower than those of MULTICUE-23. It seems that segmenting on the basis of previously-seen words is a useful strategy at the very start of learning, while the boundary cues are still gathering statistical information. As the boundary-based process improves, the lexical recognition procedure begins to harm the model, leading to a decrease in WF and LF over time. We also experimented with smaller values of $ \alpha $ but never reached significantly higher F1-scores by the end of learning.
We believe that this is because the lexical recognition process described in this study is very rudimentary, simply adding a fixed value to the score when a word is recognised. While the model is gathering statistical information, the boundary-based votes are likely to be inaccurate and extreme, so the lexical recognition process will help prevent incorrect segmentation. As the boundary-based votes become more fine-grained, however, the fixed score of the lexical recognition process will dominate. This will prevent the boundary-based votes from discovering new words, explaining the decrease in LF over time. A more nuanced lexical recognition process should account for this, perhaps by including a decay parameter that decreases $ \alpha $ gradually, relying less on the lexical recognition process as the boundary-based votes become more stable predictors. Using DYMULTI, it is easy to explore the inclusion of such a lexical process, without needing to define or run a new model.
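To illustrate the suggestion, the sketch below adds a hypothetical exponential decay to the lexical recognition bonus; the schedule, half-life and function names are our own assumptions, not part of the published DYMULTI model:

```python
def decayed_alpha(alpha_0, utterances_seen, half_life=1000.0):
    """Exponentially decay the lexical recognition bonus over training,
    so boundary-based votes dominate once they become stable."""
    return alpha_0 * 0.5 ** (utterances_seen / half_life)

def candidate_score(boundary_score, word, lexicon, alpha):
    """Combine the boundary-based score with a (decaying) lexical bonus."""
    return boundary_score + (alpha if word in lexicon else 0.0)

# Example: after 3000 utterances with alpha_0 = 0.5 and a 1000-utterance
# half-life, the bonus has dropped to 0.0625.
print(decayed_alpha(0.5, 3000))  # 0.0625
```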
Comparison to previous studies
Table 7 compares DYMULTI-23 and MULTICUE-23 to other models in the word segmentation literature. Note that these models differ in their evaluation procedures. The first four models are incremental, so scores are calculated over a single run of the BR corpus (averaged over 100 independent runs over shuffles of the corpus in the case of Venkataraman (Reference Venkataraman2001)). The next two models are incremental but run over the corpus multiple times, reporting only the results of the final run. The following two models are batch-based, so scores are calculated after many iterations of training over the corpus (ranging from two to several thousand). DYMULTI-23 with $ \alpha =0 $ achieves higher BF and WF scores than all of these, with a comparable LF score. Using $ \alpha =0.5 $ results in the highest LF score, but this is potentially misleading, as discussed in the previous section. It is notable that DYMULTI-23 outperforms the several-run models of Fleck (Reference Fleck2008) and Ma et al. (Reference Ma, Çöltekin and Hinrichs2016) and the batch-based models of Elsner & Shain (Reference Elsner and Shain2017) and Goldwater et al. (Reference Goldwater, Griffiths and Johnson2009), even though these do not suffer from the low performance of the initial learning phase. It is also worth noting that our implementation of PHOCUS-1S already outperforms most previous studies.
Note. MULTICUE-23 and DYMULTI-23 are compared to a variety of the top-performing models from the child word segmentation literature. BP, BR and BF stand for boundary precision, recall and F1-score. W and L scores are similar for the word and lexicon measures. $ {E}_u $ and $ {E}_o $ give under-segmentation and over-segmentation. Scores are obtained on the BR corpus, with the highest scores and lowest error rates in bold. If there were multiple models reported in a study, the model with the highest LF score is given. The scores across models are not always directly comparable, as some are calculated differently from others.
a Our implementation of PHOCUS-1S is used as a stand-in for the model of Blanchard et al. (Reference Blanchard, Heinz and Golinkoff2010) as they only report WF.
Many previous studies only explore one or two cues for segmentation, stating that they expect a model considering more cues to perform better (Blanchard et al., Reference Blanchard, Heinz and Golinkoff2010; Çöltekin, Reference Çöltekin2017; Ma et al., Reference Ma, Çöltekin and Hinrichs2016). The set of cues chosen here seems particularly effective, with MULTICUE-23 alone achieving higher BF and WF scores than previous studies. The LF score for DYMULTI-23 with $ \alpha =0 $ is still lower than many of the other models in Table 7, only outperforming the models of Çöltekin (Reference Çöltekin2017), Fleck (Reference Fleck2008) and Ma et al. (Reference Ma, Çöltekin and Hinrichs2016), which do not store a lexicon at all, but setting $ \alpha =0.5 $ remedies this, leading to the highest LF score.
This comparison confirms both the strength of the boundary cues included and the strength of the syllabic nucleus constraint at the lexical level.
Cross-lingual evaluation
Comparison of PHOCUS, MULTICUE and DYMULTI across 26 languages
The majority of studies that present models for child word segmentation only report results on English transcripts, typically only using the BR corpus. Exceptions include Blanchard et al. (Reference Blanchard, Heinz and Golinkoff2010), who also report results on a Sesotho corpus, Fleck (Reference Fleck2008), who also reports results on Spanish and Arabic corpora, and Caines et al. (Reference Caines, Altmann-Richer and Buttery2019), who compared three different segmentation models on 26 languages. As these models are designed to represent the ability of a child to acquire any language, proper evaluation is incomplete if the models are not run on a wide variety of languages. Otherwise, the models could be biased towards learning English, and therefore not represent a truly universal language acquisition procedure.
Figure 8 compares the LF scores of PHOCUS-1S, MULTICUE-17 and DYMULTI-23 across 26 languages, using the transcripts from Caines et al. (Reference Caines, Altmann-Richer and Buttery2019). The models evaluated in that study either perform significantly worse than those presented here or process the transcripts several times, so they are not directly comparable to the three single-run, incremental state-of-the-art models compared here. To account for the initial learning curve, which may differ between models and languages, we only report scores achieved over the last 5000 utterances of each transcript, after the models have stabilised.
For 15 of the 26 languages, DYMULTI-23 achieves a significantly higher LF score than MULTICUE-17 and PHOCUS-1S (according to a paired t-test, with $ p<0.001 $). This is significantly above the chance level of $ \frac{26}{3}\approx 8.7 $ languages for this experiment. If we order the languages by LF score, English comes second for MULTICUE-17 and PHOCUS-1S and fourth for DYMULTI-23. As DYMULTI builds on MULTICUE and PHOCUS, which themselves build on previous work, this suggests that research into child segmentation has been biased towards performance on English corpora. Also note that PHOCUS-1S achieves a higher LF score than DYMULTI-23 for English, which was not the case for the BR corpus (as seen in Figure 5). This highlights the importance of testing on multiple corpora, as the transcription system seems to affect the performance of the models.
Correlating DYMULTI performance with language features
As a preliminary investigation into why DYMULTI-23 performs better on some languages than others, we grouped the languages by their sub-families using Glottolog (Hammarström et al., Reference Hammarström, Forkel, Haspelmath and Bank2022). Any languages that did not share a sub-family with any other languages were classed as ‘Other’. The resulting groups are:
• Sinitic (Cantonese and Mandarin)
• Germanic (Danish, Dutch, English, German, Icelandic, Norwegian and Swedish)
• Balto-Slavic (Serbian and Croatian)
• Italic (French, Italian, Portuguese, Romanian and Spanish)
• Other (Basque, Estonian, Farsi, Greek, Hungarian, Indonesian, Irish, Japanese, Korean and Turkish)
All languages in the Italic group are also Romance languages (a modern subgroup of the Italic languages). These languages share a number of features: they are moderately inflecting, have a primarily subject-verb-object word order and use stress accent. Germanic languages also use stress accent but vary when it comes to inflection; German and Icelandic have complex inflectional morphology whereas English and Swedish are largely analytic. Germanic languages also have verb-second word order (English is an exception), unlike the other families. Serbian and Croatian are mutually intelligible varieties of Serbo-Croatian, a highly inflectional language with flexible word order, often defaulting to subject-verb-object as with the Italic languages. Serbo-Croatian also has a simple tone system.

The Germanic, Balto-Slavic and Italic groups are all divisions of the Indo-European family. The Sinitic group, on the other hand, is a sub-division of the Sino-Tibetan family. These languages have relatively simple morphology, with no inflections or conjugations; the basic word order is subject-verb-object and modifiers usually precede the words they modify. One distinguishing feature of Sinitic languages is the use of tones to distinguish words. Of the languages in the ‘Other’ group, Greek, Farsi and Irish are also Indo-European, members of the Hellenic, Indo-Iranian and Celtic sub-families, respectively. The remaining languages are all members of different top-level families: Koreanic (Korean), Uralic (Hungarian, Estonian), Japonic (Japanese), Turkic (Turkish) and Austronesian (Indonesian). Basque is a language isolate, not classified into any family (King, Reference King2018).

Figure 9 presents DYMULTI-23 LF scores with languages grouped using this classification. We see that DYMULTI-23 performs better on all Sinitic and Germanic languages than on any Balto-Slavic or Italic language. This suggests that the features of a language family can predict how easily the segmentation of its languages can be learned.
To investigate this further, we extracted the structural properties of each language from the World Atlas of Language Structures (WALS) database (Dryer & Haspelmath, Reference Dryer and Haspelmath2013). WALS uses 192 grammatical, phonological and lexical features to describe cross-linguistic diversity, each feature taking between 2 and 28 values. We then calculated the feature-value pairs that best predicted the DYMULTI-23 LF score achieved for each language.
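As a purely illustrative sketch of how such predictive pairs might be identified (a hypothetical procedure, not necessarily the one used here), each feature-value pair can be scored by how far the mean LF of languages holding that value departs from the mean LF of the remaining languages:

```python
def rank_feature_values(wals, lf_scores):
    """wals: {language: {feature: value}}; lf_scores: {language: LF}.
    Returns (gap, feature, value) triples, most predictive first."""
    pairs = {(f, v) for feats in wals.values() for f, v in feats.items()}
    ranked = []
    for feature, value in pairs:
        with_value = [lf_scores[lang] for lang, feats in wals.items()
                      if feats.get(feature) == value]
        without = [lf_scores[lang] for lang, feats in wals.items()
                   if feats.get(feature) not in (None, value)]
        if with_value and without:
            gap = abs(sum(with_value) / len(with_value)
                      - sum(without) / len(without))
            ranked.append((gap, feature, value))
    return sorted(ranked, reverse=True)
```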
Figure 10 presents the ten most predictive of these feature-value pairs. We see that among languages in which the adjective precedes the noun, those in which the object precedes the verb (line 6) tend to be harder to segment than those in which the verb precedes the object (line 7). Six of the ten most predictive features concern word order, significantly more than expected given that only 123 of the 510 feature-value pairs considered concern word order. Korean and Japanese share five of these word order features, and DYMULTI-23 achieves its lowest performance on these two languages. As many of the cues used by DYMULTI-23 identify infrequent phoneme combinations across word boundaries, it is understandable that languages with particular word order features would be easier for this model to segment than others.
For more fine-grained analysis, we repeated this experiment considering just languages in the Balto-Slavic, Germanic and Italic sub-families. The results are presented in Figure 11. Once again, six of the ten features are to do with word order. The Italic languages have Noun-Adjective word order, whereas the Balto-Slavic and Germanic languages have Adjective-Noun word order. There are also two phonological features selected. The Italic languages all have ultimate or penultimate stress. Given that DYMULTI-23 does not use a stress cue, this could explain the relatively low performance on Italic languages; perhaps for these languages, with their relatively freer word order, stress is a crucial cue for segmentation.
Correlating DYMULTI performance with information-theoretic measures
As the cues used by DYMULTI-23 are largely statistical, we decided to compare the information-theoretic properties of each language. We calculated the following measures:
• the number of unique phonemes in each transcript,
• the entropy of phoneme $ n $-grams, and
• the conditional entropy of phoneme $ n $-grams.
The first measure was chosen as one of the most basic metrics for phonological complexity (Nettle, Reference Nettle1995). Historically, the number of vowels or the number of consonants have also been used as count-based measures (Moran & Blasi, Reference Moran and Blasi2014). Count-based measures are simple to calculate but do not consider the phonotactics of a language. Entropy measures capture the phonological complexity of each language according to the predictability of the phonemes in that language, inherently capturing the nuanced interactions that the count-based measures do not (Piantadosi et al., Reference Piantadosi, Tily and Gibson2011; Pimentel et al., Reference Pimentel, Roark and Cotterell2020). When calculating these measures we use the target segmentation transcripts, which include spaces, so across-word and within-word probabilities are both considered.
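As an illustration, the sketch below computes all three measures from a space-segmented transcript treated as a flat symbol sequence; the example line and representation details are our own assumptions:

```python
import math
from collections import Counter

def ngram_entropy(symbols, n):
    """Shannon entropy (bits) of the n-gram distribution."""
    grams = Counter(tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1))
    total = sum(grams.values())
    return -sum(c / total * math.log2(c / total) for c in grams.values())

def conditional_entropy(symbols, n):
    """H(x_n | x_1..x_{n-1}) via the chain rule: H(n-gram) - H((n-1)-gram)."""
    return ngram_entropy(symbols, n) - ngram_entropy(symbols, n - 1)

transcript = "ju want tu si D6 bUk"    # illustrative BR-style line
symbols = list(transcript)             # phonemes plus word-boundary spaces
print(len(set(symbols)))               # unique symbol count
print(conditional_entropy(symbols, 3)) # 3-gram conditional entropy
```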
Figure 12 presents DYMULTI-23 LF scores plotted against the 3-gram conditional entropy of each language transcript used. There is a strong negative correlation of -0.65. This suggests that the more predictable a phoneme is within context, the more capable DYMULTI-23 is of identifying boundaries.
Figure 13 gives the correlations between each information-theoretic measure and each of the three F1-scores that DYMULTI achieves. It is clear that the information-theoretic statistics of each language are strongly linked to the performance of segmentation models. For Italic languages, whose phonemes are not as predictable within context compared to Germanic languages, additional cues such as stress may be required for successful segmentation.
Summary
In this section, we evaluated PHOCUS, MULTICUE and the new DYMULTI framework. Our re-implementations of PHOCUS and MULTICUE achieved similar performance to their original studies, successfully replicating their findings. We also found that our new set of cues for MULTICUE significantly improves its performance.
Comparing DYMULTI to PHOCUS and MULTICUE, we found that the syllabic nucleus constraint significantly improved performance over MULTICUE models using the same set of cues. The lexical recognition process also improved average performance, but this was due to a very fast initial learning period; performance actually decreased over time when $ \alpha $ was set to 0.5 or 1. DYMULTI also outperformed prior work, including models that use batch training or multiple runs over the corpus.
Finally, we performed cross-lingual evaluation at a scale not previously attempted for state-of-the-art segmentation models. This validated the performance of DYMULTI-23, which outperformed PHOCUS-1S and MULTICUE-17 on 15 of 26 languages. Analysing the structural properties of each language, we found that word order features are predictive of DYMULTI-23 performance. An information-theoretic analysis of the language transcripts then revealed several correlations between these measures and the performance of DYMULTI-23. Taken together, these two findings suggest that for less predictable languages, which tend to have freer word orders, additional cues such as stress may be needed for successful segmentation.
Discussion and Summary
In this study, we explored both boundary-finding and language modelling methods for word segmentation, producing a new segmentation framework, DYMULTI, that combines the powerful boundary decisions from the MULTICUE framework of Çöltekin & Nerbonne (Reference Çöltekin and Nerbonne2014) with the lexical constraints of the PHOCUS-1S model of Blanchard et al. (Reference Blanchard, Heinz and Golinkoff2010). In this section, we first consider the performance of DYMULTI with respect to the different views of speech processing. We then discuss our re-implementations of PHOCUS and MULTICUE and the novel benchmarking that we have carried out in this study. Finally, we describe the limitations of this study and future directions, concluding with the wider implications of this research.
Performance of DYMULTI
We presented two views of speech processing from the cognitive science literature. The interactionist view states that speech segmentation is driven by lexical recognition. The explicit view states that segmentation is purely a result of placing boundaries using sub-lexical information, without making use of any lexical influences.
Our goal was to compare the language modelling and boundary-finding approaches to solving the speech segmentation problem (which relate to these views of speech processing) and to establish if combining these approaches would lead to improvements in performance on transcriptions of child-directed speech. To achieve this, we first reimplemented the PHOCUS models of Venkataraman (Reference Venkataraman2001) and Blanchard et al. (Reference Blanchard, Heinz and Golinkoff2010) and the MULTICUE models of Çöltekin & Nerbonne (Reference Çöltekin and Nerbonne2014) and Çöltekin (Reference Çöltekin2017). Both models can make use of sub-lexical and lexical information, but PHOCUS primarily uses lexical cues and MULTICUE primarily uses sub-lexical cues. As such, these models corroborate the studies in the cognitive science literature that find that children make use of sub-lexical and lexical cues for solving the word segmentation problem.
Until this study, it was not possible to conclude whether sub-lexical and lexical cues were complementary or alternative explanations for segmentation, as no model had been designed that was able to efficiently combine lexical and sub-lexical cues without constraint. Presenting the DYMULTI framework, we confirmed an improvement over PHOCUS and MULTICUE models. This implies that sub-lexical and lexical cues are indeed complementary and that both can be helpful for solving the word segmentation problem.
We also found that DYMULTI-23 outperforms previous state-of-the-art systems for segmentation on the BR corpus, including those that run over the corpus several times or learn in batch. As DYMULTI makes use of cues found to be used by infants for speech segmentation and builds on established, state-of-the-art models, this means that DYMULTI is a good representation of how children may learn to segment speech and begin to build their lexicon.
Novel benchmarking of state-of-the-art models
Besides presenting a new state-of-the-art model for segmentation, a major contribution of this study is the thorough benchmarking and replication of the PHOCUS and MULTICUE frameworks. Replication is an important scientific practice, yet few state-of-the-art models have been re-implemented in prior work. Re-implementing these frameworks, we achieved results comparable to their respective studies. Running MULTICUE-14 without the stress cue, we confirmed the result of Çöltekin & Nerbonne (Reference Çöltekin and Nerbonne2014) that removing this cue increases performance. We also validated the finding of Blanchard et al. (Reference Blanchard, Heinz and Golinkoff2010) that the syllabic nucleus constraint improves performance, using this as the core motivation for the design of DYMULTI.
Despite most studies in this field using the same corpus for evaluation, they all evaluate their models differently, making cross-comparison difficult. This is also the case for the PHOCUS and MULTICUE frameworks. For example, the PHOCUS studies do not provide boundary scores and the MULTICUE studies do not provide the learning rates of their models. In this study, we compared these models with a wide range of experiments, including calculating average scores over the whole corpus, plotting learning rates over time and performing novel cross-lingual evaluation. This is the first time these models have been directly compared, producing a useful survey of the field. The cross-lingual evaluation is particularly noteworthy, as few state-of-the-art models have previously been compared on more than two languages. This needs to become a regular practice if the goal of these models is truly to understand how any language is acquired, not just English.
Limitations of child-directed corpora
One of the strengths of the DYMULTI framework is that it is much more flexible than previous models as it can easily consider multiple boundary-based cues and lexical processes. This allowed for the combination of phonotactic, utterance-boundary, lexicon and stress cues derived from phonetic transcriptions of child-directed speech.
The BR corpus is the de-facto standard for evaluating such computational models, but it has certain limitations. Containing only 9,790 utterances spoken by nine speakers of east-coast U.S. English recorded in 1987, it is not very representative of child-directed speech in English and all its varieties, let alone other languages. To validate our results, we ran PHOCUS, MULTICUE and DYMULTI on corpora from 26 different languages.
Through this cross-lingual evaluation, we found that the models perform consistently better on English than on most other languages. This is another limitation of using a single corpus for cross-comparison, as it suggests that previous work may have been biased towards performance on the BR corpus, producing models that perform well on English at the expense of other languages. For example, Çöltekin & Nerbonne (Reference Çöltekin and Nerbonne2014) found that including the stress cue decreased the performance of MULTICUE, but this may only be the case for English. In our cross-lingual analysis we found that DYMULTI-23 performs worse on Italic languages than Germanic languages. It may be that with the freer word order and lower phonemic predictability of Italic languages, stress cues play a more important role in segmentation. Using the DYMULTI framework, this could be investigated in future work by experimenting with different cues and implementing different lexical processes to see how these choices alter the performance across languages.
Another limitation of these corpora is that they represent speech as symbolic phonemes. Not only could these transcripts be subject to transcription error and bias, due to the way they were automatically produced, but it is also unclear whether infants have access to phonetic categories at this stage in acquisition. There is increased awareness that infants may not form phonetic categories until later in life (Feldman et al., Reference Feldman, Goldwater, Dupoux and Schatz2021) and there is mounting evidence that we do not fully collapse speech to discrete phonemic categories for higher-level processing, as assumed by the “myth” of categorical perception (see McMurray (Reference McMurray2022) for a review). If infants do form categories, these may even be more fine-grained and numerous than phonemes (Schatz et al., Reference Schatz, Feldman, Goldwater, Cao and Dupoux2021).
In this study we used phonemes as our base unit, focusing on lexical and sub-lexical cues, but it is clear that future work should consider incorporating sub-phoneme cues as well. If a corpus containing continuous audio were used, these cues could be extracted, providing allophonic variation and realistic stress information to the model. This was attempted by Rytting et al. (Reference Rytting, Brew and Fosler-Lussier2010), who used the raw audio associated with the Brent corpus (which is not aligned with the phonemes in the BR corpus) to represent the input stream as phone probability vectors, thus preserving phonetic variation. Unfortunately, they do not make these vectors available, nor will their vectors necessarily align well with the phonemes of the BR corpus (which were derived from the orthographic transcription of the Brent corpus).
One child-directed speech corpus that does contain phonemic transcriptions aligned with raw audio is the CAREGIVER corpus (Altosaar et al., Reference Altosaar, ten Bosch, Aimetti, Koniaris, Demuynck and Heuvel2010). However, the utterances are scripted and the type-token ratio is only 0.002, much smaller than the 0.036 of the BR corpus. In initial experiments we found that segmentation models that rely on seeing a variety of phoneme combinations at boundaries struggle on this corpus, behaving very differently than when run on natural child-directed utterances.
Future work
Segmentation is a developmental process. English-learning infants display some ability to segment words at 7.5 months of age and do not achieve adult-level performance until close to 24 months (Johnson & Jusczyk, Reference Johnson and Jusczyk2001; Jusczyk, Reference Jusczyk1999). During this time, perceptual tuning also occurs and infants’ sensitivity to the universal set of distributional cues narrows to a native inventory (Liu & Kager, Reference Liu and Kager2017). When we evaluate computational models of word segmentation, it is important to note that we are not equating model performance with infant performance. Instead, we treat such computational models as idealised learners of the distributional phonemic signal, with high performance indicating the utility of cues and learning methods for segmentation, rather than directly predicting infant ability. If we had proper corpora documenting the development of segmentation in infants over time, we could use DYMULTI to create empirical predictions to test against this data.
There are many components of DYMULTI that could be expanded upon in future work. One such component is the representation of utterances. Instead of individual symbols, phonemes could be represented by features, such as the 11 phonetic features used in the model of Christiansen et al. (Reference Christiansen, Allen and Seidenberg1998). Language-specific, distributed representations could also replace phonemes, such as the learned acoustic embedding vectors used in the models of Ma et al. (Reference Ma, Çöltekin and Hinrichs2016) and Kamper et al. (Reference Kamper, Jansen and Goldwater2016) or the probabilistic phone vectors of Rytting et al. (Reference Rytting, Brew and Fosler-Lussier2010).
Furthermore, as suitable corpora become available – such as the new ‘day long’ corpora released in HomeBank (VanDam et al., Reference VanDam, Warlaumont, Bergelson, Cristia, Soderstrom, De Palma and MacWhinney2016) – the number and quality of input cues can be improved. Future work could also investigate semantic and multi-modal information that parents may provide their children, such as deictic gestures towards images, joint attention on entities in the environment or iconic gestures demonstrating object shapes. It is likely that with more cues available, performance would increase, improving our understanding of language acquisition. With sufficient data, DYMULTI could even be used to explore how statistical learning varies between individual children, rather than assuming learned probabilities are similar across an entire population (Siegelman & Frost, Reference Siegelman and Frost2015).
Future work should also seek to bridge the gap between models operating on phonemic transcripts and those operating on raw speech signals. The latter continue to perform significantly worse than the former: the top-performing model for the Zero Resource Speech Challenge (ZRC) series segmentation task achieves a token F1-score of only 19.2 on the English portion of the TDE-17 test corpus, compared to 64.5 for the text-based topline system provided by the task. In an effort to explain this gap, Dunbar et al. (Reference Dunbar, Hamilakis and Dupoux2022) discuss how the higher granularity of analysis, the lack of invariant quantised acoustic representations and the variability of speech rate all contribute. Many of the ZRC series models operate on 10ms frames, much shorter than the average duration of a phoneme (about 70ms). DYMULTI could be run at this higher granularity of analysis, with features extracted directly from the speech stream, to help bridge this gap.
We also note that the ZRC series tasks evaluate their models on adult-directed speech corpora, whereas traditional segmentation models have strictly required that the input data be child-directed. Relaxing this requirement would help with the lack of suitable corpora, help to bridge the gap between these two approaches and could lead to new revelations about acquisition. Using her model, Fleck (Reference Fleck2008) explored how infant speech segmentation could be upgraded to adult speech segmentation. She did this by introducing a simple syntactic process to prevent affixes from being segmented away from their stems, achieving WF and LF scores of 80.3 and 41.5 respectively on the Buckeye corpus (Pitt et al., Reference Pitt, Johnson, Hume, Kiesling and Raymond2005). In initial experiments, DYMULTI performs worse than Fleck’s model, with WF and LF scores of 69.9 and 40.6 respectively. Fleck hypothesised that models for infant speech segmentation may be segmenting morphemes, rather than words. Further work into DYMULTI could implement a syntactic process such as Fleck’s to investigate this claim.
Conclusion
In this study, we presented the word segmentation problem and compared two state-of-the-art approaches to speech segmentation: the PHOCUS models of Venkataraman (Reference Venkataraman2001) and Blanchard et al. (Reference Blanchard, Heinz and Golinkoff2010), a language modelling approach, and the MULTICUE models of Çöltekin & Nerbonne (Reference Çöltekin and Nerbonne2014) and Çöltekin (Reference Çöltekin2017), a boundary-finding approach. By re-implementing both, we found that MULTICUE lacked the ability to consider lexical-level processes and PHOCUS lacked the ability to combine information from several cues. We created a new model, DYMULTI, which overcomes both drawbacks by using the boundary decisions of MULTICUE and converting them to scores that can be passed to the Viterbi algorithm of PHOCUS. In doing so, we achieved state-of-the-art performance on the BR corpus. Evaluating the model cross-lingually, we found that DYMULTI outperformed MULTICUE and PHOCUS on 15 of 26 child-directed speech corpora from different languages, but also that all three models achieved close to their best performance on English, suggesting possible research bias.
DYMULTI represents a flexible framework for exploring hypotheses related to the word segmentation problem, efficiently combining lexical and sub-lexical cues. In this study we have explored predictability, utterance and stress as sub-lexical cues, and a syllabic nucleus constraint and lexical recognition as lexical cues. Such cues could be enhanced with updated knowledge about infant speech cognition to produce a more comprehensive model for speech segmentation. The framework can also be upgraded by considering more nuanced representations of utterances, alternative cue-combination algorithms and other cues for segmentation, once suitable corpora are available.
The potential impacts of DYMULTI and future research into child speech segmentation are numerous. Using DYMULTI’s cue combination system, we can improve our understanding of which cues are relevant to segmentation, which in turn can aid segmentation in speech recognition systems. Adult speech segmentation could also be studied with DYMULTI, by testing on adult-directed speech corpora to examine which cues remain relevant and adapting the framework with new grammatical processes accordingly. Finally, this research contributes to the ever-growing understanding of language acquisition. By designing segmentation models that perform well on child-directed speech, we can learn how children first solve this task, informing how language is taught to children and how language disorders may be mitigated.
Acknowledgements
We thank Çağrı Çöltekin for clarifying the workings of his model. The first author was supported by the Cambridge Trust and the second and third authors were supported by Cambridge Assessment English, University of Cambridge. We also thank the editors and anonymous reviewers for their feedback which helped to greatly improve this article.
Competing interest
The authors declare none.