Introduction
Across human populations, children’s early language experiences vary substantially with respect to who talks to them, what is talked about, and what the children themselves are expected to contribute (e.g., Brown, Reference Brown, Duranti, Ochs and Schieffelin2011; Brown & Gaskins, Reference Brown, Gaskins, Enfield, Kockelman and Sidnell2014; Casillas et al., Reference Casillas, Brown and Levinson2020; de León, Reference de León, Duranti, Ochs and Schieffelin2011; Demuth & Mputhi, Reference Demuth and Mputhi1979; Gaskins, Reference Gaskins, Enfield and Levinson2006; Ochs & Schieffelin, Reference Ochs, Schieffelin, Schweder and LeVine1984; Pye, Reference Pye1986; Rogoff et al., Reference Rogoff, Paradise, Arauz, Correa-Chávez and Angelillo2003; Shneidman & Goldin-Meadow, Reference Shneidman and Goldin-Meadow2012; Vogt et al., Reference Vogt, Mastin and Schots2015). For example, home pedagogical techniques, such as caregiver use of rhetorical questions and directly addressed instructions, are more common in some linguistic contexts than others (e.g., US versus Mayan groups, see e.g., Gaskins, Reference Gaskins, Harkness and Super1996; Rogoff et al., Reference Rogoff, Paradise, Arauz, Correa-Chávez and Angelillo2003; Shneidman et al., Reference Shneidman, Gaskins and Woodward2016).
Research today, primarily revolving around urban, Western contexts, situates child-directed speech (CDS) – more specifically, interactive speech produced by adult caregivers – as fundamental for early language development (e.g., Cartmill et al., Reference Cartmill, Armstrong, Gleitman, Goldin-Meadow, Medina and Trueswell2013; Hoff, Reference Hoff2003; Ramírez-Esparza et al., Reference Ramírez-Esparza, García-Sierra and Kuhl2014, Reference Ramírez-Esparza, García-Sierra and Kuhl2017a, Reference Ramírez-Esparza, García-Sierra and Kuhl2017b). Recent findings converge on the idea that so-called “high quality” (interactive, one-on-one) CDS is a consistent and robust predictor of children’s growing vocabulary (e.g., Ramírez-Esparza et al., Reference Ramírez-Esparza, García-Sierra and Kuhl2014; Rowe, Reference Rowe2008). However, the focus of most research using CDS to predict vocabulary outcomes reflects the political and economic priorities of growing, urban societies – especially their need for a unified and literate workforce. These priorities may not generalize across understudied cultural-linguistic contexts, where other language phenomena (e.g., specific rhetorical practices) may prove more relevant (Ochs & Kremer-Sadlik, Reference Ochs and Kremer-Sadlik2020; Sperry et al., Reference Sperry, Miller and Sperry2015).
Recent cross-linguistic and cross-cultural work on typically developing children supports the idea that there is significant natural variation in children’s exposure to CDS. For example, Shneidman (Reference Shneidman2010; Shneidman & Goldin-Meadow, Reference Shneidman and Goldin-Meadow2012) found an almost ten-fold difference in the proportion of child-directed speech in the linguistic environments of Chicago (US) and Yucatec Mayan (Mexico) children before age three. Scaff, Cristia, and colleagues find that Tsimane’-acquiring children (Bolivia) are directly spoken to infrequently, with recent estimates as low as approximately one half-minute per hour (Cristia et al., Reference Cristia, Ganesh, Casillas and Ganapathy2018; Scaff et al., Reference Scaff, Casillas, Stieglitz and Cristia2023). Relatedly, Casillas and colleagues (Casillas et al., Reference Casillas, Brown and Levinson2020, Reference Casillas, Brown and Levinson2021) found surprisingly similar, and relatively infrequent rates of directed input in two rural populations with substantially different approaches to child language socialization (Tseltal Mayan (Mexico) and Rossel Island Papuan (Papua New Guinea)). A recurrent theme across much of this work examining CDS in rural and developing populations has been the role of input from other children (e.g., siblings, cousins, and other peers; see also Alam et al., Reference Alam, Ramírez and Migdalek2021; Cristia et al., Reference Cristia, Gautheron and Colleran2023; Loukatou et al., Reference Loukatou, Scaff, Demuth, Cristia and Havron2022). Cristia (Reference Cristia2023) pulls all these findings and more together into a systematic review, highlighting a consistent difference in higher versus lower input rates between urban and rural societies, respectively.
It is not yet understood how differences in CDS exposure play a role in how children process or learn language in their first few years. The emerging evidence on this topic in a cross-linguistic and cross-cultural context is complex. For example, Ramírez-Esparza et al. (Reference Ramírez-Esparza, García-Sierra and Kuhl2017b) found that CDS heard in a group context (as opposed to one-on-one interactions) was related to vocabulary development in US Spanish-English bilinguals but not monolinguals from the same population. Consistent with this view, studies of populations where caregiver CDS appears relatively rare have found that young children meet language development milestones at roughly the same rate as children growing up in contexts where adult CDS is reported to be very common (Casillas et al., Reference Casillas, Brown and Levinson2020, Reference Casillas, Brown and Levinson2021), though lexical development may be more sensitive (Ramírez-Esparza et al., Reference Ramírez-Esparza, García-Sierra and Kuhl2017b; Shneidman & Goldin-Meadow, Reference Shneidman and Goldin-Meadow2012).
In short, there is a great deal yet to learn about how language learning is supported by CDS and other sources of input. These other sources may include adult conversations that young children observe (passively or actively), CDS produced by other children, and multimodal and multiparty interactions (Alam et al., Reference Alam, Ramírez and Migdalek2021; Casillas et al., Reference Casillas, Brown and Levinson2020, Reference Casillas, Brown and Levinson2021; Cristia et al., Reference Cristia, Ganesh, Casillas and Ganapathy2018, Reference Cristia, Gautheron and Colleran2023; de León, Reference de León1998; de León & García-Sánchez, Reference de León and García-Sánchez2021; Hou, Reference Hou2024; Loukatou et al., Reference Loukatou, Scaff, Demuth, Cristia and Havron2022; Scaff et al., Reference Scaff, Casillas, Stieglitz and Cristia2023).
The present study takes a first step toward describing multiple sources of input – not just CDS – across a linguistically and culturally diverse sample of young children. Specifically, we examine how child age and cultural-linguistic group influence the quantities of directly addressed and overhearable adult speech that children encounter in five distinct settings. Before we dive into the methods and findings, we will set up the present study with a brief overview of relevant work on measuring children’s linguistic input: first we define ‘child-directed speech’ and ‘adult-directed speech’ as we use them here; then we review the major factors known to influence the quantities of each input source; finally, we describe prior approaches taken in estimating these input sources from daylong audio recordings.
What counts as “child-directed” input?
A great deal of prior work has contrasted child- and adult-directed speech, but what gets counted as “child-directed” varies from study to study. There are two basic approaches. In the first, these two terms (“CDS” and “ADS”) are used to denote the intended addressee, i.e., child vs. adult. In the second approach, these terms denote the speech register or other characteristics of the speech, regardless of actual addressee. That is, any speech that contains the prosodic, lexical, grammatical, and affective characteristics typically associated with speech to children is classified as child-directed speech, regardless of who was being spoken to. In the present study, we will measure linguistic input quantities based on the first approach: the utterance’s intended addressee (e.g., separating speech exclusively directed to the target child versus to another child versus to an adult, etc.).
While qualitative properties of different input types are also vital to consider when constructing comparative theories of child language development (e.g., Bornstein et al., Reference Bornstein, Tal, Rahn, Galperin, Pecheux, Lamour and Tamis-LeMonda1992; Broesch et al., Reference Broesch, Rochat, Olah, Broesch and Henrich2016; Brown, Reference Brown, Arnon, Casillas, Kurumada and Estigarribia2014; de León & García-Sánchez, Reference de León and García-Sánchez2021; Masek et al., Reference Masek, Ramirez, McMillan, Hirsh-Pasek and Golinkoff2021; Ochs & Schieffelin, Reference Ochs, Schieffelin, Schweder and LeVine1984; Pye, Reference Pye2017), input quantities are ideal for roughly comparing the linguistic material children encounter in their daily lives. Moreover, input quantity estimates that are centered specifically on directed vs. non-directed speech can capture some aspects of input “quality”. Addressees have an advantage in comprehending conversational talk addressed to them over talk addressed to others, precisely because the conversational talk in question is tailored specifically to the addressees’ immediate comprehension (Bell, Reference Bell1984; Foushee et al., Reference Foushee, Srinivasan and Xu2021; Schober & Clark, Reference Schober and Clark1989). Thus, general (and likely universal) mechanisms of human coordination (Clark, Reference Clark1996) predict that child-addressed speech is a referentially clearer linguistic signal for the child learner than adult-directed speech.
In the present study, we compare adult-directed speech (ADS) quantities and target-child directed speech (TCDS) quantities, the latter being speech addressed specifically to the child under study, rather than to another nearby child. These measures represent two qualitatively distinct sources of linguistic input; our present study could thus be described as measuring the quantity of two quality types.
Factors shaping input quantity
A broad spectrum of factors has been suggested to influence the quantity of CDS children encounter in their daily lives. Much less work has investigated factors influencing the quantity of ADS children encounter. We briefly summarize the primary factors examined in prior work, from the macro scale to the micro scale. These factors inform the present study’s analyses.
On the macro end of the spectrum, CDS quantities are thought to be influenced through group membership – for example, via socioeconomic group membership or via culturally held beliefs and practices around child rearing (e.g., for a recent review, see Rowe & Weisleder, Reference Rowe and Weisleder2020; for reviews regarding language socialization and culture, see Gaskins, Reference Gaskins, Enfield and Levinson2006; Ochs & Schieffelin, Reference Ochs, Schieffelin, Schweder and LeVine1984). For example, regarding socioeconomic group, meta-analyses of nearby adult talk in daylong audio data (Piot et al., Reference Piot, Havron and Cristia2022) and CDS in naturalistic, unstructured interaction data (Dailey & Bergelson, Reference Dailey and Bergelson2022) suggest a small but significant positive correlation of linguistic input quantity with socioeconomic status (but cf. Bergelson et al., Reference Bergelson, Soderstrom, Schwarz, Rowland, Ramirez-Esparza, Hamrick, Marklund, Kalashnikova, Guez, Casillas, Benetti, Alphen and Cristia2023).
Regarding cultural group, some prior work found no evidence for differences in baseline TCDS rate between Tseltal- and Yélî Dnye-speaking children under age three, despite clear ethnographic evidence that adults in these two communities take very different approaches to talking to infants and young children (“non-child-centric” vs. “child-centric” input environments; Brown, Reference Brown, Duranti, Ochs and Schieffelin2011, Reference Brown, Arnon, Casillas, Kurumada and Estigarribia2014; Brown & Casillas, Reference Brown, Casillas, Fentiman and Goody2025; Casillas et al., Reference Casillas, Brown and Levinson2020, Reference Casillas, Brown and Levinson2021). In contrast, Shneidman and Goldin-Meadow (Reference Shneidman and Goldin-Meadow2012) did observe clear differences in US and Yucatec Mayan children’s input quantities, with the US children under age three hearing significantly more directed input. This evidence concords with Cristia’s (Reference Cristia2023) characterization of the primary split in input quantities being in rural versus urban populations, rather than differences between individual cultural groups. Complementing this work on input quantity, studies of input quality consistently show clear cross-cultural variability in how often children are talked to, by whom, and what is talked about (e.g., de León, Reference de León, Duranti, Ochs and Schieffelin2011; Demuth & Mputhi, Reference Demuth and Mputhi1979; Gaskins, Reference Gaskins, Enfield and Levinson2006; Ochs & Schieffelin, Reference Ochs, Schieffelin, Schweder and LeVine1984; Pye, Reference Pye1986; Rogoff et al., Reference Rogoff, Paradise, Arauz, Correa-Chávez and Angelillo2003; Rosemberg et al., Reference Rosemberg, Alam, Audisio, Ramirez, Garber and Migdalek2020, Reference Rosemberg, Alam, Ramirez and Ibañez2023; Stein et al., Reference Stein, Menti and Rosemberg2021; Vogt et al., Reference Vogt, Mastin and Schots2015).
In the meso part of the spectrum, children’s age and available interactants may also shape input quantities. Regarding age, prior work does not consistently demonstrate evidence of change in CDS quantity with child age but does demonstrate age-related change for other input sources, including ADS and non-canonical CDS (Bergelson et al., Reference Bergelson, Casillas, Soderstrom, Seidl, Warlaumont and Amatuni2019b; Casillas et al., Reference Casillas, Brown and Levinson2020, Reference Casillas, Brown and Levinson2021; Ramírez-Esparza et al., Reference Ramírez-Esparza, García-Sierra and Kuhl2017b; see also Shneidman & Goldin-Meadow, Reference Shneidman and Goldin-Meadow2012). Regarding available interactants, prior work points to a greater availability of CDS from adults compared to children – and, among adults, from women compared to men (e.g., Bergelson et al., Reference Bergelson, Casillas, Soderstrom, Seidl, Warlaumont and Amatuni2019b; Shneidman & Goldin-Meadow, Reference Shneidman and Goldin-Meadow2012). As noted above, however, a recurrent theme in work on rural populations is the presence of other children and hence the high prevalence of peer-produced CDS (Alam et al., Reference Alam, Ramírez and Migdalek2021; Casillas et al., Reference Casillas, Brown and Levinson2020, Reference Casillas, Brown and Levinson2021; Cristia et al., Reference Cristia, Gautheron and Colleran2023; Loukatou et al., Reference Loukatou, Scaff, Demuth, Cristia and Havron2022; Scaff et al., Reference Scaff, Casillas, Stieglitz and Cristia2023).
Lastly, on the micro end of the spectrum, CDS and ADS rates fluctuate moment to moment given factors such as the ongoing activity (e.g., playing or eating), the number of potential interactants present, the physical condition of the target child and their surrounding family (e.g., sleeping/awake, stationary/in motion), and more. Soderstrom and colleagues (Soderstrom et al., Reference Soderstrom, Grauer, Dufault and McDivitt2018; Soderstrom & Wittebolle, Reference Soderstrom and Wittebolle2013) found that linguistic input rates systematically varied depending on the activity context and number of adults present in Canadian daylong recordings (see also Casillas et al., Reference Casillas, Brown and Levinson2020, Reference Casillas, Brown and Levinson2021; Greenwood et al., Reference Greenwood, Thiemann-Bourque, Walker, Buzhardt and Gilkerson2011; Rosemberg et al., Reference Rosemberg, Alam, Audisio, Ramirez, Garber and Migdalek2020, Reference Rosemberg, Alam, Ramirez and Ibañez2023). Although we will not have information about activity context in the present work, we will at least be able to account for the number of individual talkers present. When there are more talkers, there is more talk. That is, the presence of each additional talker increases competition for the conversational floor, and when four or more talkers are present, group conversations often split into smaller, simultaneous conversations, multiplying the amount of observable talk (conversational “schism”; see, e.g., Holler et al., Reference Holler, Alday, Decuyper, Geiger, Kendrick and Meyer2021; Sacks et al., Reference Sacks, Schegloff and Jefferson1978). Minimally, the number of talkers present can be considered a nuisance variable to help explain fluctuations in CDS and ADS rate over the day.
More informatively, however, the number of talkers may serve as a proxy for interactional contexts that involve denser family participation (e.g., in overlapping, co-present subgroups) versus contexts where smaller groups of individuals are on their own.Footnote 1
Extracting CDS and ADS from daylong recordings
Ecologically valid estimates of speech input rates are now possible via long-format (e.g., daylong) recordings of children’s home language environments (e.g., LENA, Greenwood et al., Reference Greenwood, Thiemann-Bourque, Walker, Buzhardt and Gilkerson2011; Bergelson et al., Reference Bergelson, Amatuni, Dailey, Koorathota and Tor2019a; see Pisani et al., Reference Pisani, Gautheron and Cristia2021 for a review). However, to date these recording systems cannot reliably and automatically differentiate between CDS and ADS across a variety of recording settings (for a promising start, see Bang et al., Reference Bang, Kachergis, Weisleder, Marchman, Gong and Kpogo2022). Studies that have leveraged daylong recordings have therefore relied on manual annotation to supplement any automated output, taking several different approaches. For example, Weisleder and Fernald (Reference Weisleder and Fernald2013) manually classified 5-min blocks of time as primarily child-directed or adult-directed, while Ramírez-Esparza and colleagues (Ramírez-Esparza et al., Reference Ramírez-Esparza, García-Sierra and Kuhl2014, Reference Ramírez-Esparza, García-Sierra and Kuhl2017a, Reference Ramírez-Esparza, García-Sierra and Kuhl2017b) manually annotated speech-dense clips of audio as having: (1) speech addressed to the child; (2) speech containing the parentese register features of CDS versus ADS (independent of addressee); and (3) who was present as a conversational partner. Moving from the audio-clip level to the utterance level, Bergelson et al. (Reference Bergelson, Casillas, Soderstrom, Seidl, Warlaumont and Amatuni2019b) extracted individual utterances using LENA’s automated utterance annotations and then annotated them as child- or adult-directed, based on recognizable CDS and ADS register characteristics.
While these studies examine CDS and ADS in large and highly naturalistic datasets, they either take a very coarse perspective (e.g., examining 5-minute intervals), or tell us about input patterns during the day’s interactional peaks rather than illustrating patterns in children’s average language experiences over the course of a day. In order to extract a representative measure of linguistic input, i.e., how much language children encounter from different types of people in different types of interactional contexts across their day (including typical “down” time), we must take random or periodic samples of the language environment (Casillas & Cristia, Reference Casillas and Cristia2019; see also Alam et al., Reference Alam, Ramírez and Migdalek2021; Rosemberg et al., Reference Rosemberg, Alam, Audisio, Ramirez, Garber and Migdalek2020, Reference Rosemberg, Alam, Ramirez and Ibañez2023; Stein et al., Reference Stein, Menti and Rosemberg2021) rather than only analyzing interactional peaks or estimating across time periods. To gather accurate and representative estimates of natural, at-home CDS and ADS in the present study, we therefore randomly sampled clips from daylong audio recordings and fully transcribed all hearable speech, annotating intended addressee for each utterance in each clip.Footnote 2
The current work
We examine baseline rates of target-child-directed speech (TCDS) and adult-directed speech (ADS) in the daylong recordings of children growing up in five culturally and linguistically distinct groups: North American English (“NA English”; US & Canadian), United Kingdom English (“UK English”; England), Argentinian Spanish (“Arg. Spanish”; Argentina), Tseltal (Tenejapa, Mayan, Mexico), and Yélî Dnye (Rossel Island, Papuan, Papua New Guinea). As detailed below, some of these corpora include samples from multiple, distinct sub-populations (e.g., NA English includes both US and Canadian English), so we hereafter refer to each of these samples as “language groups” rather than “languages”. This unique metacorpus draws on seven pre-existing collections of daylong recordings (“corpora”) that were gathered by different research teams, with a variety of different recording devices (i.e., not all LENA), and for a range of different research purposes.Footnote 3 Our primary objective was to quantitatively measure the exposure of young children in these groups to two different sources of linguistic input – TCDS and ADS – and to examine several factors associated with variation in this exposure: age, language group, talker type, and number of talkers present.
To accomplish this goal, we defined a second, critical objective: to generate an audio sampling and annotation approach that could be fruitfully employed across recordings made in culturally and linguistically diverse populations (Soderstrom et al., Reference Soderstrom, Casillas, Bergelson, Rosemberg, Warlaumont and Bunce2021). As motivated above, our analyses focus on two distinct types of linguistic input: TCDS and ADS. Our annotation scheme additionally allows us to examine other types of input, e.g., OCDS (other-child-directed speech, i.e., speech directed to children other than the target child; see Figure 1). For the sake of simplicity, we report data for OCDS in the Supplementary Materials Section 1, where we combine it with TCDS to generate parallel analyses of all-CDS (TCDS + OCDS), with similar results to what is reported below.
Exploratory hypotheses
Following from the findings summarized above, the specific aims of our analysis were to examine how TCDS and ADS varied across age, language group, talker type, and number of talkers in a highly naturalistic and culturally varied set of daylong recordings. To the prior literature, the present work adds an apples-to-apples comparative view on these effects, given that each of the included corpora recorded, sampled, and annotated children’s input in highly similar ways (Soderstrom et al., Reference Soderstrom, Casillas, Bergelson, Rosemberg, Warlaumont and Bunce2021). Before analysis, we established a specific set of exploratory hypotheses – with corresponding regression formulae – regarding TCDS and ADS. We term them “exploratory” here because each corpus includes data from only 9–10 children, which is large relative to prior comparable work, but still small overall (Table 1). These hypotheses were slightly different for TCDS and ADS, based on findings from prior work (see Tables 2 and 3 for detailed overviews).
Parentheses following the mean indicate the range across participants.
The ‘Supported’ column reflects the extent to which each current finding aligns with its predicted outcome.
Target-child-directed speech hypotheses
Based on the work cited above, we expected TCDS rate to vary across language groups (e.g., to be higher in more urban groups) and to come most often from women, but with a greater presence of other-child-produced TCDS in some groups (Tseltal, Yélî Dnye, Argentina). We did not expect any effects of child age on TCDS rate. We expected that the TCDS rate would be higher when more talkers were present, given the idea that more talkers produce more talk.
Adult-directed speech hypotheses
We expected ADS rate to vary across language groups (e.g., to be lower in more urban contexts), to decrease significantly with child age (especially in groups with high ADS rates early on, e.g., Yélî Dnye), and to come most often from women, but with greater contributions from children in some groups (Tseltal, Yélî Dnye, Argentina). We also expected that the ADS rate would be higher when more talkers were present.
We limited our hypotheses to simple effects and two-way interactions. We might anticipate other, more complex effects (e.g., the three-way interaction of age-language group-talker type on TCDS rate), but given the limited size of our metacorpus (N = 10 recordings per corpus maximum) we leave these effects to be tested in future, larger datasets.
The present paper is the first to bring together all these different factors known to influence TCDS and ADS and to examine their joint effects across multiple language groups (see also Bergelson et al. (Reference Bergelson, Soderstrom, Schwarz, Rowland, Ramirez-Esparza, Hamrick, Marklund, Kalashnikova, Guez, Casillas, Benetti, Alphen and Cristia2023) for a LENA-based, comparative mega-analysis of nearby adult input). We examine these factors in order to identify axes of consistency and variation across the multi-corpus sample. While similar analyses have been previously conducted on two individual corpora (Tseltal and Yélî Dnye; Casillas et al., Reference Casillas, Brown and Levinson2020, Reference Casillas, Brown and Levinson2021), the current study offers a first view into baseline TCDS and ADS rates at home in North American English, UK English, and Argentinian Spanish while simultaneously providing a comparative perspective on how each of the proposed factors – group, child age, and interactants – influences children’s input rates.
To reiterate, we do not attempt to examine all possible interactions, and instead take a hypothesis-driven approach to analysis.Footnote 4 Importantly, while we identify key points of theoretically relevant variation across the samples in this study, we do not argue that these language groups represent the full spectrum of diversity in linguistic input experiences, even within the specific populations we have sampled.
Methods
Metacorpus construction
We use the Analyzing Child Language Experiences around the World (ACLEW) metacorpus (Soderstrom et al., Reference Soderstrom, Casillas, Bergelson, Rosemberg, Warlaumont and Bunce2021) of long-form audio recordings of children’s everyday language environments, comprising seven corpora from five culturally and linguistically distinct groups, labeled here as: North American English (NA English), United Kingdom English (UK English), Argentinian Spanish (Arg. Spanish), Tseltal, and Yélî Dnye.Footnote 5 Each group is represented by a single corpus except North American English, for which we had access to three corpora. Recordings for each corpus were originally collected for the unique research purposes of the individual lab contributing the corpus, and therefore there is variation across corpora in the recruitment practices, recording equipment (i.e., not all LENA), recording duration, target child ages (see Supplementary Materials Section 3), and other demographic characteristics (see Table 1 for an overview).
Sampling technique
Our sampling and annotation scheme needed to be suitable for daylong recordings of different durations made with different recording devices, and for variable annotation situations (e.g., in a lab or in the field).
We selected a single day’s recording for 10 children from each corpus, except the McDivitt-Winnipeg corpus from which we selected 9 recordings due to a sampling error (total recordings N = 69); this sample size per corpus reflects what was possible with the smallest corpora in our sample. We used a script to select recordings that were as balanced as possible within and across corpora in reported child gender (male/female), maternal education (below high school–advanced degree), and child age (0;2–3;0; see https://osf.io/pysth/ for details). The range of available ages was more limited in North American English compared to the other corpora, but our statistical approach accounts for this (see also the Supplementary Materials Section 3). Five of the included recordings overlap with those used in Bergelson et al. (Reference Bergelson, Casillas, Soderstrom, Seidl, Warlaumont and Amatuni2019b), and the same Tseltal and Yélî Dnye annotations have been analyzed somewhat differently in separate work (Casillas et al., Reference Casillas, Brown and Levinson2020, Reference Casillas, Brown and Levinson2021).
Each dataset and contributing lab came with a specific set of constraints on what was possible for manual annotation work (e.g., teams of undergraduate students versus individual collaborations with native speakers in remote field sites), so we settled on two basic techniques for sub-sampling and transcribing data from these long-format recordings. These methods for sampling and preparing clips for annotation are illustrated in Figure 1.
For North American English, UK English, and Argentinian Spanish (49 of the 69 recordings), we wrote a Python script to randomly pick start times for 15 two-minute clips from throughout the day of each recording, excluding any possibility of clip overlap. The script selected the start and stop times of each clip, as well as the start and stop times of an associated three-minute context period for each clip (see Figure 1, upper left). Thus each of the 15 clips per recording contained one minute of prior context, followed by two minutes of audio to be transcribed and annotated, followed by two more minutes of additional context. The start and stop times of the context and to-be-transcribed clips were then added automatically to a single ELAN (Wittenburg et al., Reference Wittenburg, Brugman, Russel, Klassmann and Sloetjes2006) audio annotation file that spanned the entire recording. This process resulted in 30 total minutes of annotation per recording.
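The sampling step described above can be sketched as follows. This is a minimal illustration, not the authors' script: the function name `sample_clips`, its parameters, and the 16-hour example recording are hypothetical, and the sketch assumes that whole review windows (context plus clip), rather than only the transcribed two-minute clips, must not overlap.

```python
import random


def sample_clips(recording_s, n_clips=15, clip_s=120, pre_s=60, post_s=120, seed=None):
    """Return n_clips non-overlapping windows (times in seconds from recording start)."""
    rng = random.Random(seed)
    window_s = pre_s + clip_s + post_s           # full review window per clip
    slack_s = recording_s - n_clips * window_s   # time not covered by any window
    if slack_s < 0:
        raise ValueError("recording is too short for the requested clips")

    # Draw sorted offsets within the slack, then shift the i-th window by i
    # window lengths: consecutive starts are at least one window apart, so
    # windows cannot overlap and the last one ends inside the recording.
    offsets = sorted(rng.randint(0, slack_s) for _ in range(n_clips))
    clips = []
    for i, offset in enumerate(offsets):
        start = offset + i * window_s
        clips.append({
            "context_start": start,
            "clip_start": start + pre_s,
            "clip_end": start + pre_s + clip_s,
            "context_end": start + window_s,
        })
    return clips


# Hypothetical 16-hour recording; the seed only makes the draw repeatable.
clips = sample_clips(recording_s=16 * 3600, seed=1)
annotated_minutes = sum(c["clip_end"] - c["clip_start"] for c in clips) / 60
```

With the default values, each window holds one minute of prior context, two minutes to transcribe, and two minutes of following context, and the 15 clips add up to 30 annotated minutes per recording.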
The Tseltal and Yélî Dnye corpora (20 of the 69 recordings) used a similar method, except only 9 clips were randomly selected. However, the clips were longer than in the other corpora. Tseltal clips were 5 minutes long and Yélî Dnye clips were 2.5 minutes long, resulting in a total of 45 minutes and 22.5 minutes of annotation per recording for the Tseltal and Yélî Dnye corpora, respectively. The five-minute clips in Tseltal had no additional context; this length of clip already provides significant context. The 2.5-minute clips for Yélî Dnye were followed by an additional 2.5 minutes of recording context. Thus, the total context review period for annotation clips across all corpora was five minutes (Figure 1).
Minor deviations in the sampling process between corpora are not expected to have meaningful effects on the analyses: all clips are short and randomly selected from throughout the child’s waking day. These deviations arose because the Tseltal and Yélî Dnye datasets required significant contributions from native local speakers in each remote community sampled, and so the annotation workflow was adapted to suit the associated researcher’s fieldwork schedule.
The final clip collection therefore consists of 35.8 hours of transcribed and annotated recording time, of which 16.3 hours consists of communicative vocalizations. Given the constraints across corpora on transcription work hours, this was near the ceiling of manual annotation data we could generate. It was unknown in advance how many recording minutes would be needed to produce meaningful results. That said, Casillas et al. (Reference Casillas, Brown and Levinson2020, Reference Casillas, Brown and Levinson2021) found that the present amount and distribution of recording minutes were sufficient to detect many of the effects predicted here. Their findings are especially promising for the current set of analyses, which includes a similar statistical approach and re-uses those two datasets (now with additional corpora for comparison). Recent studies (Cychosz et al., Reference Cychosz, Villanueva and Weisleder2020; Marasli & Montag, Reference Marasli, Montag, Goldwater, Anggoro, Hayes and Ong2023; Micheletti et al., Reference Micheletti, de Barbaro, Fellows, Hixon, Slatcher and Pennebaker2020) have started building up a more general approach to sampling naturalistic behavior from daylong recordings, but a lack of prior knowledge about the distribution of different input densities from different types of talkers across these groups prevented us from being able to confidently peg our sampling technique to anticipated underlying effects. To counteract what we anticipated would be limited statistical power, we planned to only analyze effects for which we had strong a priori predictions (see an overview in Tables 2 and 3).
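As a quick check of the total annotated time stated at the start of the preceding paragraph, the per-recording sampling designs given in the text can be multiplied out. The dictionary layout and variable names are illustrative only; the split of the 20 Tseltal and Yélî Dnye recordings into 10 each follows from the selection of 10 recordings per corpus.

```python
# (recordings, clips per recording, minutes per clip), as described in the text
designs = {
    "NA English, UK English, Arg. Spanish": (49, 15, 2.0),
    "Tseltal": (10, 9, 5.0),
    "Yélî Dnye": (10, 9, 2.5),
}

minutes = {name: n * k * m for name, (n, k, m) in designs.items()}
total_minutes = sum(minutes.values())   # 1470 + 450 + 225
total_hours = total_minutes / 60
```

This gives 2,145 minutes, or 35.75 hours, consistent with the reported figure of 35.8 hours.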
Annotation technique
Each of the randomly selected segments was annotated using the ACLEW Annotation Scheme (https://osf.io/b2jep/, Casillas et al., Reference Casillas, Bergelson, Warlaumont, Cristia, Soderstrom, VanDam, Sloetjes, Lacerda, House, Heldner, Gustafson, Strömbergsson and Włodarczak2017a; Soderstrom et al., Reference Soderstrom, Casillas, Bergelson, Rosemberg, Warlaumont and Bunce2021), an ELAN-based approach (Wittenburg et al., Reference Wittenburg, Brugman, Russel, Klassmann and Sloetjes2006). Each annotator undergoes a rigorous and independent training and testing process to ensure intra- and inter-lab consistency in coding. Annotators segmented and transcribed all hearable human communicative vocalizations in the samples, with a separate tier for each individual talker to allow for overlapping talk. Each tier was identified by the talker’s perceived age and gender category (adult/child/unknown and female/male/unknown; e.g., FA1 = female adult 1 in Figure 1). All utterances (except the target child’s) were also annotated for the intended addressee in seven categories – exclusively target child, non-target child, adult, mixed-age, animal, other, unknown – on the basis of any available contextual and interactional information within the audio recordings.
Annotator reliability was checked by the complete re-annotation of one minute from each recording by a new annotator. We then compared the original annotations for each of these minutes to the re-coded annotations. A full reliability report is available at https://osf.io/pysth/, but to briefly summarize, error estimates for talker type annotations (e.g., disagreements about whether the talker is the target child or a different child) are far better than prior work has found between human and LENA (i.e., automated) annotations. Further, comprehensive kappa scores reflect moderate-to-substantial agreement (cross-corpus k range = 0.55–0.68) for talker types and fair-to-substantial agreement (cross-corpus k range = 0.32–0.64) for addressee, with wide variability in agreement between corpora. Despite CDS having some cross-linguistically recognizable features (e.g., Bornstein et al., Reference Bornstein, Tal, Rahn, Galperin, Pecheux, Lamour and Tamis-LeMonda1992; Fernald et al., Reference Fernald, Taeschner, Dunn, Papousek, de Boysson-Bardies and Fukui1989; Hilton et al., Reference Hilton, Moser, Bertolo, Lee-Rubin, Amir, Bainbridge, Simson, Knox, Glowacki, Alemu, Galbarczyk, Jasienska, Ross, Neff, Martin, Cirelli, Trehub, Song, Kim, Schachner and Mehr2022), we had expected somewhat lower reliability scores for addressee annotations because the reliability annotators were not always native speakers of the language of the file they were annotating; their annotation decisions were thus less informed by lexicosyntactic content than the (native-speaking) original annotators’. Most cases of disagreement arose when one annotator indicated silence or overlapping talk where the other annotator indicated talk from a single person – confusion between actual addressee categories was relatively low (see Supplementary Materials Section 8 for more details).
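As a rough illustration of how an agreement statistic of this kind is computed, the following minimal Python sketch implements Cohen's kappa for two annotators' labels over the same units. The labels, the unit of comparison, and the function name are hypothetical; the reliability report linked above may compute agreement differently (e.g., over time slices rather than utterances).

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: proportion of units given the same label.
    p_obs = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: derived from each annotator's marginal label counts.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_exp = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if p_exp == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical addressee labels for ten units
# (T = target child, A = adult, S = silence); not data from the study.
original = ["T", "T", "A", "A", "S", "S", "T", "A", "S", "S"]
recoded = ["T", "A", "A", "A", "S", "T", "T", "A", "S", "S"]
kappa = cohens_kappa(original, recoded)
```

In this toy case the annotators agree on 8 of 10 units, but kappa is lower than 0.8 because part of that agreement is expected by chance given how often each label is used.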
Data analysis
All statistical analyses were conducted in R with the glmmTMB package (Brooks et al., Reference Brooks, Kristensen, van Benthem, Magnusson, Berg, Nielsen, Skaug, Maechler and Bolker2017; R Core Team, Reference Team2019) and all figures were generated with ggplot2 (Wickham, Reference Wickham2016). Analysis scripts and raw anonymized data are available at https://osf.io/pysth/. Our two dependent measures were the rates of TCDS and ADS (both expressed in minutes per hour). We calculated TCDS and ADS input rate for each clip for each of three talker types: female adults (here “women”), male adults (here “men”), and children (here “children”, including both male and female children). All other utterances (e.g., language addressed to animals and language produced by electronic devices) were excluded. As motivated above, we designate TCDS versus ADS utterances based on who they were perceivably addressed to: ‘TCDS’ includes communicative utterances that were addressed exclusively to the target child (from an adult or another child). ‘ADS’ includes communicative utterances addressed to one or more adults (from an adult or from another child).
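To make the dependent measures concrete, the following minimal Python sketch shows one way a minutes-per-hour rate could be derived from utterance annotations for a single clip. The tuple layout, category labels, clip duration, and function name are illustrative assumptions; they are not taken from the analysis scripts, which were written in R, and the sketch ignores details such as overlapping speech.

```python
def input_rate_min_per_hr(utterances, clip_ms, talker_type, addressee):
    """Minutes of speech per hour of clip for one talker type and addressee.

    utterances: iterable of (talker_type, addressee, onset_ms, offset_ms).
    """
    speech_ms = sum(
        off - on
        for talker, addr, on, off in utterances
        if talker == talker_type and addr == addressee
    )
    # speech_ms / clip_ms is the proportion of the clip filled with speech;
    # multiplying by 60 expresses that proportion as minutes per hour.
    return 60.0 * speech_ms / clip_ms

# Hypothetical annotations for one clip (not data from the study).
clip = [
    ("woman", "target_child", 1_000, 4_000),
    ("woman", "adult", 5_000, 11_000),
    ("child", "target_child", 12_000, 13_500),
    ("man", "animal", 20_000, 21_000),  # excluded: neither TCDS nor ADS
]
clip_ms = 150_000  # hypothetical 2.5-minute clip

tcds_women = input_rate_min_per_hr(clip, clip_ms, "woman", "target_child")
ads_women = input_rate_min_per_hr(clip, clip_ms, "woman", "adult")
tcds_men = input_rate_min_per_hr(clip, clip_ms, "man", "target_child")
```

Under these assumptions, a clip with no matching utterances for a talker type simply yields a rate of zero, which is the source of the many zero values discussed next.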
TCDS and ADS input rate cannot be negative. In practice, they are modally zero or close to zero across clips. Given our random sampling technique, which can include periods of silence, many clips include no TCDS or ADS. These “down” times for input are part of the representative pattern of children’s language experience (see Footnote 6) but also present an analytical challenge: observed cases of 0 TCDS/ADS in many clips combined with a skewed non-negative distribution of > 0 TCDS/ADS in other clips. This distribution of TCDS/ADS across sampled clips cannot be modeled with the assumption of normality. We therefore used zero-inflated negative binomial mixed-effects regressions for our analyses. This regression type uses a two-model approach to overcome non-negative, overdispersed data with extra cases of zero – the case for the present data (Brooks et al., Reference Brooks, Kristensen, van Benthem, Magnusson, Berg, Nielsen, Skaug, Maechler and Bolker2017; Smithson & Merkle, Reference Smithson and Merkle2013).
The two model components constructed for the analyses of TCDS and ADS are: (1) a zero-inflation model (indicated by “ziformula” in the model formulae), which uses a logistic regression to model the likelihood of the presence of ‘zero’ cases in the data (e.g., answering questions like ‘are zero-TCDS clips less likely for older target children?’) and (2) a count model, which uses negative binomial regression to model how the non-zero rate of TCDS/ADS is influenced by the predictors of interest (e.g., answering questions like ‘is TCDS rate higher for older target children?’). The a priori predictions we laid out above can be applied to both model components, as shown in Tables 2 and 3.
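The way the two components combine can be sketched as a probability mass function. The following Python sketch assumes a negative binomial parameterized by its mean and a dispersion parameter (variance = mean + mean²/dispersion), a logit link for the zero-inflation component, and a log link for the count component; the parameterization and all numeric values are illustrative assumptions rather than details confirmed by the models reported here.

```python
import math

def nb_pmf(y, mu, theta):
    """Negative binomial pmf with mean mu and dispersion theta
    (variance = mu + mu**2 / theta)."""
    log_p = (
        math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
        + theta * math.log(theta / (theta + mu))
        + y * math.log(mu / (theta + mu))
    )
    return math.exp(log_p)

def zinb_pmf(y, mu, theta, pi_zero):
    """Zero-inflated NB: an extra zero with probability pi_zero,
    otherwise a draw from the negative binomial."""
    base = (1.0 - pi_zero) * nb_pmf(y, mu, theta)
    return pi_zero + base if y == 0 else base

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical linear predictors for one clip (not estimates from the paper).
pi_zero = inv_logit(-0.5)  # zero-inflation component, logit link
mu = math.exp(1.0)         # count component, log link
theta = 2.0                # hypothetical dispersion

p_zero = zinb_pmf(0, mu, theta, pi_zero)
total = sum(zinb_pmf(y, mu, theta, pi_zero) for y in range(500))
```

In this formulation a zero can arise from either component, which is why the probability of observing zero exceeds the zero-inflation probability alone.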
The simple effects included in the models were target child age (centered and standardized from age in months), number of talkers present in that clip (centered and standardized from the unique number of talkers across all clips), talker type (woman versus man/child), and language sample (North American English versus UK English/Argentinian Spanish/Tseltal/Yélî Dnye). We only included interactions for which we had a strong a priori hypothesis and thus the models for TCDS and ADS differ slightly in their structure (see the Results for the regression formulae).
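For readers less familiar with centering and standardizing, the transformation can be sketched in Python as follows. The ages are hypothetical, and the use of the sample standard deviation (n − 1 denominator) is an assumption about the scaling, not a detail stated in the text.

```python
import statistics

def standardize(values):
    """Center on the mean and divide by the sample standard deviation (n - 1)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

# Hypothetical ages in months (not the study sample).
ages_months = [3, 8, 14, 20, 27, 33]
z_ages = standardize(ages_months)
```

After this transformation a coefficient for age is interpreted per standard deviation of age rather than per month, and the intercept corresponds to a child of average age in the sample.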
We modeled language group and talker type as dummy-coded factorial variables, which limited our ability to make comparisons among language groups; e.g., if Tseltal were the reference level, the model outcomes for language group would give pairwise comparisons between Tseltal and all the other language groups, but not pairwise comparisons between other language groups, for example, between Argentinian Spanish and UK English. We selected ‘North American English’ and ‘women’ as our default reference levels for reporting model estimates below, given that North American English and linguistic input from female adults are the most well represented in (a) the current dataset and (b) prior work done on these populations. In addition, we were interested in establishing under-studied patterns that may be present in our dataset – effects that diverge from groups that are currently over-represented in the literature. Setting these levels as a reference gives us a first glimpse into the variation that has gone under-examined in past work. This analysis should not be understood as positioning North American English as a global default for understanding development.
That said, pairwise comparisons between language groups may also be of interest to readers. For those curious about how the reported effects below are impacted by the selected reference level of language group, we include versions of our models with each language group as the reference level in the Supplementary Materials (i.e., four additional versions of the TCDS and ADS model each; Section 6). Here in the main text our results focus on models of TCDS and ADS with North American English as the reference level for language group, and women as the reference level for talker type.
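The consequence of dummy coding for which comparisons a model reports can be sketched with a toy example in Python. The group means below are hypothetical values on the model's link scale, the function name is illustrative, and the sketch assumes a model with a single categorical predictor and no other terms.

```python
def treatment_contrasts(group_means, reference):
    """Intercept and coefficients implied by dummy (treatment) coding.

    With one categorical predictor and no other terms, the intercept is the
    reference group's mean and each coefficient is (group - reference).
    """
    intercept = group_means[reference]
    coefs = {g: m - intercept for g, m in group_means.items() if g != reference}
    return intercept, coefs

# Hypothetical group means (not values from the paper).
means = {"NA English": 1.0, "UK English": 1.2, "Tseltal": 0.7}

b0, coefs = treatment_contrasts(means, reference="NA English")
# coefs holds UK-vs-NA and Tseltal-vs-NA, but no direct UK-vs-Tseltal term.

b0_alt, coefs_alt = treatment_contrasts(means, reference="Tseltal")
# Re-leveling makes the UK-vs-Tseltal comparison an explicit coefficient.
```

The fitted values are the same under either reference level; only the set of comparisons that receive their own coefficient, standard error, and p-value changes.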
Results
Descriptive statistics for observed TCDS and ADS rates by language group and talker type are shown in Table 4 and in Figure 2. A visual summary of statistical model outcomes from the count models of TCDS and ADS rate is shown in Figure 3. Further, marginal mean plots of model-predicted TCDS and ADS rates across age, language group, and talker type are available in Supplementary Materials Section 7. In Tables 2 and 3 we provide a high-level summary of which hypothesized outcomes were statistically supported in the regressions described below.
Parentheses following the mean indicate the median and range across participants in each group.
As a reminder, we report results from the models of TCDS and ADS with North American English as the reference level for language group and women as the reference group for talker type. Identical models with the full range of alternate reference levels for language group are available in the Supplementary Materials (Section 6). Unless otherwise noted, the significant effects reported below are qualitatively similar (i.e., significant in the same direction) in all alternate models.
Target-child-directed speech
On average, across all recordings, children were exposed to 3.66 minutes of TCDS per hour (median = 3.24), with substantial individual variation between children (range = 0–10.12). Our model of TCDS rate included target child age (numeric; standardized), talker type (factorial; woman/man/child), the number of talkers present in the clip (numeric; standardized), and language group (factorial; NA English/UK English/Argentinian Spanish/Tseltal/Yélî Dnye), with two additional two-way interactions (talker type by language group and child age by talker type) and random intercepts by child. The zero-inflation model component included child age and language group as predictors (N = 2745 clips, log-likelihood = −2,703.72, overdispersion estimate = 8.94; formula = TCDS.min.p.hr ~ child.age + talker.type + num.tlkrs.in.clip + lang.grp + talker.type:lang.grp + child.age:talker.type + (1 | child.id), ziformula = ~ child.age + lang.grp).
Effects of child age, talker type, and number of talkers present
As predicted, we found no evidence that TCDS changed with age (B = −0.03, SE = 0.09, z = −0.31, p = 0.76). TCDS rate was significantly lower for men compared to women (B = −2.03, SE = 0.19, z = −10.69, p < 0.001) and for children compared to women (B = −3.54, SE = 0.37, z = −9.64, p < 0.001). TCDS rate was also significantly higher when there were more talkers present (B = 0.33, SE = 0.04, z = 7.62, p < 0.001).
Effects relating to language group
The baseline rate of TCDS input in North American English was estimated to be significantly higher than Yélî Dnye (B = −0.95, SE = 0.32, z = −2.97, p < 0.01), with no evidence for difference in baseline TCDS rate between North American English and any other language group (all p’s ≥ 0.58). TCDS input rate from men varied between language groups: compared to North American English, the TCDS rate from men was significantly higher in both Argentinian Spanish (B = 0.70, SE = 0.31, z = 2.29, p = 0.02) and Yélî Dnye (B = 0.75, SE = 0.37, z = 2.02, p = 0.04). Similarly, TCDS from children varied between language groups: compared to North American English, TCDS rates from children were significantly higher in all other language groups (UK English: B = 1.10, SE = 0.51, z = 2.16, p = 0.03; Argentinian Spanish: B = 1.58, SE = 0.45, z = 3.48, p < 0.001; Tseltal: B = 1.91, SE = 0.49, z = 3.91, p < 0.001; Yélî Dnye: B = 2.81, SE = 0.46, z = 6.11, p < 0.001).
Interaction between child age and talker type
We found no evidence that TCDS from men changed with age relative to TCDS from women (B = −0.13, SE = 0.13, z = −1.01, p = 0.31). In contrast, TCDS from children increased with age relative to TCDS from women (B = 0.29, SE = 0.12, z = 2.35, p = 0.02).
The zero-inflation regression component did not suggest any additional evidence for effects of child age or language group (North American English versus other groups) on the likelihood of a clip containing zero TCDS (all p’s ≥ 0.27).
Adult-directed speech
On average, across all recordings, children were exposed to 10.08 minutes of ADS per hour (median = 8.34), again with considerable variation between children (range = 0–38.54). Our model of ADS rate included target child age (numeric; standardized), talker type (factorial; woman/man/child), number of talkers in the clip (numeric; standardized), and language group (factorial; NA English/UK English/Argentinian Spanish/Tseltal/Yélî Dnye), with two additional two-way interactions (talker type by language group and child age by language group) and random intercepts by child. The zero-inflation model component only included language group; we had planned to also include child age in the zero-inflation component, but its inclusion led to model non-convergence issues. Child age remained a predictor in the count model (N = 2745 clips, log-likelihood = −4,190.69, overdispersion estimate = 15.69; formula = ADS.min.p.hr ~ child.age + talker.type + num.tlkrs.in.clip + lang.grp + talker.type:lang.grp + child.age:lang.grp + (1 | child.id), ziformula = ~ lang.grp).
Effects of child age, talker type, and number of talkers present
ADS rate decreased significantly with age (B = −0.31, SE = 0.13, z = −2.45, p = 0.01), but this effect was non-significant in some alternate models with other reference levels (see Supplementary Materials Section 6) and so should be considered preliminary. We note that, across all alternate models, the estimate for an effect of age remained numerically negative. ADS rate was also significantly lower for men compared to women (B = −1.00, SE = 0.13, z = −7.77, p < 0.001) and for children compared to women (B = −0.89, SE = 0.12, z = −7.16, p < 0.001). This result, suggesting a difference between children and women, depends on which language group is chosen for the reference level: it is non-significant, though still numerically negative, when UK English is set as the reference level (see Supplementary Materials Section 6). As with TCDS, ADS was significantly higher when there were more talkers present (B = 0.71, SE = 0.03, z = 21.54, p < 0.001).
Effects relating to language group
There was no evidence for differences between baseline ADS input rate in North American English and any other language group (all p’s ≥ 0.22). There was also no evidence that the difference in women’s and men’s ADS input rates varied between North American English and any other language group (all p’s ≥ 0.14). In contrast, the difference in women’s and children’s ADS input rates was significantly smaller in UK English compared to North American English (B = 0.60, SE = 0.26, z = 2.31, p = 0.02). There was also no evidence that age-related change in ADS input rates varied between North American English and any other language group (all p’s ≥ 0.11).
The zero-inflation regression component did not suggest any additional evidence for effects of language group (North American English versus other groups) on the likelihood of a clip containing zero ADS (all p’s ≥ 0.99).
Readers interested in exploring further pairwise comparisons of TCDS and ADS effects between language groups (e.g., Tseltal versus UK English) are encouraged to view alternate versions of the models of TCDS and ADS in the Supplementary Materials (Section 6).
Discussion
We examined how two input sources, TCDS and ADS, vary in children’s early language environments, depending on child age, talker type, language group, and number of talkers present. Our data come from a metacorpus of 69 daylong recordings from children under three in five culturally and linguistically distinct groups. The present paper is the first to examine the joint effects of these factors across multiple language groups, shedding light on typical patterns in children’s early language experiences across these different contexts. This project also presented a successful model for sampling and annotating child language data in a unified manner across different labs. In this discussion we highlight four major findings: (1) minimal effects of age; (2) women’s input predominates, men’s is rare, and children’s varies between language groups; (3) more talkers leads to more talk; and (4) minimal evidence for baseline differences in TCDS and ADS input rates between language groups. While many of the predictions we made initially were supported, some were not (Tables 2 and 3). In what follows, we briefly discuss each of the four major findings highlighted, raising the most relevant implications of each.
Minimal effects of age
TCDS rate showed no significant change across this developmental period (0;0–3;0) while ADS rate significantly decreased with target child age. The result replicates prior findings on daylong TCDS and ADS in a subset of these groups (Bergelson, Reference Bergelson2020; Bergelson et al., Reference Bergelson, Casillas, Soderstrom, Seidl, Warlaumont and Amatuni2019b; Casillas et al., Reference Casillas, Brown and Levinson2020, Reference Casillas, Brown and Levinson2021). However, this significant effect of target child age on ADS rate should be taken as preliminary, given that alternate reference-level models do not always show this effect. The lack of evidence for an increase in TCDS rate with age, consistent with our predictions, may appear inconsistent with the findings reported in Ramírez-Esparza et al.’s (Reference Ramírez-Esparza, García-Sierra and Kuhl2017a) study. The substantial differences in their and our constructs (i.e., “parentese” register versus TCDS) and measurement approach (i.e., clip-by-clip classification versus utterance-based transcript analysis) unfortunately prevent direct comparison between these two studies, but future work may examine both approaches within the same corpus to get a more comprehensive view of how child age impacts the quantities of different input sources.
It is not yet clear what would lead to this decrease in ADS with target child age. One existing proposal is that children become independently able to wander away from adult conversation as they gain mobility and independence (Bergelson et al., Reference Bergelson, Casillas, Soderstrom, Seidl, Warlaumont and Amatuni2019b). This proposal is consistent with our main result, but confirming it would require information beyond the current recordings, and we would moreover need to explain why the decrease in ADS is sensitive to which language group is selected as the reference.
The lack of evidence for an increase in TCDS during this early period, when we know that children experience immense growth in their linguistic knowledge and processing capacity, aligns with recent work reasoning that growth in early linguistic skills reflects children’s changing efficiency and sophistication in extracting relevant information from their ambient linguistic environments, as opposed to direct changes to their linguistic input (see Bergelson, Reference Bergelson2020 for a review). Rather than attributing development to changes in the input, this theoretical approach looks instead to growth in children’s ability to engage in real-time language prediction and use of already acquired world and symbolic language knowledge (Bergelson, Reference Bergelson2020; Meylan & Bergelson, Reference Meylan and Bergelson2022; Snedeker et al., Reference Snedeker, Geren and Shafto2007). To this account, the current findings add a preliminary but important cross-linguistic datapoint: this basic idea may hold across diverse linguistic contexts.
Women’s input predominates, men’s is rare, and children’s varies between language groups
Regarding the talkers producing children’s input, we found that women predominate in children’s language environments. The prevalence of woman-produced language over man- and child-produced language was evident for both TCDS and ADS. However, the extent to which women’s input predominates – especially for TCDS – varied. The rate of TCDS produced by men was significantly higher in Argentinian Spanish and Yélî Dnye compared to North American English. The rate of TCDS produced by children was significantly higher in all language groups compared to North American English, and TCDS produced by children increased more with age relative to TCDS produced by women. In contrast, we found very little evidence for change in the rates of ADS from different talkers across age or language group: only UK English showed a significant difference from North American English, with a significantly smaller gap between children and women’s ADS rates. We are cautious in interpreting this lack of evidence for age-related change within speaker types given our limited sample size.
One implication of these findings is that, across these different language groups, women’s input plays an outsize role in their children’s input, both in terms of directed and observable language. While there was very clear cross-linguistic variation in the contribution of different talker types, this central role for woman-produced linguistic input was clear across our dataset. We are far from the first researchers to make this observation for child language input (see, e.g., Bateson, Reference Bateson and Bullowa1979; Bergelson et al., Reference Bergelson, Casillas, Soderstrom, Seidl, Warlaumont and Amatuni2019b; Bruner, Reference Bruner1983; Cooper & Aslin, Reference Cooper and Aslin1989; Mannle et al., Reference Mannle, Barton and Tomasello1992), and talker-specific effects on early linguistic representations have been demonstrated previously in experimental tests of implicit language knowledge (e.g., Bergelson & Swingley, Reference Bergelson and Swingley2018; Hillairet de Boisferon et al., Reference Hillairet de Boisferon, Dupierrix, Quinn, Lœvenbruck, Lewkowicz, Lee and Pascalis2015; Houston & Jusczyk, Reference Houston and Jusczyk2000; Martin et al., Reference Martin, Schatz, Versteegh, Miyazawa, Mazuka, Dupoux and Cristia2015). However, our findings underscore how cross-linguistically pervasive these effects may be, urging further work on the talker-specific properties of infants’ early linguistic representations and the mechanisms by which these early representations become more robust to different talker types over time.
More talkers leads to more talk
As predicted, and consistent with prior work (Casillas et al., Reference Casillas, Brown and Levinson2020, Reference Casillas, Brown and Levinson2021; Sacks et al., Reference Sacks, Schegloff and Jefferson1978; Stein et al., Reference Stein, Menti and Rosemberg2021), we found that more talkers leads to more input, both for TCDS and ADS. This effect is due in part to the simple fact that, all else being equal, the presence of more talkers leads to more talk. The presence of more talkers increases competition for the floor (Holler et al., Reference Holler, Alday, Decuyper, Geiger, Kendrick and Meyer2021; Sacks et al., Reference Sacks, Schegloff and Jefferson1978). When there are four or more individuals present, as is the average case in all but the English-speaking groups (see Supplementary Materials Section 2), there is an opportunity for interactants to break off into smaller conversations (e.g., two two-person conversations), potentially doubling the observable talk via overlapping conversation in the input (Sacks et al., Reference Sacks, Schegloff and Jefferson1978). Future work might selectively examine subsets of daylong data to more precisely characterize how interactionally driven factors such as the number of talkers present account for fluctuations in a child’s linguistic input rate, both within and across language groups. Doing so will likely require more transcribed interactions than the current dataset offers.
It is notable that multi-party talk and, in general, observable talk between others is abundant across these groups, with raw ADS estimates typically far outpacing individually addressed child input (i.e., TCDS). While a variety of language-learning processes may indeed benefit from early exposure to observable talk (e.g., syntactically complex prosodic structures in adult-adult conversation), ADS learning effects in infancy and early toddlerhood have gone largely unexamined (though see, e.g., Akhtar et al., Reference Akhtar, Jipson and Callanan2001; Foushee et al., Reference Foushee, Srinivasan and Xu2021; Oshima-Takane et al., Reference Oshima-Takane, Goodz and Derevensky1996). Given the current results, it may be worthwhile to dig deeper into how observable talk and multi-party talk influence the very early stages of language learning (e.g., de León, Reference de León1998; de León & Garcia-Sánchez, Reference de León and García-Sánchez2021), especially the early development of non-referential linguistic knowledge (e.g., regarding phonology, see Cristia, Reference Cristia2020; regarding conversational turn taking, see Dunn & Shatz, Reference Dunn and Shatz1989). As we discuss in more detail below, an important addendum to this discussion is that systematic differences in multi-party interaction might also be understood as cultural – not solely situation-specific.
Minimal evidence for baseline differences in language group
Regarding effects of language group on baseline TCDS and ADS input rates, we had predicted that children acquiring Argentinian Spanish, Tseltal, and Yélî Dnye would encounter lower rates of TCDS and higher rates of ADS compared to North American English-acquiring children. By and large, our data show little evidence for these hypotheses. When it came to TCDS, we only observed one case where input rates differed: Yélî Dnye’s baseline TCDS rate was significantly lower than that of North American English. When it came to ADS, we found no evidence for differences in baseline ADS rate between North American English and other language groups. This set of results may come as a surprise, considering that the raw rates of TCDS and ADS clearly vary between language groups, in ways that often align with our original predictions (e.g., the raw ADS rate of Yélî Dnye is nearly two and a half times larger than that of North American English; Table 4). A reasonable question is then why our model estimates don’t reflect these differences. Beyond the concern of statistical power – which is relevant given our relatively small samples – it is essential to think again about where differences between groups could come from. In particular, it’s worth re-examining how to understand the effect of the number of talkers present: could this be group-specific behavior or not?
We observed that many of the apparent inconsistencies with mean overall TCDS and ADS rate come from systematic between-group differences in the number and composition of talkers. For example, Yélî Dnye children had an average of 6.06 talkers present in addition to the target child, while North American English children only had 1.81. So, even if the baseline rate of TCDS is significantly lower per talker in Yélî Dnye (as our model above suggests), there are many more talkers present in the Yélî children’s acoustic environment compared to the North American children. Consequently, the TCDS experienced by children in these two groups appears comparable overall (Table 4).
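The logic of this trade-off can be sketched with simple arithmetic on the count-model coefficients reported in the Results. The Python sketch below assumes a log link, treats the two coefficients as the only terms that differ between groups, and ignores interactions, random effects, and the zero-inflation component; it is an illustration of the reasoning, not a re-analysis.

```python
import math

# Count-model coefficients reported in the Results (log link assumed).
b_yeli = -0.95     # Yélî Dnye versus North American English baseline TCDS
b_talkers = 0.33   # per standard deviation of number of talkers present

def rate_ratio(z_talkers_diff):
    """Multiplicative difference in expected TCDS rate relative to the
    reference group, given a difference in standardized talker count.

    Simplification: ignores interactions, random effects, zero inflation.
    """
    return math.exp(b_yeli + b_talkers * z_talkers_diff)

# Standardized talker-count difference at which the two terms cancel out.
break_even = -b_yeli / b_talkers

same_talkers = rate_ratio(0.0)    # groups compared at equal talker counts
offset = rate_ratio(break_even)   # deficit fully offset by extra talkers
```

On this simplified reading, a lower per-talker baseline combined with a larger number of talkers can yield comparable overall input.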
We tested this idea in a post-hoc analysis where we removed number and type of talker from the regression models, only leaving child age and language group as predictors (Supplementary Materials Section 4; see Footnote 7). To the extent that number of talkers and talker type are correlated with language group, their associated variance will be incorporated with language group effects in these simpler models (Wurm & Fisicaro, Reference Wurm and Fisicaro2014), giving us an interpretive view closer to that implied by the by-language-group averages in Table 4. The simpler models suggested no evidence for difference in TCDS rate between North American English and the other groups, and suggested significantly higher ADS in Yélî Dnye (and significantly lower likelihood of a zero-ADS clip) compared to North American English. In sum, the results for Yélî Dnye look very different depending on whether variance in the number of talkers (and thereby variance in the quantity of TCDS/ADS) can be attributed to the language group (i.e., in the “simple” models) or whether it’s pulled out as a separate, nuisance predictor (i.e., in the primary models). This point is important for two reasons.
First, the same data can lead to very different conclusions depending on what variance is treated as group-specific vs. not. From an ethnographic perspective, it may be completely valid to consider features like number and composition of talkers a part of children’s specific cultural and linguistic milieu. The number of talkers present, after all, likely relates to cultural practices around childcare (e.g., alloparenting), household organization (e.g., multigenerational housing), and daily activities (e.g., food preparation routines; see Gaskins (Reference Gaskins and Göncü1999) and Casillas (Reference Casillas2023) for more discussion of these issues). Put differently, variation in the number of talkers present can signal group-specific routines, practices, and interactional contexts. However, our perspective is that, in order to understand how these factors might generally and cross-culturally influence children’s linguistic input, we need to analyze them as (partly) separate from culture. Doing so gives us a glimpse into how basic processes of conversational coordination and caregiving may shape children’s input in broadly similar ways across diverse human groups and thus give us insight into how children learn language so robustly across widely varied home environments.
Second, and complementarily, a lack of group effects (beyond differences in the prevalence of multiparty talk) does not imply that early language environments are cross-culturally and cross-linguistically similar. Our measures represent highly simplified quantifications of two sources of linguistic input – one designed specifically for the target child and one designed for adults – but capture nothing about the content of naturalistic input or its integration into children’s interactive or multi-modal experiences (e.g., Bergelson et al., Reference Bergelson, Amatuni, Dailey, Koorathota and Tor2019a; Broesch et al., Reference Broesch, Rochat, Olah, Broesch and Henrich2016; de León, Reference de León1998; Kuchirko et al., Reference Kuchirko, Tafuro and Tamis-LeMonda2018; Montag, Reference Montag, Denson, Mack, Xu and Armstrong2020; Rosemberg et al., Reference Rosemberg, Alam, Audisio, Ramirez, Garber and Migdalek2020, Reference Rosemberg, Alam, Ramirez and Ibañez2023; Rowe, Reference Rowe2012). The measure we use here, while a crucial starting point, is too coarse to make detailed conclusions regarding qualitative similarities or differences in children’s early language experiences. To do so, we would need much more than the present data can offer, at least: (1) detailed generative models of how much input children encounter, from whom, and under what conditions – to which the present study contributes; (2) an understanding of the content of that input and how it fluctuates under different conditions; and (3) documentation of the local cultural, institutional, social, economic, and material realities that may radically change the experienced linguistic input.
Finally, we note that our findings do not cleanly divide between so-called “WEIRD” and “non-WEIRD” (Henrich et al., Reference Henrich, Heine and Norenzayan2010) groups. For example, Yélî Dnye and Tseltal – the two rural subsistence communities represented here – do not pattern together in our data, and neither do the two historically related urban post-industrial populations – North American and UK English (see Cristia, Reference Cristia2023 for further discussion). This highlights the importance of considering each population in its own right when making claims about cultural and linguistic similarities and differences. While ultimately we hope to pinpoint areas of similarity and systematic variation in language development across a wide variety of developmental contexts, it is far too early to make universalist claims about patterns in children’s real-world language experiences. The WEIRD or non-WEIRD distinction, while helpful to illustrate cultural biases in behavioral research, can also unfortunately reinforce those same biases by grouping together very distinct cultural groups in opposition to Western, primarily North American, groups (for further discussion relating specifically to infant research, see Singh et al., Reference Singh, Cristia, Karasik, Rajendra and Oakes2023).
Limitations
There were minor methodological variations in sampling, transcribers, and transcription due to the logistical constraints of annotating across different labs and language settings – such minor variation is inevitable in comparative work on naturalistic interaction. We carefully considered these minor deviations, and have no reason to believe that they impacted our findings in any meaningful way. Of greater concern, however, is whether our collection of annotated clips constitutes enough data to reveal true underlying effects. We sampled randomly over the course of the daylong recording to capture a representative sample of young children’s input, which often includes “down time” moments. Given the diversity of populations in our metacorpus, random sampling was also the most straightforward way to ensure that our sampling method itself did not introduce confounds across corpora (e.g., if we had picked high-vocal-activity segments only or otherwise activity-centered moments like “play”). However, the highly zero-inflated nature of children’s daily experiences (Mendoza & Fausey, Reference Mendoza and Fausey2021) challenges our statistical approach and interpretations.
Best estimates to date suggest that our sample size (22.5–45 minutes per recording) is reasonable for obtaining preliminary stable estimates (Cychosz et al., Reference Cychosz, Villanueva and Weisleder2020; Micheletti et al., Reference Micheletti, de Barbaro, Fellows, Hixon, Slatcher and Pennebaker2020). For example, Marasli and Montag (Reference Marasli, Montag, Goldwater, Anggoro, Hayes and Ong2023) examine estimated versus true word counts from daylong recordings using a variety of random sampling schemes, finding that a total of 30 minutes of randomly sampled 1–5-minute clips yields accurate average word count estimates, with varying but symmetrical rates of error depending on clip duration (shorter is better). Word count and utterance duration are highly correlated, and utterance duration directly corresponds to our measure of quantity (see DeAnda et al., Reference DeAnda, Bosch, Poulin-Dubois, Zesiger and Friend2016; see also Räsänen et al., Reference Räsänen, Seshadri, Lavechin, Cristia and Casillas2021 for evidence from the specific corpora used here). Therefore we consider the currently sampled data as sufficient for an accurate approximation of input rates from daylong recordings. However, we leave it to future work to refine this assumption based on a greater diversity of daylong recording types.
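The sampling logic discussed above can be illustrated with a small simulation. The following is only a minimal sketch under stated assumptions, not the procedure used in the present study or in the cited work: the simulated day, the proportion of silent minutes (`p_silent`), the mean speech per active minute (`mean_speech_s`), and all function names are hypothetical choices made for illustration. It builds a zero-inflated "day" of per-minute speech durations, draws 30 minutes of randomly placed 5-minute clips, and compares the clip-based input rate with the rate computed over the whole day.

```python
import random


def simulate_day(n_minutes=960, p_silent=0.7, mean_speech_s=12.0, seed=0):
    """Return hypothetical seconds of speech for each minute of a day.

    Most minutes contain no speech at all (zero inflation); the rest
    contain an exponentially distributed amount, capped at 60 seconds.
    All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    day = []
    for _ in range(n_minutes):
        if rng.random() < p_silent:
            day.append(0.0)
        else:
            day.append(min(60.0, rng.expovariate(1.0 / mean_speech_s)))
    return day


def sample_clip_starts(n_minutes, clip_len, n_clips, rng):
    """Choose start minutes for non-overlapping clips from aligned slots."""
    slots = list(range(0, n_minutes - clip_len + 1, clip_len))
    return sorted(rng.sample(slots, n_clips))


def rate_min_per_hour(seconds_per_minute):
    """Input rate in minutes of speech per hour.

    Mean seconds of speech per minute is numerically identical to
    minutes of speech per hour, so a simple mean suffices.
    """
    return sum(seconds_per_minute) / len(seconds_per_minute)


def estimate_from_clips(day, clip_len=5, n_clips=6, seed=1):
    """Estimate the day's input rate from randomly sampled clips."""
    rng = random.Random(seed)
    starts = sample_clip_starts(len(day), clip_len, n_clips, rng)
    sampled = [s for start in starts for s in day[start:start + clip_len]]
    return rate_min_per_hour(sampled)


if __name__ == "__main__":
    day = simulate_day()
    true_rate = rate_min_per_hour(day)
    # Repeat the 30-minute sampling many times to see how estimates spread.
    estimates = [estimate_from_clips(day, seed=s) for s in range(500)]
    mean_estimate = sum(estimates) / len(estimates)
    print(f"true rate:     {true_rate:.2f} min/hr")
    print(f"mean estimate: {mean_estimate:.2f} min/hr")
    print(f"range:         {min(estimates):.2f} to {max(estimates):.2f} min/hr")
```

Varying `clip_len`, `n_clips`, and `p_silent` in this sketch gives an informal sense of how the spread of single-sample estimates depends on clip duration, total sampled time, and the degree of zero inflation.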
Indeed, while we may have sampled sufficiently within recordings to create stable estimates for each child, the present analyses would be more powerful if done over more recordings, with a greater number of language groups, and/or with a more systematic or theory-driven selection of cultural or linguistic contexts to study. These are persistent problems in the field of developmental science (e.g., Kosie & Lew-Williams, Reference Kosie and Lew-Williams2022; Oakes, Reference Oakes2017; Singh et al., Reference Singh, Cristia, Karasik, Rajendra and Oakes2023) and so, as usual, any null effects should be taken as preliminary.
Importantly, we see the present study as an initial assessment of differences between these populations in children’s home linguistic experiences, and do not believe that any single study should be considered the final word in comparisons of this nature. Indeed, another weakness of the current study is a lack of deep incorporation of existing ethnographic and language socialization claims about these populations (Brown & Gaskins, Reference Brown, Gaskins, Enfield, Kockelman and Sidnell2014; Gaskins, Reference Gaskins, Enfield and Levinson2006; Ochs & Schieffelin, Reference Ochs, Schieffelin, Schweder and LeVine1984). What our findings do highlight is that specific facets of behavioral patterns (e.g., housing arrangements, child caregivers, etc.) are visible in quantitative measures of children’s language environment in ways that allow us to identify axes of cross-linguistic and cross-cultural variation that are relevant for developing generalizable theories of language learning. By looking deeper at the local context for each dataset, we would better understand variation within each. For example, the construct of “socioeconomic status” is so different between these communities that it is hard to imagine a meaningful way of comparing groups directly. Instead, within-population analyses that take into account individual and collective power within social hierarchies and relevant local institutions seem much more likely to shed light on socioeconomic effects across these corpora (see Footnote 8). We thus strongly urge readers to exercise caution in generalizing our results (or those of other researchers) beyond the current data to new populations.
Finally, the present set of analyses examines input without taking into account patterns of target child vocalization or turn-taking with the target child, and without examining how overlapping vocalizations would change the estimates presented here (e.g., Broesch et al., Reference Broesch, Rochat, Olah, Broesch and Henrich2016; de León, Reference de León1998; Donnelly & Kidd, Reference Donnelly and Kidd2021; Elmlinger et al., Reference Elmlinger, Goldstein and Casillas2023; Kuchirko et al., Reference Kuchirko, Tafuro and Tamis-LeMonda2018; Scaff et al., Reference Scaff, Casillas, Stieglitz and Cristia2023). Examination of the target children’s vocalizations and their active interaction with other talkers is outside the scope of the present paper but is an active area of work by the present author team. Determining which overlapping vocalizations may be seriously degraded in children’s perception of their input (Erickson & Newman, Reference Erickson and Newman2017; Hall et al., Reference Hall, Grose, Buss and Dev2002) is also beyond the scope of the present paper, and is complicated by the varying types and levels of background noise, the time spent outdoors, and the activity contexts in which overlap is embedded (e.g., two simultaneous adult conversations versus simultaneous chanting of a phrase by three children playing a game). Excluding all overlapping talk would surely reduce the estimates presented here, but we are unconvinced that doing so would contribute much more to our understanding than the current data do. For research directly considering this issue of overlapping speech and its impact on input estimates, we point readers to work by Scaff et al. (Reference Scaff, Casillas, Stieglitz and Cristia2023).
Conclusion
Our findings revealed that, across a diverse set of cultural and linguistic contexts, the quantity of input directed to children during the first three years is both relatively low and stable across age. Overhearable adult-directed input is much more available, but our preliminary evidence suggests that it decreases across age. Language group also impacts who input is likely to come from, especially when it comes to directed input from other children, which is more common in some groups than others. That said, women’s input predominates overall. Finally, the number of talkers who are present matters a great deal for the amount of language encountered, both target-child directed and adult-directed. These results add to a growing body of work quantifying the outsize role women’s input plays in children’s early language exposure across varied cultural and linguistic groups. They also highlight the fact that children’s relative exposure to input from other talker types – especially language from other children – is an important and understudied aspect of their early linguistic input. Finally, they underscore the importance of understanding how other aspects of everyday life drive patterns in language exposure (e.g., the number of others present), opening up pathways for future work to more precisely pinpoint the nature of these differences and their relationship to early language development.
Supplementary material
The supplementary material for this article can be found at http://doi.org/10.1017/S030500092400028X.
Acknowledgements
This research was supported by the Social Sciences and Humanities Research Council of Canada (435-2015-0628, 869-2016-0003) and by the Natural Sciences and Engineering Research Council of Canada (501769-2016-RGPDD) to Melanie Soderstrom; by the National Endowment for the Humanities (HJ-253479-17), National Institutes of Health Grant DP5-OD019812, and National Science Foundation BCS-1844710 to Elika Bergelson; a CONICET grant, PIP 80/2015, and a MINCyT grant, PICT 3327/2014 to Celia Rosemberg; and an NWO Veni Innovational Scheme Grant (275-89-033) to Marisa Casillas. We thank Anne Warlaumont and Caroline Rowland for contributing their datasets to this project and for helpful feedback on this manuscript. Finally, we thank the families who participated in the recordings that made this research possible.