Introduction
Theory of Mind (ToM) is usually defined as the ability to attribute mental states to others in order to predict or explain their behavior (for a review, see Sabbagh & Bowman, 2018). The large body of ToM research on monolingual children gathered over the last 40 years has established various cognitive and social predictors of ToM development at preschool age, e.g., Executive Functions or social abilities (for a review, see Hughes & Devine, 2019). Skills related to language have also been listed as important predictors of ToM, e.g., general language abilities, receptive vocabulary, or understanding of complement clauses (for a review, see Astington & Baird, 2005; de Villiers & de Villiers, 2014; Milligan et al., 2007; for a meta-analysis, see Tompkins et al., 2019). Despite a clear connection between language abilities and ToM, the potential impact of bilingualism on ToM has been addressed in relatively few studies; recent reviews report only 16 (Schroeder, 2018) or 24 (Yu et al., 2021) such studies.
Given the importance of linguistic abilities for ToM development in monolinguals, it seems reasonable to expect that bilingualism (i.e., the knowledge and the experience of living within a two-language environment) should exert some impact on how ToM develops. The seminal study of Peggy Goetz (2003) provided the first empirical support for an advantage of bilingual children in ToM. Since then, at least 13 studies with children above the age of four but below the age of seven have explored this idea further. Several of these studies provided evidence in favor of the idea that bilingualism might indeed lead to earlier development of ToM, although negative evidence is also available (for a systematic review, see Rubio-Fernández, 2017; for a meta-analysis, see Schroeder, 2018; Yu et al., 2021). However, little is still known about the underlying mechanism(s) and the predictors of ToM ability in bilinguals. Especially lacking is an understanding of the role that language skills as well as L1 and L2 input play in ToM development.
It is undisputed that growing up with two languages naturally establishes a different linguistic environment and different computational demands than growing up with just one language. In other words, not only language skills (e.g., proficiency in two languages) but also experience with more than one language (e.g., the quantity and quality of input) may impact ToM abilities. Therefore, the main goal of this paper is to address the important gaps in the literature and explore how “language factors” – which are considered critical in ToM development in monolingual children (see Hughes & Devine, 2019 for a review; Milligan et al., 2007 for a meta-analysis) – contribute to ToM reasoning in bilinguals aged 4–6.
Theory of Mind in monolinguals: critical predictors and pitfalls in assessment
ToM, or ‘mindreading ability’, refers to the attribution of mental states, i.e., beliefs, thoughts, feelings, desires, emotions, or intentions, to others in order to predict or explain their behavior. This ability has been widely investigated for almost 40 years in monolinguals, both adults and children (for reviews, see, e.g., Apperly, 2011; Hughes & Devine, 2019; Wellman, 2014). In younger children, before and just after their fourth birthday, a standard first-order false-belief task (Wimmer & Perner, 1983) has typically been used. Although the use of only one type of task to measure a complex ability may raise concerns about measurement reliability (Hughes et al., 2000), it is only recently that sets of tasks or scales have been developed for toddlers and young children (e.g., Białecka-Pikul et al., 2018; Wellman et al., 2011). For older children (aged 4–6) who pass the first-order false-belief task, various other tasks have been developed, each of which probably taps slightly different ToM abilities.
The false-belief tasks commonly used with four-year-olds and younger children are the unexpected transfer task (Wellman et al., 2001) and the unexpected content task (Perner et al., 1987). These are first-order belief tasks as they require the tested child to consider the beliefs of a story character. Usually, children successfully pass these tasks at age four (Wellman et al., 2001). Importantly, both of these tasks are “pass-fail” tasks, meaning there is a 50% likelihood that a particular child passes the task by chance, which creates problems with interpreting results. One way to overcome this challenge is to supplement the main test questions (e.g., Where will Max look for the chocolate?) with a justification question (e.g., Why will he look there?). However, previous research has rarely used this strategy (for an exception, see, e.g., Białecka-Pikul et al., 2018). Such a direct question may be very helpful as it provides information on whether a child can explicitly refer to the mental states of others (e.g., that the story character does not know that Mum moved the chocolate, or he thinks the chocolate is there) and explain a character's reasoning process. Thus, by requiring children to justify their answers, we gain better insight into a child's thinking process and conceptual development (Lombrozo, 2006); therefore, we can measure ToM reasoning more accurately. Naturally, “why” questions put higher linguistic demands on a child than “what” or “where” questions (de Villiers, 1991).
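The chance-level problem described for pass-fail tasks can be made concrete with a short binomial calculation. The sketch below is only an illustration: the function name is hypothetical, and it assumes that tasks are independent and that a guessing child passes each one with the same probability (0.5 by default), which is a simplification not taken from the studies cited here.

```python
from math import comb


def p_pass_by_guessing(n_tasks: int, min_passed: int, p_guess: float = 0.5) -> float:
    """Probability that pure guessing yields at least `min_passed` passes.

    Assumes independent tasks with an identical guessing probability
    (an illustrative simplification, not a claim about any real battery).
    """
    return sum(
        comb(n_tasks, k) * p_guess**k * (1 - p_guess) ** (n_tasks - k)
        for k in range(min_passed, n_tasks + 1)
    )


if __name__ == "__main__":
    # One pass-fail task versus passing every task in longer batteries.
    for n in (1, 2, 4, 8):
        print(n, p_pass_by_guessing(n, n))
```

Under these assumptions, the probability of passing every task by guessing halves with each added task, which is one way to see why batteries of several tasks are harder to pass by chance than a single pass-fail item.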
Another solution for improving measurement of ToM in older children is to use a series of tasks of varied difficulty. More difficult ToM tasks tap into recursive thinking, i.e., thinking about thinking (Miller, 2012). For example, in the second-order false-belief task (Perner & Wimmer, 1985; Tager-Flusberg & Sullivan, 1994) that is used with five- and six-year-olds, the child is required to consider a character's belief about another character's belief (e.g., Mum thinks that Max thinks that…). Other tasks for older children measure interpretative abilities (Chandler & Lalonde, 1996), such as understanding of interpretation (Lalonde & Chandler, 2002), understanding of ambiguity (Carpendale & Chandler, 1996), deception (Talwar et al., 2007), or understanding of somebody's surprise (Hadwin & Perner, 1991). All these tasks are more complex than standard first-order stories, and they impose greater linguistic demands. As such, they provide a more sensitive assessment of ToM development in children older than four years.
The importance of language skills for ToM performance has been demonstrated in longitudinal studies as well as in research on atypical populations (e.g., Mazza et al., 2017) or intervention studies (e.g., Lohmann & Tomasello, 2003). Reviewing all the studies that have used this methodological perspective is beyond the scope of the paper; however, in short, a meta-analysis by Milligan et al. (2007) found that language abilities (e.g., semantics or syntax) explained a large portion of variance in ToM (effect size, r = .43). In all the longitudinal studies included in this meta-analysis, the relation between early language ability and later ToM was stronger than the opposite (i.e., early ToM and later language ability), which suggests that language provides a foundation for ToM development – not the other way round (Hughes & Devine, 2019). In sum, in monolingual children, the better the language abilities, the more advanced the ToM development, at least in four-year-olds, who have not started systematic and formal language education.
ToM abilities in monolinguals have also been shown to be impacted by factors such as (1) age (e.g., Wellman et al., 2001; a critical change in false-belief understanding occurs in four-year-olds); (2) gender (e.g., Walker, 2005; girls outperformed boys); (3) socio-economic status (e.g., Devine & Hughes, 2018; the higher the status, the better the ToM, r = .18); and (4) executive function (EF; see the meta-analysis by Devine & Hughes, 2014; the more developed the EF, the higher the ToM, r = .38) (Footnote 1). All these factors should be taken into account or at least controlled for in research on ToM reasoning in children in early and middle childhood.
Theory of Mind in bilinguals: state of the art
As has been indicated, there are grounds to suggest that bilingual children may develop ToM abilities earlier than their monolingual peers. As stated initially by Goetz (2003) and more recently by Schroeder (2018) and Yu et al. (2021), a bilingual advantage in ToM should be expected for at least three different reasons, all of which are grounded in research comparing bilinguals with monolinguals. First, some studies suggest that bilinguals demonstrate greater meta-linguistic awareness than monolinguals, possibly because representations stored in two languages strengthen their general meta-representational skills (e.g., Doherty, 2000). In other words, using two languages to communicate with others can help (or even be essential for) metalinguistic and metacognitive abilities, including reasoning about our own and other people's thinking processes. Second, bilinguals show greater socio-linguistic or pragmatic abilities (by being able to adjust their language to others even at the age of two – e.g., Genesee et al., 1996), as well as an enhanced ability to follow the perspective of the interlocutor while communicating. These communicative abilities may in turn enhance thinking about the content of other people's minds. Finally, bilingual children often demonstrate better cognitive control abilities than monolinguals (e.g., Bialystok & Craik, 2010), which in themselves may be a necessary, or at least an important, factor supporting the bilingual advantage in ToM.
Schroeder (2018) conducted a meta-analysis of 16 studies that compared ToM performance between bilingual and monolingual children (N = 1283) and revealed a small bilingual advantage in ToM ability (Cohen's d = 0.22). The effect reached medium size (Cohen's d = 0.58) when the transformed ToM scores were statistically adjusted for bilingual vs monolingual differences in language proficiency. As argued by Schroeder, the results provide support for a beneficial effect on ToM reasoning of acquiring two languages. However, the studies included in Schroeder's meta-analysis differed in various factors that were not accounted for in the analysis. These included 1) the selection and matching of bilingual and monolingual samples; 2) the type of children's exposure to the second language (e.g., simultaneous vs. sequential); 3) the age range of the tested children; 4) the particular tasks used to measure ToM; 5) the L1 and L2 skills of bilinguals (if both were tested); and 6) the language in which these children were tested. It is therefore unclear to what extent the large heterogeneity across the studies contributed to the relative weakness of the observed main effect in ToM abilities and why, as suggested by Schroeder (2018), adjusting ToM scores “for bilingual-monolingual differences in language proficiency” (p. 8) enhanced the strength of the effect.
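Because the bilingual effects above are reported as Cohen's d while the monolingual predictors are reported as correlations, it can help to place both on one scale. The following is a minimal sketch using the standard conversion r = d / sqrt(d² + a), where a = (n1 + n2)² / (n1 · n2). The function name is hypothetical, and equal group sizes (a = 4) are an assumption made here for illustration, not a property reported for the meta-analysis.

```python
from math import sqrt


def d_to_r(d: float, n1: int = 1, n2: int = 1) -> float:
    """Convert Cohen's d to a point-biserial correlation r.

    Uses r = d / sqrt(d**2 + a) with a = (n1 + n2)**2 / (n1 * n2).
    The defaults n1 == n2 give a = 4, i.e. equal group sizes
    (an illustrative assumption).
    """
    a = (n1 + n2) ** 2 / (n1 * n2)
    return d / sqrt(d * d + a)


if __name__ == "__main__":
    for d in (0.22, 0.58):
        print(f"d = {d:.2f} -> r = {d_to_r(d):.2f}")
```

The converted values are approximate and are meant only as a rough aid for comparing effect sizes reported in different metrics.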
More recently, Yu et al. (2021) reviewed 24 studies investigating the relation between ToM and bilingualism. Echoing the conclusion of Schroeder (2018), Yu et al. state that the bilingual advantage for ToM development appears modest. These authors also suggest that meta- or socio-linguistic accounts provide a more plausible, albeit less studied, explanation of the ToM advantage than accounts that assume the critical role of executive functions. In our opinion, the conclusion formulated by Yu et al. is premature because there are no longitudinal or experimental studies that have directly investigated how experience with two languages impacts development of ToM. Importantly, all three explanations can in fact be complementary in many ways. For example, socio-linguistic skills can impact ToM directly or via meta-linguistic skills. Alternatively, EF can impact ToM directly or be a mediator between socio-linguistic factors and ToM.
Below, we provide a more in-depth qualitative analysis of the previous research, focusing on the role of the linguistic abilities and language exposure that may have played a crucial role in the outcomes of that research. We constrained our analysis to studies with children older than four years of age but younger than seven years of age. We assume that these children can pass first-order false-belief understanding (FBU) tasks and start passing more sophisticated tasks (e.g., second-order FBU). Moreover, at around the age of seven, children start using “language for learning” and thus develop metalinguistic skills which probably impact ToM abilities (Footnote 2). Additionally, as mentioned above, ToM abilities change quite substantially between the ages of four and seven (see Astington & Hughes, 2013).
Studies on ToM in bilinguals – similarities and differences in methodology
Our review includes thirteen published studies (Footnote 3), of which eight found a bilingual advantage in ToM and five did not. Our aim was to focus on factors which were different across the studies and consider whether these differences might have biased the outcomes. We identified four such factors: participant age and the age range, the sample-matching strategy, the type and number of ToM tasks used, and language proficiency. Below, we summarize the outcomes of our review, while considering each of the factors (see Table 1 for details of the review).
Note: a Age of children is denoted as mean (M) in number of years and months (e.g., 4;4 means 4 years 4 months) and as a range in the same notation. If available, standard deviation (SD) is also presented; b PPVT – Peabody Picture Vocabulary Test Third Edition
Participant age and the age ranges across the studies
The tested samples ranged from relatively small (smallest N = 14) to moderate (largest N = 98). The total age range of the tested children was 2 years 1 month (notation: 2;1) to 6;10; the assessed children's mean age in nine studies was 4;4. Notably, these wide age ranges make a between-studies comparison of ToM and the language abilities of the tested children quite problematic.
Matching strategy
In all the studies, the compared groups were matched on age. Five studies additionally matched the groups on gender, while eight studies gave no information on matching by gender. In seven studies, parental SES or the level of education were comparable between groups; in three studies, the parents of monolingual children had higher SES than the parents of bilinguals; and in three studies, no information regarding SES was provided. Overall, it is clear that the compared samples of bilinguals and monolinguals were not fully matched on the sociodemographic variables that may impact ToM (Hughes et al., 2005).
Type and number of ToM tasks used
In eleven of the thirteen studies included in our analysis, one to six standard ToM tasks (false-belief understanding tasks: deceptive box task or unexpected transfer task, appearance-reality task) were used. In two studies, the tasks measured not ToM per se but social communication abilities (Fan et al., 2015) or cognitive perspective taking (Han & Lee, 2013). The differences in the executive as well as linguistic demands of the different tasks used to measure ToM make it difficult to draw any general conclusion about how a complex ability such as ToM relates to language, which is itself a complex ability and/or a context for learning ToM. Moreover, when ToM is measured with only one or two pass-fail tasks, the reliability of such measurement is disputable.
Language as a factor in study design or in analysis of results
The language of testing seems to be rather critical for the accurate assessment of ToM abilities not only because – as we argued earlier – ToM ability, even in monolinguals, is impacted by language ability, but also because bilinguals’ language skills in each language are typically lower than those of age-matched monolinguals (Bialystok et al., 2010; Bonifacci et al., 2017; Haman et al., 2017; Hoff et al., 2014). Our review revealed that only three of the thirteen reviewed studies tested bilingual participants for ToM abilities in both of their languages (see Table 1). If we recognize that language not only serves to reveal ToM but also enables its development (see Moses, 2001), then studying ToM in both languages of a bilingual child seems critical.
Moreover, in only three of the ten remaining studies in which bilinguals’ ToM was tested in one language were bilingual children tested in their dominant (as objectively tested) or preferred (as indicated by parents) language; in five they were tested in their L2 (language of formal education); and in the other two studies they were tested in their L1 (home language). Again, even when bilinguals’ ToM is tested in the dominant or preferred language, bilinguals could be expected to perform worse than their monolingual peers because they typically have lower language skills in each of their languages compared to monolinguals (e.g., Haman et al., 2017; Łuniewska et al., 2022). Importantly, it is also difficult to speculate whether home language – and, in general, home environment – is more or less important for ToM development than the language input and skills acquired in the education system (this depends on the characteristics of daycares, the time spent there, and the quality of interactions provided by such institutions). Furthermore, across the reviewed studies, the precise characteristics of children's language exposure (i.e., the quantity and quality of input) were measured very differently; thus, it is difficult to compare their results.
Additionally, information about the quantity and quality of input in first and second languages was provided in only four of the 13 studies (see columns 2 and 3 of Table 1). In one of the studies, daily exposure was defined only by the following criteria: “(a) parents of different mother tongues who each address the child in their native language; and (b) daily exposure to both languages” (Kovács, 2009, p. 50). However, in another study (Nguyen & Astington, 2014), more detailed information was provided (i.e., bilinguals were exposed to both English and French before 8 months for a minimum of 30% of the time, and monolinguals were exposed only to their native language from birth for 90% of the time). The lack of information about language input made it impossible to control for this variable. In consequence, the impact of language input on ToM has not been established.
To account for the potentially lower language skills in bilinguals, three of the thirteen reviewed studies matched the tested groups by language skills; eight studies controlled for language skills when comparing ToM in monolinguals and bilinguals. In two of the reviewed studies, language skills were not tested at all; in one study, language proficiency was tested but the results were not provided by the authors. Overall, previous research on ToM only rarely fully controlled for proficiency and exposure in both languages when testing bilingual children, and the study designs typically did not allow the role of language proficiency and language input in the ToM abilities of bilinguals to be directly investigated.
Is the ‘bilingual advantage’ in ToM real or not? The role of language proficiency and exposure
The exact impact of language experience and proficiency on ToM in bilinguals is still unknown. Interestingly, some authors have put forward the idea that it is the mere fact of living in a bilingual language environment rather than the length and/or intensity of this experience that plays a role. For example, Yow and Markman (2015) argued that it is bilingual children's practice in understanding other people's linguistic perspectives which may boost their ToM reasoning, regardless of their language proficiency. Also, Fan et al. (2015) found that even “some regular but limited exposure” (p. 1091) to the second language confers an advantage in perspective-taking ability to the same degree as everyday contact with two languages. Below, we briefly summarize the types of outcomes of previous studies (no bilingual advantage, bilingual advantage, and advantage contingent on language proficiency), with a special focus on whether and how the language proficiency of the tested children was measured and accounted for in the analyses (see Table 1 for details). If information on second-language input was provided in the source articles, we included it in Table 1 (columns 2 and 3).
Out of the five studies that did not find differences in ToM between bilinguals and monolinguals, in three (Dahlgren et al., 2017; Gordon, 2016; Pearson, 2013) monolinguals’ language abilities outperformed those of bilinguals (two other studies gave no information on language performance). This implies that there might have been a competing effect of language proficiency in the language of the ToM testing (in two studies this was L2); had this proficiency been controlled for, a bilingual advantage in ToM skills might have been observed.
Out of the eight studies that found a bilingual advantage in ToM, four reported this advantage both overall and also when language abilities were controlled for. The exact pattern of results is, however, difficult to interpret: in one study (Goetz, 2003; children were tested in both languages), better language skills were observed in monolinguals. In another study (Farhadian et al., 2010; children were tested in their L2), bilinguals presented better language skills than monolinguals. In two other studies, no significant differences in language abilities between bilinguals and monolinguals were observed (Kovács, 2009; Fan et al., 2015; children were tested in L2). Thus, overall, a ToM advantage was observed when language skills were fully controlled for or when bilinguals’ L2 skills were at least not lower than those of monolinguals if L2 was the language of testing.
Importantly, in the remaining four studies, a bilingual advantage in ToM was reported only when the impact of language skills was statistically controlled for or eliminated (see Table 1). In all these studies, monolinguals outperformed bilinguals in language skills. This suggests that bilingualism not only compensates for lower skills in the language of testing that are important for ToM, but also enhances ToM development more than language abilities per se.
Our review highlights the substantial heterogeneity across the studies, which might have contributed to the reported inconsistency of the previous research findings. We identified the following limitations of the previous studies. First, in most of this research there was a wide age range of the tested children. This might have affected the outcomes in an uncontrolled way as differences in age are intertwined with differences and changes in language abilities. Second, in many of the studies, the factors that might have impacted ToM abilities (i.e., age, gender, cognitive abilities, and parental SES) were not systematically controlled for. Third, in most of the studies, ToM was tested in only one of the bilinguals’ languages, and sometimes this language was the children's weaker language. Most importantly, based on previous studies it is unclear how different aspects of bilingual language experience and language skills contribute to ToM.
Current study
Our first goal was to compare Polish–English bilingual children aged 4–6 with Polish monolinguals in Theory of Mind. Our second goal was to better understand the extent to which language proficiency and input explain ToM abilities. Based on the review of previous findings, we formulated the following hypotheses:
H1: Polish–English bilingual children aged 4–6 outperform same-aged monolinguals in Theory of Mind.
H2: In monolinguals, proficiency in their native language relates to ToM.
H3: In bilinguals, both proficiency in L1 and L2 and input in these languages relate to ToM abilities.
While testing these hypotheses, we aimed to circumvent at least some of the limitations of the previous studies. Therefore, we (1) tested a relatively large group of children aged 4–6; (2) carefully matched the compared groups on variables that have been found to impact both ToM and language proficiency in monolinguals (age, gender, SES, cognitive abilities as measured with IQ); (3) tested children's ToM in their dominant language (here: Polish as the home language); (4) obtained data for several linguistic factors that are related to language proficiency and input; and (5) probed a wide range of ToM abilities, including first- and second-order false-belief understanding as well as accuracy and justification of children's answers in eight ToM tasks.
Our participants were Polish–English migrant children aged 4–6 in the UK, and a group of Polish monolingual peers in Poland. A battery of eight tasks was used to assess ToM (Test of Reflection on Thinking; TRT, Białecka-Pikul et al., 2018), which allowed us to calculate four indices: (1) overall accuracy index, (2) overall justification index, (3) first-order false-beliefs index, and (4) second-order false-beliefs index. We considered five predictors associated with either language proficiency or language exposure in bilinguals. The first two predictors were related to language proficiency as measured via language comprehension and based on performance scores in receptive vocabulary tests in L1 and L2. The remaining three predictors related to language exposure in bilinguals and were obtained via parental reports: length of L2 (English) exposure (in months); the accumulated language input in L1 and L2, i.e., cumulative language exposure indices based on both the total time spent in Poland and in the UK and on the amount and intensity of bilinguals’ exposure to their languages in both these countries (see below and see also Haman et al., 2017).
To test our first hypothesis, we conducted a series of ANCOVAs. Then, we conducted a series of regression analyses with the four TRT indices as dependent variables. For hypothesis 2, we regressed ToM on L1 proficiency (after controlling for age, gender, SES, and cognitive abilities as measured with IQ); for hypothesis 3, the five predictors related to language proficiency and input (described in detail below) were entered (again, after controlling for demographic variables and cognitive abilities). To supplement the analyses related to the first hypothesis, we then performed Bayesian analyses (see analytical strategy for details).
Method
Participants
Participants were children who took part in a large-scale project on the linguistic and cognitive development of bilingual children (Bi-SLI-PL, see Acknowledgements for details), carried out within the European COST Action IS0804. Overall, 173 Polish–English migrant children living in the UK and 311 Polish monolingual children living in Poland were tested in the project. All children were of school entrance age (age 4–6, see footnote 2). Written parental consent and children's assent were obtained for all participants. The participants were not reimbursed, but the children received “small rewards” (books, stickers, CDs with songs/nursery rhymes). The whole procedure was evaluated and accepted by the Ethics Committee at the Faculty of Psychology, University of Warsaw.
The analyses presented in the current paper are based on the biggest possible subsamples from the group of Polish–English bilinguals and Polish monolinguals (see Haman et al., 2017 and supplementary materials, Appendix 1 for the full description of how we selected the subsamples of monolinguals and bilinguals to make them as comparable as possible with respect to the control variables). In total, data from 102 children (51 bilingual and 51 monolingual) were considered for the comparison of ToM performance in bilinguals and monolinguals. The characteristics of the overall sample and the subsamples are presented in Table 2.
Note: OTSR: Obrazkowy Test Słownikowy Rozumienie (The Picture Vocabulary Test – Comprehension in Polish), BPVS: British Picture Vocabulary Scale. TRT: Test of Reflection on Thinking. The overall bilingual and monolingual samples consist of children who match a profile of a typically developing child and have performed the ToM task. The subsamples consist of children who performed the ToM task in Polish, performed the non-verbal IQ task, obtained at least the 5th percentile in the Polish word comprehension test, and had the full set of background information (e.g., age, SES, and in case of bilinguals, information on language exposure).
The groups were identical in terms of gender distribution (31 girls in each group), and there were no differences in their age in months: t(100) = -0.57, p = .570 (two-tailed). The number of years of mothers’ education was comparable for the two language groups, t(100) = -0.22, p = .829 (two-tailed), and most mothers had higher education (in the bilingual group – 35 mothers, i.e., 68.6%; in the monolingual group – 37 mothers, i.e., 72.5%). Moreover, there was no difference between the groups in non-verbal IQ, t(100) = 0.294, p = .769 (two-tailed). As such, the bilingual and monolingual groups were comparable in terms of basic cognitive and socio-demographic characteristics.
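As a reading aid for the group comparisons above, the two-tailed p value that corresponds to a reported t statistic and its degrees of freedom can be recomputed directly. The sketch below is not the authors' analysis code: the function names are hypothetical, and it integrates the Student t density numerically with Simpson's rule so that it needs only the standard library.

```python
from math import exp, lgamma, log, pi


def t_pdf(x: float, df: float) -> float:
    """Density of Student's t distribution with `df` degrees of freedom."""
    log_norm = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_norm - (df + 1) / 2 * log(1 + x * x / df))


def two_tailed_p(t: float, df: float, steps: int = 2000) -> float:
    """Two-tailed p value for a t statistic (Simpson's rule, `steps` must be even)."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # probability mass between 0 and |t|
    return max(0.0, 1.0 - 2.0 * area)


if __name__ == "__main__":
    # t statistics for age, maternal education, and non-verbal IQ, df = 100.
    for t in (-0.57, -0.22, 0.294):
        print(f"t(100) = {t}: p = {two_tailed_p(t, 100):.3f}")
```

Because the reported t values are rounded, the recomputed p values should only be expected to agree with the reported ones approximately.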
With regard to proficiency in L1 (Polish), indicated by the percentile score obtained in OTSR (Obrazkowy Test Słownikowy – Rozumienie, i.e., The Picture Vocabulary Test – Comprehension; Haman & Fronczyk, Reference Haman and Fronczyk2012), monolinguals scored higher than bilinguals: t(100) = 2.02, p = .046 (two-tailed). In the overall sample, monolinguals largely outperformed bilinguals. The matching procedure made the subsamples as similar as possible, but even though it diminished the between-group difference to the verge of significance, it was impossible to obtain equal L1 performance in both groups.
Measures and procedure
Below, we present a detailed description of the main tasks we used: the TRT, which is a measure of ToM; two auditory word comprehension tests; and, most importantly, all measures of L1 and L2 exposure in bilinguals. Note that the complete testing battery used in this study is described in more detail elsewhere (Haman et al., Reference Haman, Wodniecka, Marecka, Szewczyk, Białecka-Pikul, Otwinowska, Mieszkowska, Łuniewska, Kołak, Miękisz, Kacprzak, Banasik and Foryś-Nogala2017); only a short description of all the tasks is presented in the Procedure section.
ToM
Test of Reflection on Thinking (TRT, Białecka-Pikul et al., Reference Białecka-Pikul, Szpak, Haman and Mieszkowska2018)
The TRT was developed for children over four years old and constitutes a battery of nine tasks (one training task and eight testing tasks) in the form of illustrated stories. More specifically, the TRT's tasks or stories assess a child's understanding of appearance-reality, first-order beliefs (i.e., the unexpected transfer test and the deceptive box test), understanding of interpretation, deception, ambiguity, understanding of surprise, and second-order beliefs. The nine stories are presented by the experimenter in a set order and are aided by pictures (two to five in each story) displayed on a laptop screen (19”). Table S1 in the Supplementary Materials shows a sample story with the accompanying pictures; a detailed description of all tasks is shown in Table S2. Each story describes the actions of two protagonists (two boys or two girls because there are two gender-related versions of TRT). To measure ToM abilities, after each story the child is asked to predict the protagonist's behavior or thoughts (e.g., “where will Evan look for the book?”) and also to explain the protagonist's behavior (e.g., “why will Evan be looking there?”). In other words, in TRT two kinds of questions are asked after each story, so two indices of ToM can be calculated: the overall accuracy index and the overall justification index. The overall accuracy index is the sum of points scored in the questions concerning the behavior, thoughts or emotions of the protagonists (e.g., “what will she do?”, “what will she think?”, “how will she feel?”). For each question, a child can score one point for a correct answer and zero points for an incorrect answer, an “I don't know” answer, or no answer. The overall justification index is the sum of points scored in the “why?” questions.
A child can score one point for a correct answer without clear mental references (e.g., if a child explains the protagonist's behaviors by referring to a situation or desires), and two points when clear mental references (i.e., thoughts, knowledge, beliefs) are provided. Zero points are given for a wrong answer, an “I don't know” answer, no answer, and if the answer for the accuracy question was scored zero.
Five (out of nine) stories include control questions (memory questions, e.g., “where is the book now?”); if a child failed to provide a correct answer to the memory question, he/she received 0 points for the story. The first story in TRT serves as a training story, so the child's answer is not included in the calculations of the ToM indices. Thus, a child could score a maximum of 8 points on the ToM overall accuracy index and a maximum of 16 points on ToM overall justification index. To provide a more detailed analysis of the ToM variable, the two additional indices related to the accuracy of answers in the first- and second-order false-belief tasks were calculated. The first-order false-beliefs index is a sum of the accuracy scores of two tasks: the unexpected transfer test (story 2 in TRT) and the deceptive box test (story 3 in TRT). The second-order false-beliefs index is a sum of scores for the two second-order false-belief tasks (stories 8 and 9 in TRT).
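The scoring rules described above can be summarized in a short sketch. This is only an illustration of the rules as stated in the text, not the authors' scoring software; the `StoryResponse` class, its field names, and the justification labels are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class StoryResponse:
    """One child's coded answers to one TRT story (hypothetical structure)."""
    story: int                      # 1-9; story 1 is the training story
    accuracy_correct: bool          # answer to the prediction question
    justification: str              # "mental", "non_mental", or "wrong"
    control_correct: Optional[bool] = None  # None if the story has no control question


def score_story(r: StoryResponse) -> Tuple[int, int]:
    """Return (accuracy points, justification points) for one story."""
    # A failed control (memory) question yields 0 points for the whole story.
    if r.control_correct is False:
        return 0, 0
    if not r.accuracy_correct:
        # Justification is scored 0 when the accuracy question was scored 0.
        return 0, 0
    just = {"mental": 2, "non_mental": 1}.get(r.justification, 0)
    return 1, just


def trt_indices(responses: List[StoryResponse]) -> Tuple[int, int]:
    """Sum both indices over the eight test stories (story 1 is excluded)."""
    acc_total = 0
    just_total = 0
    for r in responses:
        if r.story == 1:
            continue
        acc, just = score_story(r)
        acc_total += acc
        just_total += just
    return acc_total, just_total


# A child who answers every story correctly with clear mental references
# reaches the maxima stated in the text: 8 and 16 points.
perfect = [StoryResponse(s, True, "mental", True) for s in range(1, 10)]
print(trt_indices(perfect))
```

Excluding story 1 inside `trt_indices` keeps the training story in the raw data while leaving it out of both sums.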
To check the reliability of the coding system for TRT, the inter-rater reliability (for two independent coders), measured on a randomly selected subsample of monolinguals (n = 38), was calculated and assessed as satisfactory (kappas ranged from .84 to 1.00 for tasks on both scales). The internal consistency of both indices, measured with alphas (n = 254), was also satisfactory (.62 for the ToM accuracy index and .64 for the ToM justification index). There are also data that support the good convergent and content validity of the TRT in monolinguals (see Białecka-Pikul et al., Reference Białecka-Pikul, Szpak, Haman and Mieszkowska2018).
Language factors
Auditory word comprehension was measured in English via the British Picture Vocabulary Scale – Third Edition (BPVS, Dunn et al., Reference Dunn, Dunn and Styles2009), and in Polish via Obrazkowy Test Słownikowy – Rozumienie, OTSR (The Picture Vocabulary Test – Comprehension; Haman & Fronczyk, Reference Haman and Fronczyk2012). Both tests (BPVS and OTSR) are published and normed on monolingual populations and were designed to assess the comprehension of nouns, verbs, and adjectives. The two tests have similar instructions: children are presented with boards of four pictures and asked to point to the picture that appropriately depicts the target word.
Three measures of language experience in the bilingual group were used: (1) length of time of L2 (English) exposure; (2) an index of cumulative language exposure to L1; and (3) an index of cumulative language exposure to L2. All three measures were based on the information from a parental questionnaire for bilingual pre-school and early-school children, i.e., a Polish adaptation of PABIQ [Questionnaire for Parents of Bilingual Children, (Tuller, Reference Tuller, Armon-Lotem, de Jong and Meir2015); Polish adaptation by Kuś et al. (Reference Kuś, Otwinowska, Banasik and Kiebzak-Mandera2012, unpublished)]. Parental answers provided detailed information on each child's early development and language background. The length of L2 (English) exposure was calculated as the time (in months) between the age of first contact with L2 and the time of testing. The indices of cumulative language exposure to L1 and L2 were based on the total time spent in Poland and in the UK (in the child's lifetime), as well as the amount and quality of exposure to language received in each of these countries. The indices were calculated as follows (see also Haman et al., Reference Haman, Wodniecka, Marecka, Szewczyk, Białecka-Pikul, Otwinowska, Mieszkowska, Łuniewska, Kołak, Miękisz, Kacprzak, Banasik and Foryś-Nogala2017). First, we estimated the extent of each child's exposure to each language when living in the United Kingdom on the basis of the parental questionnaire: parents reported, on a 5 point Likert scale, how often their child was addressed in English and Polish in particular communicative situations (e.g., parents talking to the child, other children talking to the child). These scores were aggregated to estimate the children's exposure to Polish and to English during their stay in the United Kingdom. The maximum score for each language was 91, which would indicate that when living in a given country (e.g., Poland) the child had no contact with English. 
The final index reflected the time spent in Poland and in the United Kingdom in the lifetime of each child, as well as the amount of exposure the child had received in each of these countries. The index of the cumulative exposure to Polish was calculated using the following formula: (time spent in Poland) ∗ 91 (Footnote 4) + (time spent in the United Kingdom) ∗ (exposure to Polish while in the United Kingdom). The actual unit of measurement used to calculate the index was the child's age in days, represented as years (in decimals). The index of cumulative exposure to English was calculated as (the time spent in Poland) ∗ 0 (Footnote 5) + (the time spent in the United Kingdom) ∗ (the exposure to English while in the United Kingdom).
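As an illustration, the two cumulative exposure formulas can be written as a small function. This is a minimal sketch that follows the formulas as stated; the function name, parameter names, and the example child are hypothetical, and time is assumed to be already expressed in decimal years.

```python
MAX_EXPOSURE_SCORE = 91  # maximum questionnaire score for one language


def cumulative_exposure(years_in_poland: float,
                        years_in_uk: float,
                        polish_score_in_uk: float,
                        english_score_in_uk: float) -> tuple:
    """Return (L1 index, L2 index) following the formulas in the text.

    While in Poland the child is assumed to receive the maximum Polish
    score (91) and no English (0), as the formulas specify.
    """
    l1_index = (years_in_poland * MAX_EXPOSURE_SCORE
                + years_in_uk * polish_score_in_uk)
    l2_index = (years_in_poland * 0
                + years_in_uk * english_score_in_uk)
    return l1_index, l2_index


# Hypothetical child: 2.0 years in Poland, 3.5 years in the UK,
# questionnaire scores of 60 (Polish) and 40 (English) while in the UK.
l1, l2 = cumulative_exposure(2.0, 3.5, 60, 40)
print(l1, l2)  # 392.0 140.0
```

Because time in Poland contributes nothing to the English index, a child who arrived in the United Kingdom late has a low L2 index even with heavy current English exposure.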
Procedure
All children were tested individually in a quiet room: the monolingual Polish children in their preschools or in their homes in Poland; the bilingual children in their schools or in their homes in the United Kingdom (for details see Haman et al., Reference Haman, Wodniecka, Marecka, Szewczyk, Białecka-Pikul, Otwinowska, Mieszkowska, Łuniewska, Kołak, Miękisz, Kacprzak, Banasik and Foryś-Nogala2017). In total, each monolingual child was tested over three to four sessions, and each bilingual child was tested in both languages over five to seven testing sessions (each lasting 45–90 min). The duration of each session depended on the child's pace. The order of the tasks in the testing sessions was counterbalanced across participants. The tasks in Polish were administered by a native speaker of Polish, while the tasks in English (not included in the present report) were administered by a native speaker or a highly proficient speaker of English. Polish and English were never tested on the same day. Each child did 14 tasks in the dominant language and eight tasks in a non-dominant language (see Supplementary Materials, Appendix 2 for the list of tasks). Here, we report only data for the tasks described above and the Raven Scale (Jaworowska & Szustrowa, Reference Jaworowska and Szustrowa2003), performed in the dominant language – namely, Polish.
Results
The analytical strategy
The statistical analyses are presented in the following way. First, we report the preliminary analysis that compares the matched bilingual and monolingual groups on the control variables: age, socioeconomic status, non-verbal intelligence, and Polish (L1) receptive vocabulary size. These comparisons were done with the use of frequentist inference – namely, t-tests. Next, we compared the groups on the four indices provided by TRT: overall accuracy and justification indices, and first-order and second-order false-belief ToM indices. Because the groups could not be fully matched on L1, we included Polish (L1) receptive vocabulary size as a control variable in these comparisons. This was done using a series of ANCOVAs with language group (bilingual, monolingual) as the grouping factor, each of the TRT indices as the dependent variable, and the percentile on the Polish (L1) comprehension test as the covariate. However, frequentist inference only provides evidence against the null hypothesis and cannot provide probabilistic evidence in favor of the alternative hypothesis. We therefore also employed Bayesian inference – namely, a Bayesian ANCOVA with a Bayes Factor (Footnote 6; Hoijtink et al., Reference Hoijtink, Mulder, van Lissa and Gu2019).
Finally, we present the results of a series of regression analyses which looked for predictors of TRT performance in the bilingual and monolingual groups. For the overall TRT accuracy and justification indices, we used a series of hierarchical regression analyses; for the first-order and second-order false-beliefs indices (which are binary variables), we used a series of logistic regression analyses.
Theory of Mind in bilinguals and monolinguals
The results indicated that bilinguals did not outperform monolinguals in the TRT overall accuracy index (F(1,99) = 1.17, p = .283, ƞ2 = .012). The effect of the OTSR percentile covariate was statistically significant (F(2,98) = 13.09, p < .001, ƞ2 = .117, large effect size). We also calculated the Bayes Factors for the comparison between monolinguals and bilinguals. For the TRT overall accuracy index, the data provided moderate evidence in favor of the null hypothesis (BF01 = 3.60) and, correspondingly, virtually no evidence for the alternative hypothesis (BF10 = 0.28). Thus, the Bayes Factor supported the absence of a group difference in the overall TRT accuracy index.
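For readers who want to check effect sizes against the reported F statistics, the sketch below recovers an eta-squared value from F and its degrees of freedom. It assumes that the reported ƞ2 values are partial eta squared, which the text does not state explicitly; the function name is hypothetical.

```python
def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Partial eta squared recovered from an F statistic.

    Uses the identity eta_p^2 = (F * df_effect) / (F * df_effect + df_error).
    """
    numerator = f_value * df_effect
    return numerator / (numerator + df_error)


# Group effect on the overall accuracy index, F(1,99) = 1.17
print(round(partial_eta_squared(1.17, 1, 99), 3))  # 0.012
```

Under this assumption, the value computed for the group effect agrees with the reported ƞ2 = .012.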
As for the overall TRT justification index, the groups did differ under classical hypothesis testing (F(1,99) = 5.95, p = .016, ƞ2 = .057, medium effect size). The effect of the OTSR percentile covariate was statistically significant (F(2,98) = 17.76, p < .001, ƞ2 = .152, large effect size). However, the Bayes Factor provided only weak evidence in favor of the null hypothesis, BF01 = 1.73, and the evidence for the alternative hypothesis was even weaker, BF10 = 0.58. Thus, in this case, Bayesian inference could not help us choose one hypothesis over the other.
We also checked whether there were group differences in performance in the TRT first- and second-order false-beliefs indices. The results showed that bilingual children did not differ from monolinguals in the first- or second-order false-beliefs indices: F(1,99) = 0.02, p = .890, ƞ2 = .001 and F(1,99) = 1.27, p = .262, ƞ2 = .013, respectively. All descriptive statistics are presented in Table 2 in the Method Section. The effect of the OTSR percentile covariate was statistically significant in both analyses (F(2,98) = 7.79, p < .01, ƞ2 = .073, medium effect size and F(2,98) = 4.42, p < .05, ƞ2 = .042, medium effect size, respectively). The Bayes Factor indicated moderate evidence for the null hypothesis in the first-order false-beliefs index, BF01 = 4.44, and a slight preference for the null hypothesis in the second-order false-beliefs index, BF01 = 1.66. In essence, the Bayes Factors indicated moderate to weak evidence for the null hypothesis, which states that the performance on first- and second-order false-belief tasks is similar in both the bilingual and monolingual groups.
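The two Bayes Factors reported for each comparison are reciprocals of one another, which the following sketch makes explicit. The verbal labels use commonly cited cut-offs (1 to 3 weak, 3 to 10 moderate, above 10 strong); these thresholds are an assumption made for illustration and are not stated in the text.

```python
def bf10_from_bf01(bf01: float) -> float:
    """BF10 is the reciprocal of BF01 (and vice versa)."""
    return 1.0 / bf01


def evidence_label(bf: float) -> str:
    """Verbal label for a Bayes Factor in favor of a hypothesis.

    Cut-offs are an assumed convention, not taken from the source.
    """
    if bf < 1:
        return "favors the other hypothesis"
    if bf < 3:
        return "weak"
    if bf < 10:
        return "moderate"
    return "strong"


for bf01 in (3.60, 1.73, 4.44, 1.66):
    print(bf01, round(bf10_from_bf01(bf01), 2), evidence_label(bf01))
```

Under these assumed cut-offs, the labels match the wording used for the four reported BF01 values.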
The language predictors of ToM
Separate regression models were used to analyze the predictors of ToM performance for each language group, controlling for age, gender, socioeconomic status, and non-verbal IQ. For the monolingual group, only the auditory word comprehension test (OTSR) was added to the regression model. For bilingual children, a series of regression models with different language predictors were calculated: (a) L1 (Polish) and L2 (English) word comprehension (as bilinguals scored lower than monolinguals); (b) L2 length of exposure; (c) cumulative language exposure to L1; and (d) cumulative language exposure to L2.
Monolinguals: regression models for overall TRT accuracy and overall TRT justification indices
Overall TRT accuracy index
The hierarchical regression analysis revealed that sociodemographic variables accounted for 43% of the variance in the TRT overall accuracy index: F(4,46) = 10.44, p < .001. Among these variables, age, β = .34, p = .013, and non-verbal IQ, β = .41, p = .003, were significant predictors (see Table S3 in Supplementary materials). After adding the L1 word comprehension index in Step 2, the total variance explained by the model as a whole was 49%, and the model was statistically significant, F(5,45) = 10.59, p < .001. The inclusion of the L1 word comprehension index explained an additional 6% of variance in the overall TRT accuracy index, ΔR2 = .06, F(1,45) = 6.34, p = .015. In the final adjusted model, age, β = .45, p = .001, non-verbal IQ, β = .28, p = .044, and L1 word comprehension index, β = .29, p = .015, were significant predictors of the overall TRT accuracy index.
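The test of whether a hierarchical step adds explained variance can be sketched as follows. This is the standard F-change formula for nested linear models; the function name and the input values are hypothetical, and because the percentages in the text are rounded, plugging them in will not reproduce the reported F values exactly.

```python
def f_change(r2_reduced: float, r2_full: float,
             n: int, k_full: int, q: int = 1) -> tuple:
    """F test for the R-squared change between nested regression models.

    r2_reduced: R^2 of the model before the step
    r2_full:    R^2 of the model after the step
    n:          sample size
    k_full:     number of predictors in the fuller model
    q:          number of predictors added in the step
    Returns (F, df1, df2).
    """
    df1 = q
    df2 = n - k_full - 1
    f_value = ((r2_full - r2_reduced) / df1) / ((1.0 - r2_full) / df2)
    return f_value, df1, df2


# Hypothetical example: one predictor added to a four-predictor model, n = 51.
f_value, df1, df2 = f_change(0.40, 0.50, n=51, k_full=5, q=1)
print(round(f_value, 2), df1, df2)  # 9.0 1 45
```

With 51 children per group, adding one predictor to the four-variable base model gives residual degrees of freedom of 45, matching the F(1,45) reported for the Step 2 change.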
Overall TRT justification index
As regards the overall TRT justification index, the base model with sociodemographic variables was statistically significant, F(4,46) = 4.68, p = .003, and accounted for 23% of the variance in the overall TRT justification index. In this model, only non-verbal IQ significantly predicted the quality of monolingual children's justifications in TRT: β = .41, p = .010. Adding the L1 word comprehension index explained an additional 7% of variance, ΔR2 = .07, F(1,45) = 5.44, p = .024, and the final model was statistically significant, F(5,45) = 5.19, p = .001. In this model, the L1 vocabulary comprehension index, β = .32, p = .024, was the only significant predictor of the overall TRT justification index.
Monolinguals: logistic regression models for the TRT first- and second-order false-beliefs indices
Two logistic regression models were used to identify the predictors of the monolinguals’ performance on the TRT first- and second-order false-beliefs indices. In order to run a logistic regression model, the scores from these two indices were transformed from three-level factors (0, 1, or 2 points) to two-level factors (0 points vs. > 0 points). When reporting statistically significant predictors, we report only those predictors which were significant and for which the 95% confidence interval (CI) for the odds ratio (OR) did not include 1 (if the CI of the OR includes 1, no reliable association between the predictor and the outcome can be claimed).
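The link between a logistic regression coefficient, its odds ratio, and the "CI does not include 1" criterion can be sketched as below. The Wald-type interval used here is an assumption made for illustration; the text does not state how the authors computed their intervals, and the function names are hypothetical.

```python
import math


def odds_ratio_with_ci(b: float, se: float, z: float = 1.96) -> tuple:
    """Return (OR, lower, upper) for a logistic regression coefficient.

    OR = exp(b); the 95% Wald interval is exp(b +/- z * SE) (assumed method).
    """
    return math.exp(b), math.exp(b - z * se), math.exp(b + z * se)


def ci_excludes_one(lower: float, upper: float) -> bool:
    """True when the interval lies entirely above or entirely below 1."""
    return lower > 1.0 or upper < 1.0


# Coefficient and standard error of the kind reported in the text.
or_, lo, hi = odds_ratio_with_ci(0.27, 0.13)
print(round(or_, 2), ci_excludes_one(lo, hi))  # 1.31 True
```

An interval that straddles 1 means the data are compatible with the predictor either raising or lowering the odds of scoring above zero, which is why such predictors are not reported as significant.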
First-order false-beliefs index
The first model that was used to predict performance on the first-order false-beliefs index (i.e., whether the children scored any points at all or none) included all the sociodemographic variables: gender, age, non-verbal IQ and socio-economic status. The analysis revealed only a significant effect of non-verbal IQ (b = 0.27, SE = 0.13, OR = 1.31, p = 0.047). The second model included the sociodemographic variables and the L1 word comprehension index. However, the effect of language proficiency was non-significant (b = 0.31, SE = 0.19, OR = 1.36, p = 0.107). Model 1 showed a lower Akaike's Information Criterion (AIC) value, which suggests a more parsimonious model (AIC estimates the relative quality of each model). We followed the model selection criteria set out by Burnham and Anderson (Reference Burnham and Anderson2004): we calculated the difference in AIC values between each model and the model with the lowest AIC. The greater the difference, the less likely it is that the model is the best approximating model among the candidates in the set. Model 2 showed a difference in AIC values larger than 10 (ΔAIC = 14.6), which yields essentially no support for the model as being the best approximating model in the candidate set. Details of the full models are provided in Table S4.
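The ΔAIC comparison can be sketched as follows. The support bands are the rules of thumb quoted in this section (differences up to 2, between 4 and 7, and above 10); the AIC values in the example are hypothetical, since the text reports only the differences, and the function names are illustrative.

```python
def delta_aic(aic_by_model: dict) -> dict:
    """Difference between each model's AIC and the lowest AIC in the set."""
    best = min(aic_by_model.values())
    return {name: aic - best for name, aic in aic_by_model.items()}


def support_level(delta: float) -> str:
    """Rule-of-thumb reading of a delta-AIC value, as quoted in the text."""
    if delta <= 2:
        return "substantial support"
    if 4 <= delta <= 7:
        return "considerably less support"
    if delta > 10:
        return "essentially no support"
    return "between the quoted bands"


# Hypothetical AIC values chosen so that the difference equals 14.6.
deltas = delta_aic({"Model 1": 60.0, "Model 2": 74.6})
for name, d in deltas.items():
    print(name, round(d, 2), support_level(d))
```

The model with the lowest AIC always has a difference of 0, so every other model is read relative to it rather than on an absolute scale.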
Second-order false-beliefs index
The base model that included the sociodemographic variables was run for the second-order false-beliefs index; it revealed significant effects of gender (girls scored higher than boys, b = 1.71, SE = 0.84, OR = 5.55, p = 0.041) and non-verbal IQ (b = 0.26, SE = 0.12, OR = 1.29, p = 0.030). Model 2, extended by the L1 word comprehension index, revealed the same pattern of results: there were only two significant effects, of gender (b = 1.65, SE = 0.85, OR = 5.21, p = 0.041) and of non-verbal IQ (b = 0.24, SE = 0.12, OR = 1.27, p = 0.030). The effect of language proficiency was non-significant (b = 0.01, SE = 0.02, OR = 1.01, p = 0.537). Model 2 showed a slightly higher AIC than Model 1. The difference in AIC between Model 2 and Model 1 was smaller than 2 (ΔAIC = 1.62), which means that Model 2 retains substantial support as a candidate for the best approximating model in the set. Details of the full models are provided in Table S4.
Bilinguals: regression models for the TRT overall accuracy and justification indices
The base model (Step 1 in all further regressions) that included the sociodemographic and cognitive variables explained 26%, F(4,46) = 5.46, p = .001, and 41%, F(4,46) = 9.68, p < .001, of variance in the TRT overall accuracy index and the TRT overall justification index, respectively. As regards the overall accuracy index, only age was a significant predictor, β = .38, p = .009; for the overall justification index, both age, β = .36, p = .005, and non-verbal IQ, β = .42, p = .001, were significant.
Overall TRT accuracy index: (a) Role of L1 and L2 word comprehension.
In the case of the TRT overall accuracy index, when the L1 word comprehension score was added in Step 2 (model 2), the total variance explained by the model as a whole was 34%, F(5,45) = 6.25, p < .001, and the change in the explained variance was statistically significant, ΔR2 = .08, F(1,45) = 6.83, p = .012. Although model 3 (Step 3) with the L2 word comprehension score as a predictor was also statistically significant, F(6,44) = 5.65, p < .001, the change in the explained variance was non-significant, ΔR2 = .02, F(1,44) = 1.97, p = .168. Indeed, in the final adjusted model, only age and the L1 word comprehension index were significant predictors of the overall TRT accuracy index (see Table S5).
Overall TRT accuracy index: (b) Role of length of English (L2) exposure
Adding second-language experience (as measured by length of time of English (L2) exposure) to the model did not increase the explained variance in the overall TRT accuracy index, ΔR2 = .02, F(1,45) = 1.28, p = .265. However, in Step 3, adding L1 comprehension increased the explained variance in the overall TRT accuracy index, ΔR2 = .09, F(1,44) = 7.14, p = .011, and the final model was statistically significant for the accuracy index, F(6,44) = 5.60, p < .001. The overall accuracy index was predicted by age and L1 comprehension (Table S6).
Overall TRT accuracy index: (c) The role of cumulative language exposure to first and second language.
Adding L1 cumulative language exposure to the model in Step 2 did not increase the explained variance in the overall TRT accuracy index, ΔR2 = .02, F(1,45) = 1.37, p = .247, although the model at this step was statistically significant, F(5,45) = 4.68, p = .002. Similarly, adding L2 cumulative language exposure in Step 3 did not increase the explained variance in the overall accuracy index, ΔR2 = .02, F(1,44) = 1.28, p = .265. However, adding L1 comprehension in Step 4 increased the explained variance in the overall TRT accuracy index, ΔR2 = .10, F(1,43) = 8.41, p = .006. The final model was statistically significant for the accuracy index, F(7,43) = 5.09, p < .001. Thus, the overall TRT accuracy index was predicted by age and L1 comprehension (Table S7).
Overall TRT justification index: (a) Role of L1 and L2 word comprehension.
With regard to the overall TRT justification index, adding the L1 word comprehension score in Step 2 increased the explained variance to 48%, ΔR2 = .07, F(1,45) = 7.56, p = .009, and adding the L2 word comprehension index to the model in Step 3 resulted in an additional 6% of explained variance, ΔR2 = .06, F(1,44) = 6.00, p = .018. In Step 2, F(5,45) = 10.36, p < .001, and Step 3, F(6,44) = 10.60, p < .001, both models were statistically significant. In the final model, three variables were significant predictors of the overall TRT justification index: age, L1 word comprehension index, and L2 word comprehension index (see Table S5).
Overall TRT justification index: (b) Role of length of English (L2) exposure.
Adding second-language experience (as measured by length of time of English (L2) exposure) to the model did not increase the explained variance in the overall TRT justification index, ΔR2 = .01, F(1,45) = 0.87, p = .356. However, in Step 3, adding L1 comprehension increased the explained variance in the overall TRT justification index, ΔR2 = .08, F(1,44) = 7.91, p = .007. The final model was statistically significant for the justifications, F(6,44) = 8.91, p < .001. The overall justification index was predicted by age, non-verbal IQ, and L1 comprehension (Table S6).
Overall TRT justification index: (c) The role of cumulative language exposure to first and second language.
Adding L1 cumulative language exposure to the model in Step 2 did not increase the explained variance in the overall justification index, ΔR2 = .01, F(1,45) = 1.12, p = .295, although the model at this step was statistically significant, F(5,45) = 7.99, p < .001. Similarly, adding L2 cumulative language exposure in Step 3 did not increase the explained variance in the overall justification index, ΔR2 = .01, F(1,44) = 1.20, p = .278. However, adding L1 comprehension in Step 4 increased the explained variance in the overall TRT justification index, ΔR2 = .10, F(1,43) = 9.87, p = .003. The final model was statistically significant for the justifications, F(7,43) = 8.51, p < .001. The overall TRT justification index was predicted by age, non-verbal IQ and L1 comprehension (Table S7).
Bilinguals: logistic regression models for the first- and second-order false-beliefs indices
A series of logistic regression models was constructed to identify the predictors of the bilinguals’ performance on the TRT first- and second-order false-beliefs indices. The scores from the first- and second-order false-beliefs tasks were transformed from three-level factors (0, 1, or 2 points) to two-level factors (0 points vs. > 0 points).
First-order false-beliefs index
The base model for predicting performance on the first-order false-beliefs index (i.e., whether the children scored any points at all or none) included all the sociodemographic and cognitive variables, i.e., gender, age, socio-economic status, and non-verbal IQ. The analysis only revealed a significant effect of gender (b = 2.70, SE = 1.26, OR = 14.94, p = 0.032). In Model 2, the effect of the L1 word comprehension index was non-significant (b = 0.02, SE = 0.03, OR = 1.02, p = 0.546). In Model 3, the effect of the L1 word comprehension index was still non-significant (b = 0.03, SE = 0.05, OR = 1.03, p = 0.548), but the effect of L2 word comprehension was on the verge of significance (b = 0.21, SE = 0.10, OR = 1.23, p = 0.046). In Model 4, the effect of the length of English exposure was non-significant (b = 0.02, SE = 0.03, OR = 1.02, p = 0.571). In Model 5, the effect of L1 cumulative language exposure was non-significant (b = -0.01, SE = 0.01, OR = 0.99, p = 0.433). In Model 6, the effect of L1 cumulative language exposure was non-significant (b = -0.01, SE = 0.01, OR = 0.99, p = 0.488), as was the effect of L2 cumulative language exposure (b = 0.10, SE = 0.05, OR = 1.10, p = 0.709). The lowest AIC value (indicative of the most parsimonious model) was obtained for Model 3, with sociodemographic variables, non-verbal IQ and L1 and L2 word comprehension as predictors (see Table S8). The second best model, Model 1, showed a difference in AIC above 4 and below 7 (ΔAIC = 5.98), which yields considerably less support for the possibility that this model could be the best approximating model in the candidate set. Other models yielded a difference in AIC above 7, providing little support for them being the best approximating models (see Burnham & Anderson, Reference Burnham and Anderson2004 for rules-of-thumb for ΔAIC).
Second-order false-beliefs index
The base model, which included the sociodemographic variables and non-verbal IQ as predictors, was run for the second-order false-beliefs index, but it revealed no significant effects. Model 2 showed significant effects of age (b = 0.00, SE = 0.01, OR = 1.00, p = 0.047), of SES (b = 0.28, SE = 0.13, OR = 1.32, p = 0.027), and of the L1 vocabulary comprehension index (b = 0.05, SE = 0.02, OR = 1.05, p = 0.022). Model 3 revealed the same pattern as Model 2 regarding age, SES and L1 word comprehension index, but the effect of L2 word comprehension was non-significant (b = 0.01, SE = 0.02, OR = 1.01, p = 0.475). Model 4 revealed a significant effect of length of L2 exposure (b = 0.04, SE = 0.02, OR = 1.04, p = 0.044). Model 5 revealed no significant effect of cumulative L1 language exposure (b = -0.01, SE = 0.01, OR = 0.99, p = 0.095). In Model 6, neither the effect of L1 (b = -0.01, SE = 0.01, OR = 0.99, p = 0.135) nor that of L2 cumulative language exposure (b = 0.00, SE = 0.01, OR = 1.00, p = 0.599) was significant. The lowest AIC value (indicative of the most parsimonious model) was obtained for Model 2, in which sociodemographic variables (age and SES) and L1 word comprehension index were significant predictors. The second best model, Model 3, showed a small difference in AIC relative to the best model (ΔAIC = 1.47), giving substantial support for the possibility that this model could alternatively be the best approximating model. Models 4 and 5 also showed small differences in AIC relative to the best model (Model 4: ΔAIC = 2.5, Model 5: ΔAIC = 3.77), and the remaining models (Model 1 and Model 6), with ΔAIC between 4 and 7, received considerably less support as the best approximating models. Details of the full models are provided in Table S8.
Discussion
The goal of the current research was to explore the potential differences in ToM between bilinguals and monolinguals aged 4–6. We contrasted a group of Polish–English sequential bilinguals (Polish migrants to the UK) with a group of monolingual peers living in Poland. Importantly, we made every effort to carefully match the compared groups on several factors that have been previously established as predictors of ToM in monolinguals: age, gender, SES, IQ and L1 word comprehension. Still, perfect matching of the two samples on L1 skills turned out to be impossible, so we used individual children's scores in L1 word comprehension as a covariate in our analyses. The results reveal a new and intricate picture of the role that proficiency in both L1 and L2 plays in ToM in bilinguals.
Results summary
For monolinguals (tested here as a reference group), we replicated the results of the previous studies: age and language proficiency matter for ToM, and these two variables override the effects of SES on ToM. When we compared bilinguals’ and monolinguals’ ToM abilities using standard frequentist analysis and Bayesian inference, we found no differences in three of the four indices of ToM: the overall accuracy index and the first- and second-order false-beliefs indices. In other words, we revealed no bilingual advantage for the standard measures of ToM. As such, our results are in line with those of Han and Lee (Reference Han and Lee2013), Kyuchukov and de Villiers (Reference Kyuchukov and de Villiers2009), Pearson (Reference Pearson2013, study 4), Gordon (Reference Gordon2016), and Dahlgren et al. (Reference Dahlgren, Almén and Dahlgren Sandberg2017), all of whom found no differences between monolinguals and bilinguals in various ToM tasks. Nevertheless, a complex and informative pattern of interactions was observed in our more nuanced follow-up analyses.
The frequentist analysis showed a significant group difference (medium effect size) for the overall justification index in TRT, which taps into more demanding ToM ability. As a reminder, the “why” question was asked after the child answered the standard test question. The “why” question is considered more demanding, as it requires reasoning about the previous answer and verbalizing the reasoning process. Therefore, based on the frequentist analysis, bilinguals presented justification for their ToM reasoning with more ease than monolinguals; however, the Bayes Factor did not provide enough evidence to support this hypothesis. Importantly, L1 proficiency turned out to be a significant covariate in the analysis and its effect size was large. Thus, any bilingual advantage in ToM reasoning appears to be related to the language abilities of sequential bilinguals in their native language, which was also the language of testing.
Third, we found that the overall accuracy of ToM ability in bilinguals was best predicted by the model in which only age and word comprehension in L1 were significant predictors. This model explained over 34% of variance in the ToM accuracy score (SES, gender and non-verbal IQ were non-significant). Thus, it is clear that in bilinguals (as in monolinguals) age and L1 proficiency are important for ToM performance (accuracy) in standard tasks (see Astington & Baird, Reference Astington and Baird2005; Milligan et al., Reference Milligan, Astington and Dack2007). For monolinguals, not only auditory word comprehension but also IQ was a significant predictor. This indicates that ToM development is associated not only with language proficiency but also with fluid intelligence.
Fourth, and most interestingly, in bilinguals the overall ToM justification score was largely (45% of variance) explained by the base model, i.e., the model with sociodemographic and cognitive variables (age, SES, gender, and non-verbal IQ) and L1 word comprehension, where only age, IQ and L1 word comprehension were significant predictors. However, when L2 proficiency was added to the model, 54% of variance in ToM reasoning was explained, and only three predictors remained significant: age, L1 word comprehension and L2 word comprehension. In other words, for questions that were more cognitively and linguistically demanding (“why”) than the standard ToM question (“where”), proficiency in each of L1 and L2 was a significant predictor, despite the fact that only L1 was overtly used as the language of testing.
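The comparison of the base model with the model extended by L2 proficiency follows the logic of a hierarchical regression, in which the quantity of interest is the increment in explained variance (here, from 45% to 54%, i.e., 9 percentage points). The sketch below illustrates that logic only. It uses synthetic data, and the variable names, sample size and coefficients are hypothetical assumptions rather than values from our dataset or our analysis scripts.

```python
import numpy as np


def r_squared(X, y):
    """Fit ordinary least squares with an intercept and return R^2."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot


rng = np.random.default_rng(0)
n = 100  # hypothetical sample size, not the study's

# Synthetic stand-ins for the predictors (assumed, standardized scale)
age = rng.normal(size=n)
ses = rng.normal(size=n)
gender = rng.integers(0, 2, size=n).astype(float)
iq = rng.normal(size=n)
l1_vocab = rng.normal(size=n)
l2_vocab = rng.normal(size=n)

# Synthetic outcome; the coefficients are arbitrary illustration values
justification = (
    0.5 * age + 0.3 * iq + 0.4 * l1_vocab + 0.3 * l2_vocab
    + rng.normal(size=n)
)

# Step 1: base model (sociodemographic, cognitive and L1 predictors)
base = np.column_stack([age, ses, gender, iq, l1_vocab])
# Step 2: base model plus L2 word comprehension
full = np.column_stack([base, l2_vocab])

r2_base = r_squared(base, justification)
r2_full = r_squared(full, justification)
delta_r2 = r2_full - r2_base  # variance explained uniquely by adding L2

print(f"R2 base = {r2_base:.3f}, R2 full = {r2_full:.3f}, "
      f"delta R2 = {delta_r2:.3f}")
```

Because the two models are nested, the extended model can never explain less variance than the base model; the substantive question is whether the increment is large enough to be meaningful.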
Finally, as for the other investigated factors related to language experience, neither the length of L2 exposure nor cumulative language exposure to L1 and L2 provided any additional predictive value for the outcomes of ToM tasks. This might be a consequence of the fact that both these measures rely on parental reports, which might lack sufficient sensitivity and validity (see Hansen et al., Reference Hansen2019).
Returning to our main hypotheses, based on the current results we do not have sufficient grounds to claim that Polish–English bilinguals aged 4–6 have an overall advantage over their monolingual peers in basic ToM abilities. We also did not observe any clear benefits of greater input in L1 and L2 for ToM abilities. However, our results indicate that in bilinguals, proficiency in both L1 and L2 (as assessed by vocabulary tests) relates to advanced ToM abilities, i.e., the ability to verbally express the reasoning behind ToM judgments. Such a relation was observed even though the ToM task did not require the use of L2.
Theoretical and methodological implications
Our study is one of the first to directly investigate the impact of language abilities and language input on ToM in bilinguals. Although we did not find support for the idea that L2 exposure plays a role in ToM development in bilinguals, our results paint a more nuanced picture of the interaction between ToM abilities and specific language factors than previously reported. These findings are in line with the conclusions formulated by Gordon (Reference Gordon2016) as they highlight that ToM in bilinguals benefits from high proficiency across two languages. However, Gordon observed this relation for standard ToM tasks, whereas our results point to a similar relation in more complex and also second-order belief tasks (see also Buac & Kaushanskaya, Reference Buac and Kaushanskaya2020 for a similar attempt with older children).
We also provide the first evidence that bilingualism could be related to the enhanced ability to reflect on the mental states of others. This was demonstrated by the bilingual advantage in reasoning about assumed thoughts of characters in various stories. Based on our findings, it appears that in four- to six-year-old children, differences between bilinguals and monolinguals may only be apparent when a challenging ToM task is used. In bilinguals, answers to such challenging questions seem to be dependent not only on children's proficiency in the language of testing but also on their proficiency in the other (nontarget) language. As such, our findings indicate that for some aspects of ToM to be enhanced it may not be enough to have only limited L2 experience (as suggested by Fan et al., Reference Fan, Liberman, Keysar and Kinzler2015). At least for more cognitively demanding ToM abilities, the achieved proficiency in both languages may matter substantially. Why does proficiency in both languages of a bilingual impact ToM reasoning, regardless of the language of testing?
The exact mechanism that drives the observed effect remains unclear. It could be that knowledge of more than one language supports or scaffolds the ToM abilities involved in reasoning about mental states (required when answering difficult “why” questions). It is also possible that the benefit is linked to bilinguals’ training in more linguistically demanding situations – for example, switching between their languages when talking with different people (e.g., parents vs. teachers or peers in daycare). This experience of adjusting the language to the interlocutor may enhance children's socio-linguistic abilities, but it could also lead to the training of executive functions, which then reciprocally feeds into the ToM advantage. Additionally, being immersed in a second-language environment may stimulate bilinguals to reflect on language as a tool that people use to communicate. Finally, it could be that language proficiency is just a proxy for the intensity and length of L2 learning and immersion. In fact, the impact of language proficiency could also be mediated by social-pragmatic skills or meta-awareness, both of which go hand in hand with increasing proficiency. Future research should attempt to tackle the issue of the underlying mechanisms (see Yu et al., Reference Yu, Kovelman and Wellman2021 for suggestions of some promising research avenues).
We believe that the finding that L1 and L2 skills relate to advanced ToM in bilinguals opens a new window to investigate the emergence vs. expression hypothesis (Moses, Reference Moses2001). Future research, testing both simultaneous and sequential bilinguals (with different ages of L2 acquisition), could address the critical question beyond the scope of the hypothesis – namely, whether language skill (or input) is fundamental for ToM development (as in the emergence hypothesis) or is needed only for its expression (as in the expression hypothesis). In general, the complexity of the bilingualism phenomenon seems to be a promising window to address this difficult and broad research question. Moreover, combining precise measurements of L2 exposure (accounting for both its quantity and quality) with a longitudinal design could help discover which aspects of language experience are crucial for the enhanced development of the advanced ToM reasoning observed in bilinguals.
We believe that the current results contribute not only to research on bilingualism but also to broader theorizing about the development of Theory of Mind across the lifespan (Warnell & Redcay, Reference Warnell and Redcay2019). Apperly (Reference Apperly2011) proposed a dual-system theory of mindreading abilities, according to which children might be undergoing an important developmental change after the age of 6 years: they gradually begin using high-level mindreading, which is reflected in earlier performance in standard ToM tasks (in addition to the ceiling effects in accuracy). We extend this proposal by suggesting that bilingualism is an example of an experience that supports the transition from low-level (automatic and efficient) mindreading to high-level mindreading (effortful and flexible), as described by Apperly (Reference Apperly2011, Reference Apperly, Devine and Lecce2021). The use of more than one language on a regular basis is typically grounded in complex social interactions that require effortful monitoring of the language being used by a given interlocutor and selection of the right language in response. As such, bilingualism may constitute a natural context for both training the high-level cognitive abilities that underlie mindreading and for learning to make inferences about other people's minds. Our current findings suggest that bilingual children may indeed manifest these high-level abilities earlier in their development. Importantly, to detect these skills we need to use age-sensitive (more challenging) tasks in which children are asked not only to provide the right solution but also to justify this solution; this requires a child to reflect on their own thinking processes and demonstrate their reasoning about a given social situation.
If combined with further qualitative and linguistic analyses of their responses (e.g., the presence of mental terms used by parents – Tompkins et al., Reference Tompkins, Farrar and Montgomery2019; the ability to produce structures containing a complement – Hollebrandse et al., Reference Hollebrandse, van Hout and Hendriks2014), we could gain a better understanding of the possible mechanisms underlying the development of advanced ToM and extend our understanding of the very nature of ToM. As indicated by Apperly (Reference Apperly2011), remarkably little attention has been devoted to children's ability to reflect on the causes, consequences and justifications of people's beliefs – in other words, the study of children's ‘folk epistemology’. We would like to encourage researchers to include justification questions when studying ToM, especially in older children.
Limitations and future studies
It should be noted that our study focused solely on migrant bilinguals. Moreover, our bilingual sample was relatively homogeneous not only in terms of the type of bilingualism (sequential) but also in terms of L1 dominance. All bilingual children in our sample performed ToM tasks in their dominant language (L1), which ensured that the potential effect of poor skills in the language of testing on ToM performance was minimized. However, this sample homogeneity may also mean less generalizability to other types of bilinguals. Our findings support the idea that specific types of bilingual experience likely play a crucial role in the formation of ToM abilities in bilingual children. Given the great heterogeneity of the bilingualism phenomenon, it is critical to investigate ToM across different bilingual communities and populations whose language experiences are varied.
Importantly, our results imply that bilingualism not only compensates for weaker skills in the language of testing that are important for ToM, but also enhances advanced ToM development more than language abilities per se. While selecting the control group of monolingual children, we deliberately selected children who had relatively weak(er) skills in L1 in order to make the group more comparable to bilinguals. It cannot be ruled out, however, that if we had included monolingual children with strong L1 skills, their performance on ToM reasoning would have been better than that of bilinguals.
Another unique aspect of our study is that it focused on bilingual children who were children of migrant families and were tested in L1 but not in L2 environment. The monolingual group was tested in their home country – Poland. Notably, out of the 13 studies presented in our review, only one (Goetz, Reference Goetz2003) compared groups which were settled in different environments. Although we ensured that the two groups did not differ in SES, it is currently unknown how the difference in the testing environment (L1 vs. L2) may have impacted the pattern of results.
Finally, it should be noted that although we made an attempt to account for individual differences in non-linguistic abilities (by including participants’ fluid intelligence scores in the models), our analyses did not include additional predictors related to executive control or working memory. As indicated in the Introduction, some accounts of ToM development suggest a crucial, possibly mediating role of EF in ToM development, especially in bilinguals. Therefore, future research should include not only language-related predictors of ToM, but also EF and working memory.
Conclusions
Our study highlights the role of language skills in ToM development. Although bilinguals did not differ from monolinguals in response accuracy in ToM tasks, they demonstrated better reasoning abilities when providing justification for their ToM responses. Moreover, while the ToM accuracy scores were best predicted by L1 proficiency, the justification scores were best predicted by both L1 and L2 proficiency, even though only L1 was needed to perform the task. Overall, the results paint a more nuanced picture of the impact of bilingualism on ToM development. Learning two languages, even sequentially, likely provides fertile ground for the development of more advanced ToM in children aged 4–6 and for making inferences about the mental states of others.
Supplementary Material
The supplementary material for this article can be found at http://doi.org/10.1017/S1366728923000652.
Data Availability Statement
The data that support the findings of this study are openly available in the Open Science Framework repository (OSF) at https://osf.io/pzcxt/
Acknowledgements
We express our gratitude to all the children and parents who participated in the study, as well as to all the Bi-SLIPL team members who contributed to collecting and coding the data. Special thanks to Alba Casado for her work on the first draft of Table 1, and to Agnieszka Dynak, Joanna Kolak-Rodis, Magdalena Łuniewska-Etenkowska and Jakub Szewczyk for creating a set of indices out of the parental questionnaire (Kwestionariusz Rozwoju Językowego). We also thank Mike Timberlake for proofreading the text.
Funding
The research presented in this paper was conducted within the Bi-SLI-Poland project “Cognitive and language development of Polish bilingual children at school entrance age – risks and opportunities”. The project was carried out at the Faculty of Psychology, University of Warsaw, Poland in collaboration with the Institute of Psychology, Jagiellonian University, Poland. The project was supported by the Polish Ministry of Science and Higher Education / National Science Centre (Decision 809/N-COST/2010/0). Data collection, data coding and maintenance were also partly supported by a Foundation for Polish Science subsidy to Zofia Wodniecka. The project is linked to the European COST Action IS0804.
Competing interest
The authors declare none.