“It is uncontroversial that both languages of a bilingual are jointly activated during all linguistic processing, even in strongly monolingual contexts in which the nontarget language would be considered inappropriate” (Bialystok, 2010, p. 562). Indeed, many studies have found evidence of cross-language influences on processing, such as the activation or the inhibition of phonologically related word-forms (e.g., De Groot, Delmaar & Lupker, 2000; Duyck, 2005; Lemhöfer & Dijkstra, 2004). Some studies have defined phonologically similar word-forms as cognates (words in two languages with a common etymological origin, resulting in similar phonological characteristics and the same meaning, such as map (English) and mapa (Spanish); e.g., Costa, Caramazza & Sebastián-Gallés, 2000), or as homophonous in some way (e.g., homographs, homophones, pseudo-homophones: two words that look or sound the same but have different meanings, such as see/sea (English) and sí (Spanish)).
Rather than use words like cognates or homophonous word-forms (which are relatively rare in the languages of the world and therefore limit the generalizability of studies that employ them), the present corpus analysis measured phonological neighborhood density in an English and a Spanish lexicon to assess the extent to which words in one language are phonologically similar to words in the other language. Phonological neighborhood density refers to the number of words that sound similar to a given word, and has been shown to influence a variety of language-related processes including word learning (Storkel, Armbruster & Hogan, 2006), word recognition in English (Luce & Pisoni, 1998), word recognition in Spanish (Vitevitch & Rodríguez, 2005), word production in English (Vitevitch, 2002), word production in Spanish (Vitevitch & Stamer, 2006, 2009), and serial recall of words (Roodenrys et al., 2002).
If many word-forms in one language are phonologically similar to many word-forms in another language, then it is reasonable to assume that there might be a large amount of cross-language activation or inhibition among word-forms. Large amounts of activation or inhibition from another language might indeed present the lexical processing system of the bilingual with a difficult computational problem, one which requires additional mechanisms and processes to retrieve the correct word-form from the “correct” language (e.g., in production: Green, 1998; in perception: Dijkstra & van Heuven, 2002; among others).
Conversely, if few word-forms in one language are phonologically similar to word-forms in another language, then the lexical processing system in the bilingual individual might not be as challenged as previous studies imply. This is not to say that cross-language influences on processing do not exist; a large number of studies have demonstrated such influences with a variety of methodologies and languages. Rather, such a finding might undermine the need for some of the cognitive mechanisms (e.g., Bialystok, 2010) or representational schemes (e.g., Green, 1998) that have been proposed in the bilingual individual, compelling researchers to consider and explore alternative explanations for the observations made to date (e.g., Meara, 2006).
Methods
The English corpus used in the present analysis contained the 19,340 words from the Merriam–Webster Pocket Dictionary (1964; see Nusbaum, Pisoni & Davis, 1984; Storkel & Hoover, 2010; Vitevitch & Luce, 2004, for additional information about this corpus), and the Spanish corpus consisted of a randomly sampled set of 19,340 words from the LEXESP database (Sebastián-Gallés et al., 2000). Both lexicons have been used in numerous psycholinguistic studies and corpus analyses (e.g., Arbesman, Strogatz & Vitevitch, 2010; Sandoval et al., 2010). Using equal numbers of words in the two lexica facilitated comparison within and between the two languages. Importantly, the words in each corpus occurred in their respective languages with approximately equal token frequency (Spanish mean frequency = 45.92 occurrences per million (sd = 1779.14); English mean frequency = 40.76 occurrences per million (sd = 724.16); t(38678) = .37, p = .71), further attesting to the comparability of the two corpora.
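The reported frequency comparison can be checked from the summary statistics alone. The sketch below assumes a pooled-variance independent-samples t-test (which is what the reported degrees of freedom, 19,340 + 19,340 − 2 = 38,678, suggest) and uses a normal approximation for the two-tailed p-value, a reasonable shortcut at this sample size; the function name `pooled_t` is introduced here for illustration only.

```python
import math

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Pooled-variance t statistic computed from summary statistics."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (mean1 - mean2) / se
    # Assumption: with df this large, the t distribution is treated as
    # standard normal when computing the two-tailed p-value.
    p = math.erfc(abs(t) / math.sqrt(2))
    return t, df, p

# Spanish vs. English token frequencies (occurrences per million).
t, df, p = pooled_t(45.92, 1779.14, 19340, 40.76, 724.16, 19340)
print(f"t({df}) = {t:.2f}, p = {p:.2f}")
```

Under these assumptions the computation agrees with the values given in the text to two decimal places.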
Words were considered phonological neighbors of each other if they differed by the addition, deletion, or substitution of a single phoneme (Greenberg & Jenkins, 1967; Landauer & Streeter, 1973; Luce & Pisoni, 1998; see also Levenshtein, 1966; Vitevitch, 2008). For example, the English words key and bee are neighbors because they differ by the substitution of one phoneme, /b/ for /k/. The English word key and the Spanish word sí are also neighbors because they differ by the substitution of one phoneme, /s/ for /k/. This method of operationally defining phonological similarity has psychological validity (e.g., Cutler et al., 2000; Luce & Large, 2001), and has been shown to produce results qualitatively similar to other operational definitions of phonological similarity (Luce & Pisoni, 1998).
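The one-phoneme rule can be illustrated with a short sketch. It assumes that each word is stored as a tuple of phoneme symbols; the function name `is_neighbor` and the simplified transcriptions are illustrative choices, not the encoding used for the corpora.

```python
from typing import Tuple

Word = Tuple[str, ...]  # a word as a sequence of phoneme symbols

def is_neighbor(a: Word, b: Word) -> bool:
    """True if a and b differ by exactly one phoneme substitution,
    addition, or deletion."""
    if len(a) == len(b):
        # Substitution: exactly one position differs.
        return sum(x != y for x, y in zip(a, b)) == 1
    if abs(len(a) - len(b)) != 1:
        return False
    # Addition/deletion: dropping one phoneme from the longer word
    # must yield the shorter word.
    longer, shorter = (a, b) if len(a) > len(b) else (b, a)
    return any(longer[:i] + longer[i + 1:] == shorter
               for i in range(len(longer)))

# Illustrative transcriptions of the examples in the text.
key = ("k", "i")   # English "key"
bee = ("b", "i")   # English "bee"
si = ("s", "i")    # Spanish "sí"

print(is_neighbor(key, bee))  # within-language neighbors
print(is_neighbor(key, si))   # cross-language neighbors
```

A word is never counted as its own neighbor under this definition, because identical sequences differ at zero positions.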
The same phonological transcription was used to represent phonemes that were common to both languages. Ignoring the well-known phonetic differences in the way certain phonemes are realized in each language (e.g., differences in voice-onset time) actually biases the present analysis to identify more words as phonological neighbors than a real speaker of the two languages might identify (see Ju & Luce, 2004, for evidence that listeners use these fine-grained phonetic differences to activate word-forms only in the appropriate language during word recognition). As will be seen below, however, this bias toward overcounting cross-language neighbors makes the results of the present analysis perhaps even more surprising.
Results
Phonological neighbors within each language
Looking at the phonological neighbors within each language, 47% of the words in the English lexicon had one or more English words as a phonological neighbor, whereas 27% of the words in the Spanish lexicon had one or more Spanish words as a phonological neighbor (see Footnote 1). The proportion of words with phonological neighbors in this sample of Spanish words is comparable to the value obtained by Arbesman et al. (2010) for the full LEXESP lexicon, suggesting that the random sample that was selected for the present analysis is representative of the larger population of Spanish words.
Phonological neighbors between each language
Looking at the phonological neighbors between the two languages, only 4% of the 19,340 English words had one or more Spanish words as phonological neighbors (about 774 words). For the English words with Spanish neighbors, the increase in the size of the phonological neighborhood was only 1.55 neighbors (mean value; see Footnote 2).
In the case of Spanish, only 2% of the 19,340 Spanish words had one or more English words as phonological neighbors (about 387 words). For the Spanish words with English neighbors, the increase in the size of the phonological neighborhood was 3.58 neighbors (mean value).
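The two quantities reported for each language (the share of words with at least one cross-language neighbor, and the mean number of such neighbors among the words that have any) can be sketched as follows. The toy lexicons and the function name `cross_language_summary` are invented for illustration, and the brute-force pairwise comparison is a simplification that would be slow on lexicons of 19,340 words.

```python
from typing import List, Tuple

Word = Tuple[str, ...]

def is_neighbor(a: Word, b: Word) -> bool:
    """One-phoneme substitution, addition, or deletion."""
    if len(a) == len(b):
        return sum(x != y for x, y in zip(a, b)) == 1
    if abs(len(a) - len(b)) != 1:
        return False
    longer, shorter = (a, b) if len(a) > len(b) else (b, a)
    return any(longer[:i] + longer[i + 1:] == shorter
               for i in range(len(longer)))

def cross_language_summary(target: List[Word], other: List[Word]):
    """Return (share of target words with >= 1 neighbor in `other`,
    mean number of such neighbors among those words)."""
    counts = [sum(is_neighbor(w, v) for v in other) for w in target]
    with_any = [c for c in counts if c > 0]
    share = len(with_any) / len(target)
    mean_increase = sum(with_any) / len(with_any) if with_any else 0.0
    return share, mean_increase

# Hypothetical toy lexicons (not drawn from the corpora).
english = [("k", "i"), ("b", "i"), ("m", "æ", "p"), ("d", "ɔ", "g")]
spanish = [("s", "i"), ("m", "a", "p", "a"), ("p", "e", "r", "o")]

print(cross_language_summary(english, spanish))
print(cross_language_summary(spanish, english))
```

Even in this toy case the two directions need not agree: a single word in one language can be a neighbor of several words in the other, so the share and the mean increase are computed separately for each language.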
The proportion of foreign and domestic neighbors
Another way to look at phonological neighbors between the two languages is to consider the total number of neighbors that each word has (both “foreign” and “domestic” neighbors) and assess the proportion of neighbors from each language; see Figure 1. When we examine the English words (the top panel of Figure 1), we find that 9,120 words had a neighbor of some sort, either “foreign” or “domestic” (note that a word must have at least one neighbor in this analysis, because division by 0 is undefined). For those 9,120 words with one or more neighbors, on average 98% of the words in the neighborhood of each word were English words, and only 2% of the words in the neighborhood were Spanish words.
For the Spanish words (the bottom panel of Figure 1), 5,197 words had at least one neighbor of some sort. For those 5,197 words with at least one neighbor, on average 96% of the words in the neighborhood of each word were Spanish words, and only 4% of the words in the neighborhood of each word were English words. These results further suggest that the number of words in one language that are phonologically similar to a word in another language is quite small.
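The proportion analysis can be sketched in the same style. For each word, the share of “domestic” neighbors is computed over its combined neighborhood, and words with no neighbors at all are skipped because the proportion would require division by 0. The toy lexicons and the name `mean_domestic_share` are illustrative assumptions, not the materials or code of the analysis.

```python
from typing import List, Optional, Tuple

Word = Tuple[str, ...]

def is_neighbor(a: Word, b: Word) -> bool:
    """One-phoneme substitution, addition, or deletion."""
    if len(a) == len(b):
        return sum(x != y for x, y in zip(a, b)) == 1
    if abs(len(a) - len(b)) != 1:
        return False
    longer, shorter = (a, b) if len(a) > len(b) else (b, a)
    return any(longer[:i] + longer[i + 1:] == shorter
               for i in range(len(longer)))

def mean_domestic_share(target: List[Word],
                        other: List[Word]) -> Tuple[int, Optional[float]]:
    """Return (number of target words with >= 1 neighbor of any kind,
    mean proportion of their neighbors that are domestic)."""
    shares = []
    for w in target:
        domestic = sum(is_neighbor(w, v) for v in target)
        foreign = sum(is_neighbor(w, v) for v in other)
        total = domestic + foreign
        if total == 0:
            continue  # proportion undefined for words with no neighbors
        shares.append(domestic / total)
    mean = sum(shares) / len(shares) if shares else None
    return len(shares), mean

# Hypothetical toy lexicons (not drawn from the corpora).
english = [("k", "i"), ("b", "i"), ("m", "æ", "p"), ("d", "ɔ", "g")]
spanish = [("s", "i"), ("m", "a", "p", "a"), ("p", "e", "r", "o")]

print(mean_domestic_share(english, spanish))
print(mean_domestic_share(spanish, english))
```

The foreign share for each word is simply one minus its domestic share, so the two averages reported in the text always sum to 100%.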
Replication with another corpus of Spanish words
To verify that these observations were not spurious results due to the specific words in this sample of Spanish words or to the source of Spanish words that was used, these analyses were repeated with another Spanish lexicon (a random sample of 19,340 words from the 86,061 Spanish words obtained from ftp.ox.ac.uk\pub\wordlists\), and the results were comparable. In this analysis, 32% of the words in the Spanish lexicon had one or more Spanish words as a phonological neighbor. Again, however, the proportion of neighbors from the other language was quite small: only 5.8% of the English words had one or more Spanish words as phonological neighbors, and only 2.5% of the Spanish words had one or more English words as phonological neighbors.
In addition, 6,179 Spanish words had at least one neighbor of some sort. For those 6,179 words with at least one neighbor, on average 95% of the words in the neighborhood of each word were Spanish words, and only 5% of the words in the neighborhood of each word were English words. These results obtained from a different database of Spanish words replicate the finding that there are few word-forms in one language that are phonologically similar to word-forms in another language.
Replication taking perceptual assimilation into account
The analyses performed thus far, however, assume ‘perfect’ perception of the sounds that comprise the English and Spanish words. It is well known that phonemic contrasts that exist in a second language are difficult to perceive if they do not exist in the native language. A classic example is the difficulty that native speakers of Japanese have in distinguishing the /r/–/l/ contrast in English, because no such contrast exists in Japanese (e.g., MacKain, Best & Strange, 1981). Similar difficulties are faced by native speakers of Spanish, a language with five vowels (/a e i o u/), when learning English, a language with about 20 vowels including diphthongs (e.g., /ɑ ɔ e ɛ ɪ i ʊ u o æ ʌ/). To examine the impact that perceptual assimilation of English vowels onto Spanish vowel categories might have on phonological similarity, the same analyses were performed with the vowels in the English words replaced by the Spanish vowels they are most often perceived as (from García Lecumberri & Cenoz Iragui, 1997: /i/ remained /i/; /e/ and /ɪ/ became /e/; /æ ʌ ɑ ɛ/ became /a/; /ɔ ʊ o/ became /o/; /u/ remained /u/).
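The vowel replacement amounts to a lookup table applied to each phoneme of an English word. The sketch below encodes the mapping exactly as listed in the text; the tuple-of-symbols representation, the names `ASSIMILATION` and `assimilate`, and the example transcriptions are assumptions made for illustration.

```python
from typing import Tuple

Word = Tuple[str, ...]

# English vowel -> Spanish vowel it is most often perceived as,
# following the mapping listed in the text.
ASSIMILATION = {
    "i": "i",
    "e": "e", "ɪ": "e",
    "æ": "a", "ʌ": "a", "ɑ": "a", "ɛ": "a",
    "ɔ": "o", "ʊ": "o", "o": "o",
    "u": "u",
}

def assimilate(word: Word) -> Word:
    """Replace each English vowel with its Spanish category;
    consonants (symbols not in the table) pass through unchanged."""
    return tuple(ASSIMILATION.get(p, p) for p in word)

# Illustrative transcriptions, not taken from the corpus files.
print(assimilate(("m", "æ", "p")))  # English "map"
print(assimilate(("b", "ɪ", "t")))  # English "bit"
```

After this step the English "map" becomes /map/, which differs from the Spanish "mapa" by a single added phoneme, so a pair that was not a neighbor pair under the original transcription becomes one. This is the kind of change that would be expected to raise the cross-language counts.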
The number of English words (with Spanish vowels) that had one or more Spanish words as neighbors increased from 4% in the initial corpus (and 5.8% in the replication with a different Spanish corpus) to 12.8%. Note, however, that the proportion of English words (with Spanish vowels) with one or more Spanish neighbors is still significantly less than the proportion of Spanish words with one or more Spanish words as neighbors (12.8% versus 27%; χ² (df = 1) = 5.07, p < .05), again suggesting that there are more phonologically similar words within a language than between languages.
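The text does not state how the χ² value was computed. One computation that reproduces the reported figure treats the two percentages as observed values in a one-degree-of-freedom goodness-of-fit test against equal expected values. This is an inference from the arithmetic, offered as a sketch, and not a description of the procedure actually used; the function name is hypothetical.

```python
def goodness_of_fit_chi2(observed):
    """Chi-square goodness-of-fit statistic against equal expected values."""
    expected = sum(observed) / len(observed)
    return sum((o - expected) ** 2 / expected for o in observed)

# Assumption: the two percentages are entered directly as observed values.
chi2 = goodness_of_fit_chi2([12.8, 27.0])
print(f"chi-square (df = 1) = {chi2:.2f}")

# 3.841 is the standard critical value for df = 1 at alpha = .05.
print(chi2 > 3.841)
```

Because this computation operates on percentages, the resulting statistic does not reflect the 19,340 words in each lexicon; a test run on the raw counts would yield a different value.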
Discussion
The results of the present corpus analysis show, in several ways, that words in a foreign language do not “invade” the lexical neighborhoods of another language. That is, for the two languages examined here, there are few words in one language that are phonologically similar to words in the other language. This simple observation raises a number of important and fundamental questions about lexical retrieval and language processing in the bilingual.
First, the minimal amount of phonological overlap between the two languages essentially creates two separate – or perhaps, easily separable – lexica. (Note that other low-level phonological information might further contribute to the separation of languages; see, e.g., Ju & Luce, 2004.) The de facto separation between languages based on their phonological characteristics raises a question about the need for explicit representational schemes, such as language tags (Green, 1998) or language nodes (Dijkstra & van Heuven, 1998), or other cognitive mechanisms (e.g., Bialystok, 2010) designed to keep the word-forms of one language separate from the word-forms of another language. If one considers the small number of words that might benefit from such measures, these approaches to language processing seem cognitively and computationally expensive (and seem increasingly expensive for the individual who knows a third, or fourth, etc. language).
If we consider the possibility that there is no cognitive mechanism or process that keeps the two languages separate in the lexicon of the bilingual, what could keep the two languages separate? There are, of course, a variety of phonological attributes that are used to characterize the languages of the world (e.g., phoneme inventory, phonotactic constraints, typical word-length, canonical syllable structure, etc.). The way in which these phonological characteristics uniquely combine in each language might be sufficient to keep the word-forms of each language essentially separate from each other without requiring an additional or explicit partitioning mechanism or process in the lexicon. Although explicit partitioning mechanisms and processes may appear to be superfluous in the mental lexicon, the possibility remains that such mechanisms or representational schemes might be useful at other levels of language processing (e.g., syntax, semantics, etc.) or for certain language processes (e.g., word learning).
It must be acknowledged that the methodology employed in the present study – corpus analysis – limits what can be said directly about lexical processing. Even though there appear to be very few words in one language that are similar to words in another language, a single word from one language may be all that is needed to significantly affect the speed and accuracy with which lexical processing occurs in the other language. Indeed, cross-language influences on processing, such as the activation or the inhibition of related word-forms, have been demonstrated (e.g., in production: Marian & Blumenfeld, 2006). Other evidence, however, suggests that word recognition in a second language is primarily determined by within-language rather than cross-language factors (e.g., Lemhöfer et al., 2008). Thus, the processes described in current models of spoken-word recognition might be sufficient to retrieve the correct word-form from the “correct” language; no additional processing mechanisms may be required.
In all current models of spoken-word recognition (cohort theory: Gaskell & Marslen-Wilson, 1997; TRACE: McClelland & Elman, 1986; Shortlist: Norris, 1994; Neighborhood Activation Model: Luce & Pisoni, 1998), several phonologically similar word-forms compete with each other during the process of spoken-word recognition. It is reasonable to postulate that the same mechanism used to deal with the competition that exists among phonologically similar words within a given language is sufficient to deal with the additional competition that might arise from phonologically similar words in another language. With a relatively efficient mechanism already in place to deal with the competition among phonologically similar words within a given language, there appears to be no need to supplement that process with an additional mechanism to deal with the small number of competitors that might cross the “lexical borders” from another language (see Footnote 3).
Although the observations made on the basis of the present corpus analysis are small in number, the implications of these observations are far-reaching, and may compel some researchers of bilingual (and monolingual) language processing to explore alternative accounts of lexical retrieval (e.g., Meara, 2006). Furthermore, the observations made in the present corpus analysis raise a number of additional questions for future research. Perhaps the small amount of phonological overlap observed in the present analysis was due to the languages that were examined, English and Spanish. Although English and Spanish are both Indo-European languages, English is from the Germanic branch, whereas Spanish comes from the Romance branch. Perhaps if two Romance or two Germanic languages were considered, a larger amount of phonological overlap might be observed (see Footnote 4). This leads to an additional testable hypothesis: the amount of cross-language influence observed in lexical processing might be related to the amount of phonological overlap that exists between the two languages (Brauer (1998) and Dijkstra et al. (2010) indeed found processing differences as a function of language similarity, whereas Costa, Santesteban & Ivanova (2006) failed to find such differences). Similarly, the cognitive advantages often associated with being bilingual (e.g., Bialystok, 2010) might be dependent on the two languages that one knows: an individual who knows two languages with a large amount of phonological overlap may possess a stronger executive control system than an individual who knows two languages with a small amount of phonological overlap.
The apparent asymmetry in the extent to which words of one language “invade” the lexical neighborhoods of the other language also warrants additional investigation: English words showed an increase in neighborhood size of only 1.55 Spanish neighbors, but Spanish words showed an increase in neighborhood size of 3.58 English neighbors. This asymmetry suggests that something other than proficiency in the languages may affect cross-language influences in lexical processing (to the extent that they exist): in concurrent bilinguals with equal levels of proficiency there may be a greater influence of English words on Spanish processing than of Spanish words on English processing.
On a methodological note, the corpus analysis employed in the present study might also prove to be a useful approach in other areas of language research. For example, historical or comparative linguists could use a technique similar to the one employed in the present study to measure the occurrence of phonological overlap between two languages to serve as a baseline for the rate of occurrence of cognates, etc., or to assess the likelihood that one language branched off from another language. Intriguing research questions and novel methodological approaches such as these might not have been posed in the absence of the present observations.