1. Introduction
A popular research direction in multilingual Natural Language Processing (NLP) consists in learning mappings between two or more monolingual word embedding spaces. These mappings, together with the initial monolingual spaces, define a multilingual word embedding space in which words from different languages with a similar meaning are represented as similar vectors. Such multilingual embeddings do not only play a central role in multilingual NLP tasks but they also provide a natural tool for transferring models that were trained on resource-rich languages (typically English) to other languages, where the availability of annotated data may be more limited.
State-of-the-art models for aligning monolingual word embeddings currently rely on learning an orthogonal mapping from the monolingual embedding of a source language into the embedding of a target language. Somewhat surprisingly, perhaps, this restriction to orthogonal mappings, as opposed to arbitrary linear or even non-linear mappings, has proven crucial for obtaining optimal results. The advantages of using orthogonal transformations are twofold. First, because they are more constrained than arbitrary linear transformations, they can be learned from noisy data in a more robust way. This plays a particularly important role in settings where alignments between monolingual spaces have to be learned from small and/or noisy dictionaries (Artetxe, Labaka, and Agirre 2017), including dictionaries that have been heuristically induced in a purely unsupervised way (Conneau et al. 2018a; Artetxe, Labaka, and Agirre 2018b). Second, orthogonal transformations preserve the distances between the word vectors, which means that the internal structure of the monolingual spaces is not affected by the alignment. Approaches that rely on orthogonal transformations thus have to assume that the word embedding spaces for different languages are approximately isometric (Barone 2016). However, it has been argued that this assumption is not always satisfied (Kementchedjhieva et al. 2018; Søgaard, Ruder, and Vulić 2018; Patra et al. 2019). Moreover, rather than treating the monolingual embeddings as fixed elements, we may intuitively expect that embeddings from different languages may actually be used to improve each other.
This idea was exploited by Faruqui and Dyer (2014), who learn linear mappings from two monolingual spaces onto a new, shared, multilingual space. They found that the resulting changes to the internal structure of the monolingual spaces can indeed bring benefits. In multilingual evaluation tasks, however, their method is outperformed by approaches that rely on orthogonal transformations (Artetxe, Labaka, and Agirre 2016).
While the emphasis has shifted from static word vectors to contextualised language models in recent years, static vectors remain an important object of study. On the one hand, static vectors are still needed in applications where the computational demands of contextualised language models are prohibitive, or where word meaning needs to be captured in the absence of context (e.g., ontology alignment). On the other hand, static vectors can also provide useful prior knowledge when training contextualised models such as mBERT (Devlin et al. 2019). In particular, Artetxe, Ruder, and Yogatama (2020) show how static cross-lingual embeddings can be exploited for zero-shot multilingual transfer of contextualised models.
In this article, we propose a simple method that combines the advantages of orthogonal transformations with the potential benefit of allowing monolingual spaces to affect each other’s internal structure. Specifically, we first align the given monolingual spaces by learning an orthogonal transformation using an existing state-of-the-art method. Subsequently, we aim to reduce any remaining discrepancies by trying to find the middle ground between the aligned monolingual spaces. More precisely, let (w,v) be an entry from a bilingual dictionary (i.e., v is the translation of w), and let $\mathbf{w}$ and $\mathbf{v}$ be the vector representations of w and v in the aligned monolingual spaces. Our aim is to learn linear mappings $\mathbf{M_s}$ and $\mathbf{M_t}$ such that $\mathbf{w}\mathbf{M_s} \approx \mathbf{v}\mathbf{M_t} \approx \frac{\mathbf{v}+\mathbf{w}}{2}$ , for each entry (w,v) from a given dictionary. Crucially, because we start from monolingual spaces that are already aligned, applying the mappings $\mathbf{M_s}$ and $\mathbf{M_t}$ can be thought of as a fine-tuning step. We will refer to this proposed fine-tuning step as Meemi (Meeting in the middle). Our experimental analysis reveals that this combination of an orthogonal transformation followed by a simple non-orthogonal fine-tuning step consistently, and often substantially, outperforms existing approaches in cross-lingual evaluation tasks. We also find that the proposed transformation leads to improvements in the monolingual spaces, which, as already mentioned, is not possible with orthogonal transformations. This article extends our earlier work in Doval et al. (2018) in the following ways:
(1) We introduce a more general formulation of Meemi, in which the averages that are used to compute the linear transformations can be weighted (e.g., by word frequencies, as we explore in this article).
(2) We generalise the approach to an arbitrary number of languages, thus allowing us to learn truly multilingual vector spaces.
(3) We more thoroughly compare the obtained multilingual models, extending the number of baselines and evaluation tasks. We now also include a more extensive analysis of the results, for example, studying the impact of the size of the bilingual dictionaries in more detail.
(4) In the evaluation, we now include two distant languages which do not use the Latin alphabet: Farsi and Russian. This will further support the generalisation of our conclusions.
2. Background: Cross-lingual alignment methods
In this article we analyse cross-lingual word embedding models that are based on aligning monolingual vector spaces. The overall process underpinning these methods is as follows. Given two monolingual corpora, a word vector space is first learned independently for each language. This can be achieved with standard word embedding models such as Word2vec (Mikolov et al. 2013a), GloVe (Pennington, Socher, and Manning 2014) or FastText (Bojanowski et al. 2017). Second, a linear alignment strategy is used to map the monolingual embeddings to a common bilingual vector space. It is worth mentioning that we do not require parallel or comparable corpora to build our multilingual models as in the case of Zennaki, Semmar, and Besacier (2019) or Vulić and Moens (2016).
These linear transformations are learned from a supervision signal in the form of a bilingual dictionary (although some methods can also deal with dictionaries that are automatically generated as part of the alignment process; see below). This approach was popularised by Mikolov, Le, and Sutskever (2013b). Specifically, they proposed to learn a matrix $\mathbf{W}$ which minimises the following objective:
$$E = \sum_{i=1}^{n} \left\| \mathbf{x_i}\mathbf{W} - \mathbf{z_i} \right\|^2 \qquad (1)$$
where we write $\mathbf{x_i}$ for the vector representation of some word $x_i$ in the source language and $\mathbf{z_i}$ is the vector representation of the translation $z_i$ of $x_i$ in the target language. This optimisation problem corresponds to a standard least-squares regression problem, whose exact solution can be efficiently computed (although Mikolov et al. 2013b do not use this method). Note that this approach relies on a bilingual dictionary containing the training pairs $(x_1,z_1),...,(x_n,z_n)$ . However, once the matrix $\mathbf{W}$ has been learned, for any word w in the source language with vector representation $\mathbf{x}$ , we can use $\mathbf{x}\mathbf{W}$ as a prediction of the vector representation of the translation of w. In particular, to predict which word in the target language is the most likely translation of the word w from the source language, we can then simply take the word z whose vector $\mathbf{z}$ is closest to the prediction $\mathbf{x}\mathbf{W}$ .
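The least-squares mapping and the nearest-neighbour translation step described above can be sketched as follows. This is only an illustration on synthetic data: the function names `learn_linear_map` and `translate` are hypothetical, and cosine similarity is assumed as the notion of "closest".

```python
import numpy as np

def learn_linear_map(X, Z):
    """Return W minimising sum_i ||x_i W - z_i||^2.

    X and Z are (n, d) matrices whose i-th rows are the vectors of the
    i-th dictionary pair (source word, target translation).
    """
    W, _, _, _ = np.linalg.lstsq(X, Z, rcond=None)
    return W

def translate(x, W, target_vectors, target_words):
    """Return the target word whose vector is closest to x W.

    Assumption: closeness is measured with cosine similarity.
    """
    pred = x @ W
    sims = (target_vectors @ pred) / (
        np.linalg.norm(target_vectors, axis=1) * np.linalg.norm(pred)
    )
    return target_words[int(np.argmax(sims))]

# Synthetic example: the target space is an exact linear image of the source.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
W_true = rng.normal(size=(4, 4))
Z = X @ W_true
target_words = [f"t{i}" for i in range(50)]

W = learn_linear_map(X, Z)
best = translate(X[3], W, Z, target_words)
```

In this toy setting the mapping is recovered exactly; with real embeddings the fit is only approximate, which is why the retrieval step matters.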
The restriction to linear mappings might intuitively seem overly strict. However, it was found that higher-quality alignments can be found by being even more restrictive. In particular, Xing et al. (2015) suggested normalising the word vectors in the monolingual spaces and restricting the matrix $\mathbf{W}$ to be orthogonal (i.e., imposing the constraint that $\mathbf{W}\mathbf{W}^T=\mathbf{1}$ ). Under this restriction, the optimisation problem (1) is known as the orthogonal Procrustes problem, whose exact solution can still be computed efficiently. Another approach was taken by Faruqui and Dyer (2014), who proposed to learn linear transformations $\mathbf{W_s}$ and $\mathbf{W_t}$ , which, respectively, map vectors from the source and target language word embeddings onto a shared vector space. They used Canonical Correlation Analysis to find the transformations $\mathbf{W_s}$ and $\mathbf{W_t}$ which maximise the dimension-wise correlation between $\mathbf{X}\mathbf{W_s}$ and $\mathbf{Z}\mathbf{W_t}$ , where $\mathbf{X}$ is a matrix whose rows are $\mathbf{x_1},...,\mathbf{x_n}$ and similarly $\mathbf{Z}$ is a matrix whose rows are $\mathbf{z_1},...,\mathbf{z_n}$ . Note that while the aim of Xing et al. (2015) is to avoid making changes to the cosine similarities between word vectors from the same language, Faruqui and Dyer (2014) specifically want to take into account information from the other language with the aim of improving the monolingual embeddings themselves. Artetxe et al. (2016) propose a model which combines ideas from Xing et al. (2015) and Faruqui and Dyer (2014). Specifically, they use the formulation in (1) with the constraint that $\mathbf{W}$ be orthogonal, as in Xing et al. (2015), but they also apply a preprocessing strategy called mean centering which is closely related to the model from Faruqui and Dyer (2014). On top of this, in Artetxe, Labaka, and Agirre (2018a) they propose a multi-step framework in which they experiment with several pre-processing and post-processing strategies. These include whitening (which involves applying a linear transformation to the word vectors such that their covariance matrix is the identity matrix), re-weighting each coordinate according to its cross-correlation (which means that the relative importance of those coordinates with the strongest agreement between both languages is increased), de-whitening (i.e., inverting the whitening step to restore the original covariances) and a dimensionality reduction step, which is seen as an extreme form of re-weighting (i.e., those coordinates with the least agreement across both languages are simply dropped). They also consider the possibility of using orthogonal mappings from both embedding spaces into a shared space, rather than mapping one embedding space onto the other, where the objective is based on maximising cross-covariance. This route is also followed by Kementchedjhieva et al. (2018). Other approaches that have been proposed for aligning monolingual word embedding spaces include models which replace (1) with a max-margin objective (Lazaridou, Dinu, and Baroni 2015) and models which rely on neural networks to learn non-linear transformations (Lu et al. 2015).
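As an illustration of the orthogonally constrained variant discussed above, the following sketch uses the standard closed-form solution of the orthogonal Procrustes problem via a singular value decomposition. It is not taken from any of the cited implementations; the function name and the synthetic data are assumptions made for the example.

```python
import numpy as np

def orthogonal_procrustes(X, Z):
    """Return the orthogonal W minimising ||X W - Z||_F.

    Standard closed form: if X^T Z = U S V^T, then W = U V^T.
    """
    U, _, Vt = np.linalg.svd(X.T @ Z)
    return U @ Vt

# Synthetic example: the target space is a rotated copy of the source space.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))  # a random orthogonal matrix
Z = X @ Q

W = orthogonal_procrustes(X, Z)

# Orthogonal maps leave pairwise distances within the source space unchanged.
dist_before = np.linalg.norm(X[0] - X[1])
dist_after = np.linalg.norm(X[0] @ W - X[1] @ W)
```

The last two lines make the key property concrete: the internal structure of the mapped space is untouched, which is exactly what the later fine-tuning step relaxes.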
A central requirement of the aforementioned methods is that they need a sufficiently large bilingual dictionary. Several approaches have been proposed to address this limitation, showing that high-quality results can be obtained in a purely unsupervised way. For instance, Artetxe et al. (2017) propose a method that can work with a small synthetic seed dictionary, for example, only containing pairs of identical numerals (1,1), (2,2), (3,3), etc. To this end, they alternate between using the current dictionary to learn a corresponding orthogonal transformation and using the learned cross-lingual embedding to improve the synthetic dictionary. This improved dictionary is constructed by assuming that the translation of a given word w is the nearest neighbour of $\mathbf{x}\mathbf{W}$ among all words from the target language. This approach was subsequently improved in Artetxe et al. (2018b), where state-of-the-art results were obtained without even assuming the availability of a synthetic seed dictionary. The key idea underlying their approach, called VecMap, is to initialise the seed dictionary in a fully unsupervised way based on the idea that the histogram of similarity scores between a given word w and the other words from the source language should be similar to the histogram of similarity scores between its translation z and the other words from the target language. Another approach which aims to learn bilingual word embeddings in a fully unsupervised way, called MUSE, is proposed in Conneau et al. (2018a). The main difference with VecMap lies in how the initial seed dictionary is learned. For this purpose, MUSE relies on adversarial training (Goodfellow et al. 2014), similarly to earlier models (Barone 2016; Zhang et al. 2017a) but using a simpler formulation, based on the model in (1) with the orthogonality constraint on $\mathbf{W}$ . The main intuition is to choose $\mathbf{W}$ such that it is difficult for a classifier to distinguish between word vectors $\mathbf{z}$ sampled from the target word embedding and vectors $\mathbf{x}\mathbf{W}$ , with $\mathbf{x}$ sampled from the source word embedding. There have been other approaches to create this initial bilingual dictionary without supervision via adversarial training (Zhang et al. 2017b; Hoshen and Wolf 2018; Xu et al. 2018) or stochastic processes (Alvarez-Melis and Jaakkola 2018), but their performance has not generally surpassed existing methods (Artetxe et al. 2018b; Glavaš et al. 2019). For a more comprehensive summary of existing methods, please refer to Ruder, Vulić, and Søgaard (2019).
In this work, we make use of the three mentioned variants of VecMap, namely the supervised implementation based on the multi-step framework from Artetxe et al. (2018a), which will be referred to as VecMap $_{multistep}$ , the orthogonal method (VecMap $_\textrm{ortho}$ ) (Artetxe et al. 2016) and its unsupervised version (VecMap $_\textrm{uns}$ ) (Artetxe et al. 2018b). Similarly, we will consider the supervised and unsupervised variants of MUSE (MUSE and MUSE $_\textrm{uns}$ , respectively) (Conneau et al. 2018a). In the next section, we present our proposed post-processing method based on an unconstrained linear transformation to improve the results of the previous methods.
3. Fine-tuning cross-lingual embeddings by meeting in the middle
After the initial alignment of the monolingual spaces, we propose to apply a post-processing step which aims to bring the two monolingual spaces closer together by lifting the orthogonality constraint. To this end, we learn an unconstrained linear transformation that maps word vectors from one space onto the average of that word vector and the vector representation of its translation (according to a given bilingual dictionary). This approach, which we call Meemi (Meeting in the middle), is illustrated in Figure 1. In particular, the figure illustrates the two-step nature, where we first learn an orthogonal transformation (using VecMap or MUSE), which aligns the two monolingual spaces as much as possible without changing their internal structure. Then, our approach aims to find a middle ground between the two resulting monolingual spaces. This involves applying a non-orthogonal transformation to both monolingual spaces.
By averaging between the representations obtained from different languages, we hypothesise that the impact of language-specific phenomena and corpus-specific biases will be reduced, whereas the core semantic features of each word will become more dominant. However, because we start from aligned spaces, the changes which are made by this transformation are relatively small. Our transformation is thus intuitively fine-tuning the usual orthogonal transformation, rather than replacing it. Note that this approach can naturally be applied to more than two monolingual spaces (Section 3.2). First, however, we will consider the standard bilingual case.
3.1 Bilingual models
Let D be the given bilingual dictionary, encoded as a set of word pairs (w, $w^{\prime}$ ). Using the pairs in D as training data, we learn a linear mapping $\mathbf{X}$ such that $\mathbf{w} \mathbf{X} \approx \frac{\mathbf{w}+\mathbf{w'}}{2}$ for all $(w,w')\in D$ , where we write $\mathbf{w}$ for the vector representation of word w in the given (aligned) monolingual space. This mapping $\mathbf{X}$ can then be used to predict the averages for words outside the given dictionary. To find the mapping $\mathbf{X}$ , we solve the following least squares linear regression problem:
$$E = \sum_{(w,w') \in D} \left\| \mathbf{w}\mathbf{X} - \frac{\mathbf{w}+\mathbf{w'}}{2} \right\|^2 \qquad (2)$$
Similarly, we separately learn a mapping $\mathbf{X'}$ such that $\mathbf{w'} \mathbf{X'} \approx \frac{\mathbf{w}+\mathbf{w'}}{2}$ .
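A minimal sketch of this bilingual fine-tuning step is given below, assuming the two spaces have already been aligned and that the dictionary pairs are stored as row-aligned matrices. The function name `meemi_bilingual` and the synthetic data are hypothetical and are not taken from the authors' released code.

```python
import numpy as np

def meemi_bilingual(S, T):
    """Learn unconstrained maps X, X' onto the average of each pair.

    S, T: (n, d) matrices; row i of S is the vector of w and row i of T
    is the vector of its translation w' (both in the aligned spaces).
    Each map solves an ordinary least-squares problem whose regression
    target is the average (w + w') / 2.
    """
    avg = (S + T) / 2.0
    X_src, _, _, _ = np.linalg.lstsq(S, avg, rcond=None)
    X_tgt, _, _, _ = np.linalg.lstsq(T, avg, rcond=None)
    return X_src, X_tgt

# Synthetic example: the spaces differ by a small non-orthogonal distortion,
# i.e. a discrepancy that no orthogonal map could remove.
rng = np.random.default_rng(2)
S = rng.normal(size=(200, 5))
A = np.eye(5) + 0.1 * rng.normal(size=(5, 5))
T = S @ A

X_src, X_tgt = meemi_bilingual(S, T)
gap_before = np.linalg.norm(S - T)
gap_after = np.linalg.norm(S @ X_src - T @ X_tgt)
```

Once learned, `X_src` and `X_tgt` would be applied to every word vector of the respective language, not only to the dictionary entries.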
It is worth mentioning that we also experimented with non-linear mappings before arriving at the present formulation. However, multilayer perceptrons paired with different regularisation terms to avoid overfitting, such as penalising mappings that deviated excessively from the identity mapping, obtained lower performance figures, which led us to discard this path for the moment.
We also consider a weighted variant of Meemi where the linear model is trained on weighted averages based on word frequency. Specifically, let $f_{w}$ be the occurrence count of word w in the corresponding monolingual corpus, then $\frac{\mathbf{w}+\mathbf{w'}}{2}$ is replaced by
$$\frac{f_{w}\mathbf{w} + f_{w'}\mathbf{w'}}{f_{w} + f_{w'}} \qquad (3)$$
The intuition behind this weighted model is that the word w might be much more prevalent in the first language than the word $w^{\prime}$ is in the second language. A clear example is when $w=w'$ , which may be the case, among others, if w is a named entity. For instance, suppose that w is the name of a Spanish city. Then, we may expect to see more occurrences of w in a Spanish corpus than in an English corpus. In such cases, it may be beneficial to consider the word vector obtained from the Spanish corpus to be of higher quality, and thus give more weight to it in the average.
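The weighted regression targets can be sketched as follows, assuming the weighted average takes the usual frequency-weighted form $(f_{w}\mathbf{w} + f_{w'}\mathbf{w'})/(f_{w}+f_{w'})$. The function name and the toy counts are hypothetical.

```python
import numpy as np

def weighted_targets(S, T, f_src, f_tgt):
    """Frequency-weighted average of each dictionary pair.

    S, T: (n, d) row-aligned vectors of the pairs (w, w').
    f_src, f_tgt: length-n occurrence counts of w and w' in their corpora.
    Assumption: the weights are the raw counts, normalised per pair.
    """
    f_src = np.asarray(f_src, dtype=float)[:, None]
    f_tgt = np.asarray(f_tgt, dtype=float)[:, None]
    return (f_src * S + f_tgt * T) / (f_src + f_tgt)

# Toy example with two dictionary pairs in a 2-dimensional space.
S = np.array([[1.0, 0.0], [0.0, 2.0]])
T = np.array([[0.0, 1.0], [2.0, 0.0]])

# Pair 0: w is three times as frequent as w', so the target leans towards w.
# Pair 1: equal counts, so the target is the plain average.
targets = weighted_targets(S, T, f_src=[300, 50], f_tgt=[100, 50])
```

These targets would simply replace the plain averages as the regression target when fitting the two linear maps.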
We will write Meemi (M) to refer to the model obtained by applying Meemi after the base method M, where M may be any variant of VecMap or MUSE. Similarly, we will write Meemi $_\textrm{w}$ (M) in those cases where the weighted version of Meemi was used.
3.2 Multilingual models
To apply Meemi in a multilingual setting, we exploit the fact that bilingual orthogonal methods such as VecMap (without re-weighting) and MUSE do not modify the target monolingual space but only apply an orthogonal transformation to the source. Hence, by simply applying this method to multiple language pairs while fixing the target language (i.e., for languages $l_{1}, l_{2}, ..., l_{n}$ , we construct pairs of the form $(l_{i}, l_{n})$ with $i \in \{1,...,n-1\}$ ), we can obtain a multilingual space in which all of the corresponding monolingual models are aligned with, or mapped onto, the same target embedding space. Note, however, that if we applied a re-weighting strategy, as suggested in Artetxe et al. (2018a) for VecMap, the target space would no longer remain fixed for all source languages and would instead change depending on the source in each case. While most previous work has been limited to bilingual settings, multilingual models involving more than two languages have already been studied by Ammar et al. (2016), who used an approach based on Canonical Correlation Analysis. As in our approach, they also fix one specific language as the reference language.
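The idea of fixing one target language and aligning every other language to it can be sketched as below. A plain orthogonal Procrustes solver stands in for VecMap or MUSE here, which is a simplification; the function names and the synthetic rotated spaces are assumptions made for the example.

```python
import numpy as np

def orthogonal_procrustes(S, T):
    """Orthogonal W minimising ||S W - T||_F (closed form via SVD)."""
    U, _, Vt = np.linalg.svd(S.T @ T)
    return U @ Vt

def align_to_target(source_dicts):
    """Align each source language to one fixed target language.

    source_dicts: list of (S_i, T_i) pairs, where row j of S_i is a source
    word vector and row j of T_i is the vector of its translation in the
    shared target language. The target space itself is never modified.
    """
    return [orthogonal_procrustes(S, T) for S, T in source_dicts]

def random_rotation(rng, d):
    Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
    return Q

# Synthetic example: two source spaces are rotated copies of the target.
rng = np.random.default_rng(3)
target = rng.normal(size=(100, 4))
sources = [target @ random_rotation(rng, 4) for _ in range(2)]

maps = align_to_target([(S, target) for S in sources])
aligned = [S @ W for S, W in zip(sources, maps)]
```

Because every map has the same target, all aligned source spaces end up in a single shared space without any pairwise alignment between the sources.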
Formally, let D be the given multilingual dictionary, encoded as a set of tuples $(w_1,w_2,...,w_n)$ , where n is the number of languages. Using the tuples in D as training data, we learn a linear mapping $\mathbf{X_i}$ for each language, such that $\mathbf{w_i}\mathbf{X_i} \approx \frac{\mathbf{w_1}+...+\mathbf{w_n}} {n}$ for all $(w_1,...,w_n)\in D$ . This mapping $\mathbf{X_i}$ can then be used to predict the averages for words in the ith language outside the given dictionary. To find the mappings $\mathbf{X_i}$ , we solve the following least squares linear regression problem for each language:
$$E = \sum_{(w_1,...,w_n) \in D} \left\| \mathbf{w_i}\mathbf{X_i} - \frac{\mathbf{w_1}+...+\mathbf{w_n}}{n} \right\|^2 \qquad (4)$$
Note that while a weighted variant of this model can straightforwardly be formulated, we will not consider this in the experiments.
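The multilingual (unweighted) case can be sketched in the same way as the bilingual one, with the average now taken over all n languages. The function name `meemi_multilingual` and the synthetic data are hypothetical.

```python
import numpy as np

def meemi_multilingual(spaces):
    """Learn one unconstrained map X_i per language.

    spaces: list of n matrices of shape (m, d); row j of the i-th matrix
    is the vector of the i-th word in the j-th dictionary tuple, with all
    spaces already aligned to the same target language.
    """
    avg = np.mean(np.stack(spaces), axis=0)  # (m, d) average over languages
    return [np.linalg.lstsq(V, avg, rcond=None)[0] for V in spaces]

# Synthetic example: three spaces that are mild linear distortions of a base.
rng = np.random.default_rng(4)
base = rng.normal(size=(150, 4))
spaces = [base @ (np.eye(4) + 0.1 * rng.normal(size=(4, 4))) for _ in range(3)]

maps = meemi_multilingual(spaces)
mapped = [V @ X for V, X in zip(spaces, maps)]
avg = np.mean(np.stack(spaces), axis=0)
```

Each language gets its own map, but all maps share the same regression target, so the mapped spaces meet at the common average.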
4. Experimental setting
In this section, we explain the common training settings for all experiments. First, the monolingual corpora that were used, as well as other training details that pertain to the initial monolingual embeddings, are discussed in Section 4.1. Then, in Section 4.2, we explain which bilingual and multilingual dictionaries were used as supervision signals. Finally, all compared systems are listed in Section 4.3.
4.1 Corpora and monolingual embeddings
Instead of using comparable corpora such as Wikipedia, as in much of the previous work (Artetxe et al. 2017; Conneau et al. 2018a), we make use of independent corpora extracted from the web. This represents a more realistic setting where alignments are harder to obtain, as already noted by Artetxe et al. (2018b). For English, we use the UMBC WebBase Corpus (Han et al. 2013), containing over 3 billion words. For Spanish, we use the Spanish Billion Words Corpus (Cardellino 2016), consisting of over a billion words. For Italian and German, we use the itWaC and sdeWaC corpora from the WaCky project (Baroni et al. 2009), containing 2 and 0.8 billion words, respectively. For Finnish and Russian, we use their corresponding Common Crawl monolingual corpora from the Machine Translation of News Shared Task 2016, composed of 2.8B and 1.1B words, respectively. Finally, for Farsi we leverage the newswire Hamshahri corpus (AleAhmad et al. 2009), composed of almost 200M words.
In a preprocessing step, all corpora were tokenised using the Stanford tokeniser (Manning et al. 2014) and lowercased. Then we trained FastText word embeddings (Bojanowski et al. 2017) on the preprocessed corpora for each language. The dimensionality of the vectors was set to 300, using the default values for the remaining hyperparameters.
In our experiments, we consider, first, six Indo-European languages, of which Spanish and Italian are Romance, English and German are Germanic, Russian is Slavic and Farsi is Iranian. Second, we also include experiments for Finnish, which is a Uralic Finnic language (Dryer and Haspelmath 2013). Finally, we have also included a set of exclusively distant languages: Arabic and Hebrew, both of them Semitic Afro-Asiatic; Finnic Uralic Estonian, Slavic Indo-European Polish and Sino-Tibetan Chinese. For this latter set of languages, we use the pretrained monolingual embeddings available from the FastText website, obtained from Common Crawl and Wikipedia. Since we could not access the source corpora for these monolingual embeddings, we could not gather frequency information and therefore we only tested the default variant of Meemi (i.e., not weighted). Furthermore, for the multilingual version of Meemi (see Section 3.2), we consider those languages for which we train the corresponding monolingual embeddings: English, Spanish, Italian, German, Russian, Farsi and Finnish.
4.2 Training dictionaries
We use the training dictionaries provided by Conneau et al. (2018a) as supervision. These bilingual dictionaries were compiled using the internal translation tools from Facebook. To make the experiments comparable across languages, we randomly extracted 8000 training pairs for all language pairs considered, as this is the size of the smallest available dictionary. For completeness we also present results for fully unsupervised systems (see the following section), which do not take advantage of any dictionaries.
4.3 Compared systems
We have trained both bilingual and multilingual models involving up to seven languages. In the bilingual case, we consider the supervised and unsupervised variants of VecMap and MUSE to obtain the base alignments and then apply plain Meemi and weighted Meemi on the results. For supervised VecMap we compare with its orthogonal version VecMap $_\textrm{ortho}$ and the multi-step procedure VecMap $_{multistep}$ . For the multilingual case, we follow the procedure described in Section 3.2 making use of all seven languages considered in the evaluation, that is, English, Spanish, Italian, German, Finnish, Farsi and Russian. Note that in the bilingual case all three variants of VecMap can be used, whereas in the multilingual setting we can only use VecMap $_\textrm{ortho}$ .
5. Intrinsic evaluation
In this section, we assess the intrinsic performance of our post-processing techniques in cross-lingual (Section 5.1) and monolingual (Section 5.2) settings.
5.1 Cross-lingual performance
We evaluate the performance of all compared cross-lingual embedding models on standard purely cross-lingual tasks, namely dictionary induction (Section 5.1.1) and cross-lingual word similarity (Section 5.1.2).
5.1.1 Bilingual dictionary induction
Also referred to as word translation, this task consists in automatically retrieving the word translations in a target language for words in a source language. Acting on the corresponding cross-lingual embedding space which integrates the two (or more) languages of a particular test case, we obtain the nearest neighbours to the source word in the target language as our translation candidates. The performance is measured with precision at k ( $P@k$ ), defined as the proportion of test instances where the correct translation candidate for a given source word was among the k highest ranked candidates. The nearest neighbours ranking is obtained by using cosine similarity as the scoring function. For this evaluation, we use the corresponding test dictionaries released by Conneau et al. (2018a).
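The $P@k$ measure with cosine-similarity ranking can be sketched as follows. The function name, the representation of gold translations as sets of target indices, and the toy vectors are assumptions made for illustration and do not reproduce the actual evaluation scripts.

```python
import numpy as np

def precision_at_k(src_vecs, tgt_vecs, gold, k):
    """Proportion of source words with a correct translation in the top k.

    src_vecs: (m, d) vectors of the test source words (in the shared space).
    tgt_vecs: (v, d) vectors of all candidate target words.
    gold: list of m sets; gold[i] holds the indices of the target words
          that are correct translations of source word i.
    """
    S = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    T = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    sims = S @ T.T                            # cosine similarities, (m, v)
    topk = np.argsort(-sims, axis=1)[:, :k]   # k nearest targets per source
    hits = sum(1 for i, row in enumerate(topk) if gold[i] & set(row.tolist()))
    return hits / len(gold)

# Toy example with three target words and three test source words.
tgt_vecs = np.eye(3)
src_vecs = np.array([[1.0, 0.1, 0.0],
                     [0.2, 1.0, 0.0],
                     [0.9, 0.0, 1.0]])
gold = [{0}, {1}, {0}]  # the third source word is ranked second, not first

p_at_1 = precision_at_k(src_vecs, tgt_vecs, gold, k=1)
p_at_2 = precision_at_k(src_vecs, tgt_vecs, gold, k=2)
```

In this toy case two of the three source words have their translation as nearest neighbour, and the third finds it at rank two.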
We show the results attained by a wide array of models in Tables 1 and 2, where we can observe that the best figures are generally obtained by Meemi over the bilingual VecMap models. The impact of Meemi is more apparent when used in combination with the orthogonal base models, with improvements over the multi-step version of VecMap as well in most languages. These improvements are statistically significant at the 0.05 level across all language pairs, using paired t-tests. On the other hand, using the weighted version of Meemi (i.e., Meemi $_\textrm{w}$ in Table 1) does not seem to be particularly beneficial on this task, with the only exception of English-Farsi. In general, the performance of unsupervised models (i.e., VecMap $_\textrm{uns}$ and MUSE $_\textrm{uns}$ ) is competitive in closely-related languages such as English-Spanish or English-German but they considerably under-perform for distant languages, especially English-Finnish and English-Russian. We have double-checked the anomalous results for English-Finnish, and they appear to be correct under our current testing framework after five runs obtaining the same result. Finally, the results obtained by the multilingual model that includes all seven languages considered, that is, Meemi-multi (VecMap $_\textrm{ortho}$ ) in Table 1, improve over the base orthogonal model, but they do not improve over the results of our bilingual model. We further discuss the impact of adding languages to the multilingual model in Section 7.3.
5.1.2 Cross-lingual word similarity
Cross-lingual word similarity constitutes a straightforward benchmark to test the quality of bilingual embeddings. In this case, and in contrast to monolingual similarity, words in a given pair (a,b) belong to different languages, for example, a belonging to English and b to Farsi. For this task we make use of the SemEval-17 multilingual similarity benchmark (Camacho-Collados et al. 2017), considering the four cross-lingual data sets that include English as target language in particular, but discarding multi-word expressions. Also, we use the Multi-SimLex data set published by Vulić et al. (2020) for our experiments on the set of exclusively distant languages: Arabic, Hebrew, Estonian, Polish and Chinese. Performance is computed in terms of Pearson and Spearman correlation with respect to the gold standard.
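The evaluation protocol described above can be sketched as follows: each word pair is scored with the cosine similarity of its two vectors, and the scores are compared with the gold ratings. The helper names are hypothetical, and the Spearman implementation below is a simplification that assumes no tied values.

```python
import numpy as np

def pearson(a, b):
    return float(np.corrcoef(a, b)[0, 1])

def ranks(a):
    """Rank positions of the values in a (assumption: no ties)."""
    order = np.argsort(a)
    r = np.empty(len(a))
    r[order] = np.arange(len(a))
    return r

def spearman(a, b):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))

def similarity_correlations(A, B, gold):
    """Score pairs (row i of A, row i of B) by cosine and compare to gold."""
    cos = np.sum(A * B, axis=1) / (
        np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1)
    )
    gold = np.asarray(gold, dtype=float)
    return pearson(cos, gold), spearman(cos, gold)

# Toy example: four cross-lingual pairs with decreasing gold similarity.
A = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
B = np.array([[1.0, 0.0], [1.0, 0.5], [1.0, 2.0], [0.0, 1.0]])
gold = [4.0, 3.0, 2.0, 1.0]

r_pearson, r_spearman = similarity_correlations(A, B, gold)
```

Here the predicted ordering matches the gold ordering exactly, so the rank correlation is perfect while the Pearson correlation is slightly lower.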
Tables 3 and 4 show the results of the different embedding models in the cross-lingual word similarity task. Except in a few cases for the VecMap $_{multistep}$ model, our Meemi transformation proves superior to the base models (at the 0.05 level for paired t-tests over all language pairs) and to all their unsupervised variants. For distant languages, where the results are lower overall, our Meemi transformation proves useful, generally outperforming the best VecMap models. As in the bilingual dictionary induction task, the weighted version of Meemi proves robust only on English-Farsi (Table 3), which suggests that this weighting scheme is most useful for distant languages, as in this case the Farsi monolingual space (which is learned from a smaller corpus and hence, as we will see in the next section, has a lower quality) gets closer to the English monolingual space. As far as the multilingual model is concerned, it proves beneficial in all cases with respect to the orthogonal version of VecMap, as well as compared to the bilingual variant of Meemi.
As for the results with distant languages in Table 4 (using pre-trained FastText embeddings), the trend is even more pronounced. Meemi helps improve the performance in all languages for the MUSE and VecMap orthogonal methods, and it also improves the performance of VecMap $_{multistep}$ in Arabic, Hebrew and Estonian.
5.2 Monolingual performance
One of the advantages of breaking the orthogonality of the transformation is the potential to improve the monolingual quality of the embeddings. To test the difference between the original word embeddings and the embeddings obtained after applying the Meemi transformation, we take monolingual word similarity as a benchmark. Given a word pair, this task consists in assessing the semantic similarity between both words in the pair, in this case from the same language. The evaluation is then performed in terms of Spearman and Pearson correlation with respect to human judgements. In particular, we use the monolingual data sets (English, Spanish, German and Farsi) from the SemEval-17 task on multilingual word similarity. The results provided by the original monolingual FastText embeddings are also reported as baseline.
Table 5 shows the results on the monolingual word similarity task. In this task, our multilingual model representing seven languages in a single space clearly stands out, obtaining the best overall results for English, Spanish and Italian and improving over the base VecMap $_\textrm{ortho}$ model on the rest. With the exception of German, where the multi-step framework of Artetxe et al. (2018a) proves most effective, the plain Meemi transformation improves over the base models, for both VecMap and MUSE.
6. Extrinsic evaluation
We complement the intrinsic evaluation experiments, which are typically a valuable source for understanding the properties of the vector spaces, with downstream extrinsic cross-lingual tasks. This evaluation is especially necessary given that intrinsic behaviour does not always correlate well with downstream performance (Bakarov, Suvorov, and Sochenkov 2018; Glavaš et al. 2019). For this extrinsic evaluation we focus on the following question: how does our post-processing method help alleviate limitations of cross-lingual models that are due to their use of orthogonality constraints? In particular, we perform experiments with the orthogonal model of VecMap (i.e., VecMap $_{\text{ortho}}$ ), in combination with the proposed Meemi strategy, both in bilingual and multilingual settings. For the latter case, we considered all six languages, that is, Spanish, Italian, German, Finnish, Farsi and Russian, keeping English as the target language.
The tasks considered are cross-lingual hypernym discovery (Section 6.1) and cross-lingual natural language inference (Section 6.2).
6.1 Cross-lingual hypernym discovery
Hypernymy is an important lexical relation, which, if properly modeled, directly impacts downstream NLP tasks such as semantic search (Hoffart, Milchevski, and Weikum 2014; Roller and Erk 2016), question answering (Prager et al. 2008; Yahya et al. 2013) or textual entailment (Geffet and Dagan 2005). Hypernyms, in addition, are the backbone of taxonomies and lexical ontologies (Yu et al. 2015), which are in turn useful for organising, navigating and retrieving online content (Bordea, Lefever, and Buitelaar 2016). We propose to evaluate the quality of a range of cross-lingual vector spaces in the extrinsic task of hypernym discovery, that is, given an input word (e.g., ‘cat’), retrieve or discover its most likely (set of) valid hypernyms (e.g., ‘animal’, ‘mammal’, ‘feline’ and so on). Intuitively, by leveraging a bilingual vector space condensing the semantics of two languages, one of them being English, the need for large amounts of training data in the target language may be reduced.Footnote f
The base model is a (cross-lingual) linear transformation trained with hyponym-hypernym pairs (Espinosa-Anke et al. 2016), which is afterwards used to predict the most likely (set of) hypernyms given a new term. Training and evaluation data come from the SemEval 2018 Shared Task on Hypernym Discovery (Camacho-Collados et al. 2018). Note that current state-of-the-art systems aimed at modeling hypernymy (Shwartz, Goldberg, and Dagan 2016; Bernier-Colborne and Barriere 2018; Held and Habash 2019) combine large amounts of annotated data along with language-specific rules and cue phrases such as Hearst Patterns (Hearst 1992), both of which are generally scarce (if available at all) for languages other than English. As a reference, we have included the best performing unsupervised system for both Spanish and Italian (we will refer to this baseline as BestUns). This unsupervised baseline is based on the distributional models described in Shwartz, Santus, and Schlechtweg (2017).
As such, we report experiments (Table 6) with training data only from English (11,779 hyponym-hypernym pairs), and enriched models informed with relatively few training pairs (500, 1K, and 2K) from the target languages. Evaluation is conducted with the same metrics as in the original SemEval task, that is, Mean Reciprocal Rank (MRR), Mean Average Precision (MAP) and precision at 5 ( $P@5$ ). Specifically, MRR rewards the position of the first correct retrieved hypernym:
$$\textrm{MRR} = \frac{1}{|Q|}\sum_{i=1}^{|Q|}\frac{1}{rank_i}$$
where Q is a sample of experiment runs and $rank_i$ refers to the rank position of the first relevant outcome for the ith run. However, in this hypernym discovery data set, the vast majority of terms accept more than one correct hypernym, which is why MAP was considered as the official task metric in the SemEval task. This metric is defined as follows:
$$\textrm{MAP} = \frac{1}{|Q|}\sum_{q \in Q} AP(q)$$
where AP (Average Precision) is the average of the $P@{K_1},...,P@{K_n}$ scores, where $K_1,...,K_n$ are the positions where the gold hypernyms appear in the ranking. As the maximum number of hypernyms allowed per term was 15, we only consider the first 15 gold hypernyms in cases where there are more.
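To make the three metrics concrete, the following Python sketch computes MRR, MAP and $P@5$ over a couple of toy runs. It follows the definitions given in the text, with AP averaged over the positions at which gold hypernyms appear in the ranking. Several details are assumptions made for illustration rather than a description of the official SemEval scorer: the function names, the toy rankings, scoring a run with no correct candidate as 0, and dividing $P@k$ by k even when fewer than k candidates are returned.

```python
def reciprocal_rank(ranking, gold):
    """1 / position of the first correct candidate, or 0 if there is none."""
    for position, candidate in enumerate(ranking, start=1):
        if candidate in gold:
            return 1.0 / position
    return 0.0

def precision_at(ranking, gold, k):
    """Fraction of the top-k candidates that are gold hypernyms.
    Assumption: the denominator is always k."""
    return sum(1 for candidate in ranking[:k] if candidate in gold) / k

def average_precision(ranking, gold):
    """Mean of P@K over the positions K where gold hypernyms appear."""
    hit_positions = [position
                     for position, candidate in enumerate(ranking, start=1)
                     if candidate in gold]
    if not hit_positions:
        return 0.0
    return sum(precision_at(ranking, gold, k) for k in hit_positions) / len(hit_positions)

def evaluate_hypernyms(runs, max_gold=15):
    """runs: list of (ranking, gold_list) pairs, one per input term.
    Only the first `max_gold` gold hypernyms of each term are considered."""
    rr, ap, p5 = [], [], []
    for ranking, gold_list in runs:
        gold = set(gold_list[:max_gold])
        rr.append(reciprocal_rank(ranking, gold))
        ap.append(average_precision(ranking, gold))
        p5.append(precision_at(ranking, gold, 5))
    n = len(runs)
    return {"MRR": sum(rr) / n, "MAP": sum(ap) / n, "P@5": sum(p5) / n}

# Two toy runs: system ranking and gold hypernyms for each input term.
toy_runs = [
    (["animal", "vehicle", "mammal", "plant", "feline"], ["animal", "mammal", "feline"]),
    (["fruit", "tool", "device"], ["device"]),
]
scores = evaluate_hypernyms(toy_runs)
print(scores)
```

In the first run the gold hypernyms appear at positions 1, 3 and 5, so AP is the mean of $P@1$, $P@3$ and $P@5$; in the second run the only gold hypernym is found at position 3, which gives a reciprocal rank and an AP of 1/3.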
We report comparative results between the following systems: VecMap $_\textrm{uns}$ (the unsupervised variant), VecMap $_\textrm{ortho}$ (the orthogonal transformation variant), VecMap $_{multistep}$ (the supervised multi-stage variant) and three Meemi variants: Meemi (VecMap), Meemi $_\textrm{w}$ (VecMap) and Meemi-multi (VecMap). The first noticeable trend is the better performance of the unsupervised VecMap version versus its supervised orthogonal and multi-step counterparts. Nevertheless, we find remarkably consistent gains over both VecMap variants when applying Meemi, across all configurations for the two language pairs considered. In fact, the weighted (Meemi $_\textrm{w}$ ) version brings an increase in performance between 1 and 2 MRR and MAP points across the whole range of target language supervision (from zero to 2K pairs). This is in contrast to the intrinsic evaluation, where the weighted model did not seem to provide noticeable improvements over the plain version of Meemi. Finally, concerning the fully multilingual model, the experimental results suggest that, while still better than the orthogonal baselines, it falls short when compared to the weighted bilingual version of Meemi. This result suggests that exploring weighting schemes for the multilingual setting may bring further gains, but we leave this extension for future work.
6.2 Cross-lingual natural language inference
The task of natural language inference (NLI) consists in detecting entailment, contradiction or neutral relations in pairs of sentences. In our case, we test a zero-shot cross-lingual transfer setting where a system is trained with English corpora and is then evaluated on a different language. We base our approach on the assumption that better aligned cross-lingual embeddings should lead to better NLI models, and that the impact of the input embeddings may become more apparent in simple methods than in, for instance, complex neural network architectures. Hence, and also to account for the coarser linguistic granularity of this task (being a sentence classification problem rather than word level), we employ a simple bag-of-words approach where a sentence embedding is obtained through word vector averaging. We then train a linear classifierFootnote g to predict one of the three possible labels in this task, namely entailment, contradiction or neutral. We use the full MultiNLI English corpus (Williams, Nangia, and Bowman 2018) for training and the Spanish and German test sets from XNLI (Conneau et al. 2018b) for testing. For comparison, we also include a lower bound obtained by considering English monolingual embeddings for input; in this case, FastText trained on the UMBC corpus, which is the same model used to obtain multilingual embeddings.
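The bag-of-words sentence representation can be illustrated with a minimal Python sketch. The toy vectors, the function name and the handling of out-of-vocabulary tokens (they are simply skipped) are assumptions made for this example. How the premise and hypothesis embeddings are combined and fed to the linear classifier is not specified in this section, so that step is left out.

```python
def sentence_embedding(tokens, vectors, dim):
    """Average the vectors of the tokens found in `vectors`.
    Assumption: unknown tokens are skipped; an all-zero vector is
    returned when no token is known."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

# Toy cross-lingual space: English and Spanish words share the same axes.
shared_space = {
    "dog": [1.0, 0.0], "sleeps": [0.0, 1.0],     # English
    "perro": [0.9, 0.1], "duerme": [0.1, 0.9],   # Spanish
}

english = sentence_embedding(["the", "dog", "sleeps"], shared_space, dim=2)
spanish = sentence_embedding(["el", "perro", "duerme"], shared_space, dim=2)
print(english, spanish)
```

Because both sentences are embedded in the same space, a classifier trained only on English sentence vectors can be applied unchanged to the Spanish ones, which is what makes the zero-shot setting possible. The better the word-level alignment, the closer the two averaged vectors will be for sentences with the same meaning.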
Accuracy results are shown in Table 7. The main conclusion in light of these results is the remarkable performance of the unsupervised VecMap model and, most notably, multilingual Meemi for both Spanish and German, clearly outperforming the orthogonal bilingual mapping baseline. Our results are encouraging for two reasons. First, they suggest that, at least for this task, collapsing several languages into a unified vector space is better than performing pairwise alignments. Second, they highlight the inherent benefit of having one single model accounting for an arbitrary number of languages.
7. Analysis
We complement our quantitative (intrinsic and extrinsic) evaluations with an analysis aimed at discovering the most salient characteristics of the transformation that is found by Meemi. We present a qualitative analysis with examples in Section 7.1, as well as an analysis on the impact of the size of training dictionaries in Section 7.2 and on the performance of the multilingual model in Section 7.3.
7.1 Studying word translations
Table 8 lists a number of examples where, for a source English word, we explore its highest ranked cross-lingual synonyms (or word translations) in a target language. We select Spanish as a use case.
Let us study the examples listed in Table 8, as they constitute illustrative cases of linguistic phenomena which go beyond correct or incorrect translations. First, the word ‘crazy’ is correctly translated by both VecMap and Meemi; loco (masculine singular), locos (masculine plural) or loca (feminine) being standard translations, with no further connotations, of the source word. However, the most interesting finding lies in the fact that for Meemi-multi, the preferred translation is a colloquial (or even vulgar) translation which was not considered as correct in the gold test dictionary. The Spanish word chifladas translates to English as ‘going mental’ or ‘losing it’. Similarly, we would like to highlight the case of ‘telegraph’. This word is used in two major senses, namely to refer to a message transmitter and as a reference to media outlets (several newspapers have the word ‘telegraph’ in their name). VecMap and Meemi (correctly) translate this word into the common translation telégrafo (the transmission device), whereas Meemi-multi prefers its named-entity sense.
Other cases, such as ‘conventions’ and ‘discover’, are examples to illustrate the behaviour for common ambiguous words. In both cases, candidate translations are either misspellings of the correct translation (descubr for ‘discover’) or misspellings involving tokens conflating two words whose compositional meaning is actually a correct candidate translation for the source word; for example, legislaciones nacionales (‘national rulings’) for ‘conventions’. Finally, ‘remark’ offers an example of a case where ambiguity causes major disruptions. In particular, ‘remark’ translates in Spanish to observación, which in turn has an astronomical sense; ‘astronomical observatory’ translates to observatorio astronómico.
7.2 Impact of training dictionary and corpus size
Our method relies on the availability of suitable bilingual training dictionaries, where we can expect that the size of these dictionaries should have a clear impact on the quality of the final transformation. This is analysed in Figure 2 for the task of cross-lingual word similarity. The figure shows the absolute improvement (in percentage points) over VecMap by applying Meemi, using different training dictionary sizes for supervision.
As can be observed, using Meemi improves the results, for all language pairs, when dictionaries of 8K, 5K or 3K word pairs are used, but its performance heavily drops with dictionaries of smaller sizes (i.e., 1K and especially 100). In fact, having a larger dictionary helps avoid overfitting, which is a recurring problem in cross-lingual word embedding learning (Zhang et al. 2017a). The most remarkable case is that of Farsi, where Meemi improves the most, but where access to a sufficiently large dictionary becomes even more important. This behaviour clearly shows under which conditions our proposed final transformation can be applied with higher success rates. We leave exploring larger dictionaries and their impact in different tasks and languages for future work.
On the other hand, we have observed that while corpus size plays a role in the performance of our models, it is not as notable as it might seem at first. Given the different corpus sizes of the data we used to train our monolingual embeddings, we analysed the correlation between these sizes, mentioned in Section 4.1, and the performance figures presented in Tables 1, 3 and 5. The average Pearson correlation across multilingual models in dictionary induction, where all languages are available, is 0.38 (discarding VecMap $_\textrm{uns}$ due to its anomalous results for Finnish), while for cross-lingual and monolingual word similarity it is 0.69 and 0.65, respectively. Note, however, that in these latter cases we are missing two distant languages, that is, Finnish and Russian.
7.3 Multilingual performance
In this section we assess the benefits of our proposed multilingual integration (cf. Section 3.2). To this end, we measure fluctuations in performance as more languages are added to the initially bilingual model. Thus, starting from a bilingual embedding space obtained with VecMap $_\textrm{ortho}$ , we apply Meemi over a number of aligned spaces, which ultimately leads to a fully multilingual space containing the following languages: Spanish, Italian, German, Finnish, Farsi, Russian and English. This latter language is used as the target embedding space for the orthogonal transformations due to it being the richest in terms of resource availability.
To avoid a lengthy and overly exhaustive approach where all possible combinations from two to seven languages are evaluated, we opted for conducting an experiment where languages are divided into two groups and added one by one in a fixed order: the first group is formed by languages that obtain the best alignments with English in previous experiments, which broadly coincides with those that are closer to English in terms of language family and alphabet (i.e., Spanish, Italian and German), and the second group is formed by the remaining languages (i.e., Finnish, Farsi and Russian). However, this approach does not allow us to use, for example, the English-Farsi test set until reaching the fifth step. To solve this, if the language that is needed for the test set has not yet been included, we replace the last language that was added by the one that is needed for the test set. For instance, while we normally add Italian as the second source language (resulting in the trilingual space en-es-it), for the English-German test set, the results are instead based on a space where we added German instead of Italian (i.e., the trilingual space en-es-de). In Table 9, we show the results obtained by the multilingual models in bilingual dictionary induction.
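The replacement rule described above can be made explicit with a short Python sketch. It reflects our reading of the procedure rather than the actual experimental code; the function name and the two-letter language codes are chosen here for illustration only.

```python
# Fixed order in which source languages are added; English is always the target.
ADDITION_ORDER = ["es", "it", "de", "fi", "fa", "ru"]

def space_for_step(num_sources, test_language):
    """Languages in the space used to evaluate the English-`test_language`
    test set once `num_sources` source languages have been added."""
    if not 1 <= num_sources <= len(ADDITION_ORDER):
        raise ValueError("num_sources must be between 1 and 6")
    sources = ADDITION_ORDER[:num_sources]  # slicing returns a new list
    if test_language not in sources:
        # The test language has not been reached yet: it takes the place
        # of the language that was added last.
        sources[-1] = test_language
    return ["en"] + sources

print("-".join(space_for_step(2, "it")))  # en-es-it
print("-".join(space_for_step(2, "de")))  # en-es-de
print("-".join(space_for_step(5, "fa")))  # en-es-it-de-fi-fa
```

With this rule every test set can be evaluated at every step, at the cost of the intermediate spaces differing slightly from one test language to another.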
The best results are achieved when more than two languages are involved in the training, which is in line with the results obtained in the rest of the tasks and highlights the ability of Meemi to successfully exploit multilingual information to improve the quality of the embedding models involved. In general, the performance fluctuates more significantly when adding the first additional language to the bilingual models and then stabilises at a similar level to the bilingual case when adding more distant languages.
8. Conclusion
In this article, we have presented an extended study of Meemi, a simple post-processing method for improving cross-lingual word embeddings which was first presented in Doval et al. (2018). Our initial goal was to learn improved bilingual alignments from those obtained by state-of-the-art cross-lingual methods such as VecMap (Artetxe et al. 2018a) or MUSE (Conneau et al. 2018a). We do this by applying a final unconstrained linear transformation to their initial mappings. Our extensive evaluation reveals that Meemi, using only dictionary translations as supervision, can improve on the supervised and unsupervised variants of these models, in both close and distant languages. This also confirms findings from recent work that unsupervised models may be more brittle than supervised models, even if these are using only word translations as supervision (Vulić et al. 2019).
In this work, we have also gone beyond the bilingual setting by exploring an extension of the original Meemi model to align embeddings from an arbitrary number of languages in a single shared vector space. In particular, we take advantage of the fact that, assuming the initial alignment was obtained with an orthogonal mapping, Meemi can naturally be applied to any number of languages through a single linear transformation per language.
Regarding the evaluation, we extended the language set to include, in addition to the usual Indo-European languages such as English, Spanish, Italian or German, other distant languages such as Finnish, Farsi and Russian. The results we report in this article show that Meemi is highly competitive, consistently yielding better results than competing baselines, especially in the case of distant languages. We are particularly encouraged by the multilingual results, which suggest that bringing together distant languages from different families in a shared vector space is beneficial in most cases.
9. Future work
We will continue to explore the possibilities of post-processing multilingual models, investigating their impact in different tasks. Given the fact that going from restrictive orthogonal transformations to the less constrained Meemi transformation was found to be beneficial in the integration of monolingual models, it remains to be seen whether there are benefits in further fine-tuning the alignment, in the form of some kind of constrained non-linear transformation.
Given the recent breakthroughs in multilingual contextualized language models such as mBERT (Devlin et al. 2019), we also plan on exploring the use of static (i.e., non-contextualized) cross-lingual word embeddings as prior knowledge for those models, as was suggested by Artetxe et al. (2020) (see ending of Section 1). More specifically, instead of freezing the pretrained input embeddings when training the contextualized model, it would be interesting to analyse the effect of updating the parameters of the cross-lingual word vectors jointly with the rest of the language model. An advantage of our cross-lingual vectors, compared to the ones that were considered by Artetxe et al. (2020), is that we can train them on a wider range of languages (i.e., not just bilingual), which would allow for a more comprehensive exploitation of multilingual training corpora.
Acknowledgements
Yerai Doval has been supported by the Spanish Ministry of Economy, Industry and Competitiveness (MINECO) through the ANSWER-ASAP project (TIN2017-85160-C2-2-R); by the Spanish State Secretariat for Research, Development and Innovation (which belongs to MINECO) and the European Social Fund (ESF) through an FPI fellowship (BES-2015-073768) associated with the TELEPARES project (FFI2014-51978-C2-1-R); and by the Xunta de Galicia through the TELGALICIA research network (ED431D 2017/12). This work was partly supported by ERC Starting Grant 637277.