1. Introduction
Significant progress in natural language processing (NLP) applications has been driven by the dramatic increase in the availability of pre-trained models. Popular models such as BERT (Devlin et al. 2019), RoBERTa (Liu et al. 2019), GPT-3 (Brown et al. 2020), and T5 (Raffel et al. 2020) have shown promising results on important NLP problems such as question-answering (QA), summarization, machine translation, and sentiment analysis (Liu, Chen, and Xu 2022; Ofoghi, Mahdiloo, and Yearwood 2022; Suleiman and Awajan 2022; Yadav et al. 2022; Zhu 2021b). The availability of such pre-trained models has reduced the need for task-specific supervised data, expensive hardware, and long training times. As a result, fine-tuning on a small supervised set is enough to achieve benchmark results (Zhu 2021a; Zhang et al. 2021). Resource-rich languages (RRLs) such as English, German, and French benefit from pre-trained models even in the zero-shot setup (Kaur, Pannu, and Malhi 2021).
On the contrary, low-resource languages (LRLs) are struggling, with the lack of supervised task-specific data being their bottleneck (Pandya and Bhatt 2021). Figure 1 compares the number of Wikipedia articles in English (en) and the Indic languages Hindi (hi), Bengali (bn), and Telugu (te). Based on this analysis, it is apparent that the available monolingual corpora for LRLs (2.2 percent for Hindi, 1.7 percent for Bengali, and 1.1 percent for Telugu) are negligible compared to English (95.0 percent). Training language models (LMs) requires a huge corpus (Devlin et al. 2019; Conneau et al. 2020). This creates a need for LM learning techniques that allow LRLs to leverage RRLs.
The transfer-learning mechanism has improved the ability of LRLs when combined with multilingual models like mBERT (Devlin et al. 2019), XLM-R (Conneau et al. 2020), and IndicBERT (Kakwani et al. 2020). Artetxe et al. (2020) have shown that monolingual models can provide multilingual-compatible performance for LRLs using unsupervised low-resource data and supervised task-specific data in RRLs, improving monolingual models for LRLs to be comparable to multilingual models. Other recent approaches to support LRLs in multilingual model environments include transliterating LRLs into a related prominent language (Khemchandani et al. 2021), utilizing overlapping tokens when learning LRL embeddings (Pfeiffer et al. 2021), model fine-tuning with an expanded vocabulary (Wang, Mayhew, and Roth 2020), and training light-weight adapter layers over a fixed pre-trained model (Pfeiffer et al. 2020b). The impact of the linguistic family, and the use of RRL family members for LRL acquisition, demand further investigation. A distinctive departure in our methodology is that we pursue a joint model training strategy across cognate languages, followed by task-specific refinement through fine-tuning with a language of the same family.
Additionally, our model not only leverages annotated data from the RRL but also incorporates data from the family-linked low-resource language (family-LRL) to enhance performance on the downstream QA task.
The Indic languages Hindi and Bengali belong to the same language family and thus share the genealogical similarity of the Indo-Iranian group, whereas Telugu, despite being part of the Dravidian language family, has an embedding representation nearer to Hindi (Kudugunta et al. 2019) due to its typological similarity (Littell et al. 2017). Moreover, all three languages share the same subject-object-verb order, in contrast to the subject-verb-object order of English. The evolutionary and geographical proximity generates a high word-sharing ratio (Khemchandani et al. 2021) among these languages compared with their word-overlap ratio with English. In this work, we investigate the impact of the Bengali and Telugu languages on Hindi zero-shot and few-shot question-answering.
To observe the language structure of the Indic languages, for a given question, we compared the sequence and position of answer words in the context paragraph across all the languages. One such example from the XQuAD dataset is shown in Table 1. The answer words in Hindi are closer to the answer words of the translated Bengali and Telugu statements. In English, however, the answer words are usually found at the end of the sentence. For all the languages, we have underlined the words that overlap between the question and the context statement. We have also plotted parallel Hindi, Bengali, Telugu, and English words in the embedding space to see how far apart their embedding representations are. The observations and analysis section of our paper contains a detailed analysis of the embedding distance charts.
Note: Words that appear in both the question and context are underlined, and similar introductory sentence structures are highlighted in red across all languages. The expected answer is highlighted with a yellow background.
In this article, we examine the suitability of multilingual task-specific training for improving the performance of monolingual QA. In the absence of a sufficient supervised monolingual corpus, system performance falls short. To address the problem of small corpus size, we propose utilizing a combination of unsupervised corpora from more than one language. To investigate the impact of combining the data of multiple languages and of language relatedness, we show that the performance of multilingual models can be improved with small task-specific supervision from a language of the same language family. Specifically, we performed our experiments using QA data of three Indic languages (Hindi, Bengali, and Telugu) from the TyDi (Clark et al. 2020), MLQA (Lewis et al. 2020), and XQuAD (Artetxe et al. 2020) datasets. Our experiments are conducted on pre-trained XLM-R, mBERT, and IndicBERT transformer models. For all the experiments, we used English as the RRL.
Our major contributions are the following:
(1) To address the data scarcity problem in task-specific learning, we propose learning a task for an LRL (Hindi) using supervised data of an RRL (English) and supervised task-specific data of another LRL (Bengali/Telugu) that has structural and grammatical similarity with Hindi.
(2) In our experiments, we analyze the impact of swapping the few-shot task learning sequence between the LRL (Hindi) and the other LRL (Bengali/Telugu). This experiment lets us observe how the order of languages impacts overall learning when fine-tuning the model with multiple languages.
(3) The generalized nature of the proposed approach allows it to be extended to any transformer architecture. To verify this, we conducted our experiments on XLM-R, mBERT, and IndicBERT, all of which follow different architectures.
2. Related work
The development of multilingual transformer models like mBERT (Devlin et al. 2019), XLM-R (Conneau et al. 2020), and IndicBERT (Kakwani et al. 2020) has shown significant performance gains on LRLs. The approach of decoupled encoding to identify related subwords is explored by Wang et al. (2019). Recent articles (Hu et al. 2020; Lauscher et al. 2020) revealed that cross-lingual transfer cannot be achieved simply by joint training on an LRL and an RRL, owing to the model's limited capacity to accommodate several languages at the same time.
Research on linguistic relatedness begins with a translated parallel corpus and cross-lingual training on concatenated parallel data (Conneau and Lample 2019). Efforts have been made to overcome the mediocre performance caused by poor translation (Goyal, Kumar, and Sharma 2020; Khemchandani et al. 2021; Song et al. 2020) by adopting RRL transliteration to generate parallel data in the LRL. Chung et al. (2020) proposed multilingual task learning using a weighted clustered vocabulary. Cao, Kitaev, and Klein (2020) and Wu and Dredze (2020) explored altering contextual embeddings by bringing the embeddings of aligned words closer together to achieve efficient cross-lingual transfer.
Wang et al. (2020) propose that monolingual LRL performance can be improved by extending the embedding layer with LRL-specific weights. Adding language adapters and task adapters (Pfeiffer et al. 2020a, 2020b; Houlsby et al. 2019) on top of transformer models to boost LRL performance is a recent approach to improvement through small fine-tuning steps (Artetxe et al. 2020; Pandya, Ardeshna, and Bhatt 2021; Üstün et al. 2020). The approach of learning a tokenizer for the LRL and exploiting the lexical overlap between LRL and RRL in the embedding is adopted by Pfeiffer et al. (2021).
2.1 Embedding learning for the low-resource setup
For embedding learning, it is exceedingly difficult to obtain monolingual data for a majority of languages in the Indian language family. Hence, the monolingual embeddings for these languages are usually of poor quality (Michel, Hangya, and Fraser 2020). Eder, Hangya, and Fraser (2021) suggested an embedding method that starts with a small bilingual seed dictionary and pre-trained monolingual embeddings of the RRLs. Adams et al. (2017) demonstrated that training monolingual embeddings for LRLs and RRLs together improves the monolingual embedding quality of LRLs. Lample et al. (2018b) trained fastText skipgram embeddings (Bojanowski et al. 2017) to learn a joint embedding of the source and target languages using joint corpora. Vulić et al. (2019) showed that the unsupervised approach (Artetxe, Labaka, and Agirre 2018) cannot efficiently handle LRLs and multiple distant language pairs.
Using adversarial training, Zhang et al. (2017) demonstrated that monolingual alignment is possible without bilingual data. Lample et al. (2018a) combined Procrustes-analysis refinement with adversarial training to obtain an unsupervised mapping. The bottleneck with the mapping approaches lies in their dependence on high-quality monolingual embedding spaces. The approach reported by Wang et al. (2020) benefits from the conjunction of joint and mapping methods: they first train on combined monolingual datasets, and subsequently map the source and target embeddings after reallocating the over-shared vocabularies.
2.2 Language relatedness
The idea of using RRLs to improve LRLs is to reduce the need for supervised data in the LRL. Nakov and Ng (2009) proposed a statistical machine translation model that requires a few parallel samples of the source LRL and the target RRL, in addition to a large parallel corpus of the target RRL and another RRL related to it. During transfer learning from RRL to LRL, Nguyen and Chiang (2017) exploited shared word embeddings.
Until now, only a few works have considered using information from related RRLs for low-resource embeddings (Woller, Hangya, and Fraser 2021). Many researchers have looked at the idea of joint training, which either necessitates a huge training corpus or relies on pre-trained monolingual embeddings (Ammar et al. 2016; Alaux et al. 2019; Chen and Cardie 2018; Devlin et al. 2019; Heyman et al. 2019).
A major distinction of our approach from those above is that we explore joint model training on related languages followed by task-specific fine-tuning with a language of the same family. Instead of addressing individual languages independently, our approach capitalizes on the inherent linguistic relationships between related languages. By training a model collectively across related languages, we harness the potential for cross-lingual knowledge transfer, allowing the model to develop a more comprehensive understanding of the underlying linguistic structures and nuances shared within the language family. This aligns with the linguistic theory that languages with common ancestry exhibit similarities in syntax, semantics, and other linguistic attributes. Furthermore, the subsequent task-specific fine-tuning using the same family language reflects the idea that linguistic features and patterns learned during the joint training phase can be further refined to suit the specific task at hand. This two-step process not only capitalizes on the advantages of cross-lingual training but also tailors the model to domain-specific challenges within the context of the selected family language.
In addition, our approach depends on MLM training with family languages to establish the word correlation between them. Both languages here fall under the LRL category; hence, joint MLM training before fine-tuning on the downstream task, together with the customized learning framework, obviates the need for corpus translation or transliteration, thereby avoiding the laborious processes of data labeling and the adjustment of start and end tokens inherent in standard supervised QA datasets.
3. Proposed approach of cross-lingual language learning
This section describes our approach to transferring knowledge from one language (L2, Bengali/Telugu in this case) to another (L1, Hindi in this case). Here, L1 and L2 both fall under the category of LRLs, with monolingual unsupervised datasets available. Additionally, a training corpus for the downstream task is available for L2. The proposed method is summarized in the following steps and explained in detail in the subsequent paragraphs:
(1) Jointly train the embedding layer of a pre-trained transformer model on L1 and L2 with the masked language modeling (MLM) objective and unlabeled text corpora. Here, pre-training is heavily biased toward RRLs (L3). During MLM training, all layers except the embedding are kept frozen.
(2) Fine-tune the model on the downstream task using labeled data in L3, keeping the embedding frozen. As a result, the model retains the acquired lexical representations without alteration while improving its capacity to obtain precise answers.
(3) Further fine-tune the model on the downstream task using labeled data of L2 and L1. During this step, the embedding layer is again kept frozen. Here, we analyze the impact of changing the learning sequence from L2-L1 to L1-L2 in a few-shot setup.
(4) In a transfer-learning configuration, replace the embedding layer of the above setup with the embedding layer learned in step (1).
(5) Evaluate the performance of the model on the downstream task using a test dataset of L1.
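The five steps above can be sketched in code. The sketch below is illustrative only: the model is reduced to a dict of named parameter groups, and the `train()` helper, the group names, and the corpus labels are hypothetical stand-ins, not the authors' implementation.

```python
# Illustrative sketch of the five training steps, assuming a model reduced to
# named parameter groups. train(), the group names, and the corpus labels are
# hypothetical stand-ins, not the authors' actual code.
import copy

def train(model, data, trainable):
    """Placeholder update step: bump a counter on each trainable group."""
    for name in trainable:
        model[name]["updates"] += 1
    return model

model = {name: {"updates": 0} for name in ("embedding", "encoder", "qa_head")}

# Step 1: joint L1+L2 MLM training; only the embedding layer is updated.
model = train(model, data=["hi+bn unlabeled text"], trainable=["embedding"])
mlm_embedding = copy.deepcopy(model["embedding"])

# Steps 2-3: QA fine-tuning on L3 (English), then L2 and L1; embedding frozen.
for corpus in ("SQuAD (en)", "TyDi (bn/te)", "MLQA (hi)"):
    model = train(model, data=[corpus], trainable=["encoder", "qa_head"])

# Step 4: transfer -- swap in the embedding learned in step 1.
model["embedding"] = mlm_embedding

# Step 5: the resulting model is evaluated on the L1 (Hindi) test set.
```

The key design point the sketch captures is the separation of concerns: the embedding learned in step (1) is never touched by the QA fine-tuning passes and is restored verbatim before evaluation.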
As shown in Fig. 2, we trained the embedding layer of a pre-trained transformer model with an MLM objective. During this training phase, the model is given input token vectors in which some tokens are randomly replaced with a special [MASK] token; the objective is for the model to predict the most suitable vocabulary token for each [MASK] position. Models trained with the MLM objective are helpful for capturing the statistics of a language. We performed MLM training as the first step of our architecture to benefit from the proximity of similar token representations within a language family. Specifically, in our customized training architecture, during step (1), unsupervised data of Hindi (L1) and one of the family languages, Bengali/Telugu (L2), are supplied with a 15 percent masking probability. Random statements from L1 and L2 are given to the model during MLM training to help it capture the commonality within the language family. Except for the embedding, the weights of the other layers are kept unchanged, as the objective here is to learn the embedding representation of the language family.
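The masking procedure in step (1) can be sketched as follows. This is a minimal, hedged illustration using word strings in place of subword IDs; `mask_tokens()` and the toy sentence are our own, not the paper's code.

```python
# Minimal sketch of the 15 percent MLM masking described above, with word
# strings standing in for subword IDs.
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=None):
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)   # the model must predict the original
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)         # position not scored by the MLM loss
    return masked, labels

masked, labels = mask_tokens(["the", "cat", "sat", "on", "the", "mat"], seed=1)
```

(Production MLM setups, such as BERT's, additionally replace a fraction of the selected tokens with random tokens or keep them unchanged; the sketch keeps only the core masking step.)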
To learn the question-answering task, in step (2), we fine-tuned our model using the English SQuAD dataset (Rajpurkar et al. 2016). The objective of this stage is to preserve the acquired lexical representations unaltered while concurrently augmenting the model's capacity to obtain accurate answers.
For the zero-shot learning setup, in step (3), fine-tuning is performed on the L2 TyDi Bengali/Telugu dataset (Clark et al. 2020). In the few-shot setup, we further fine-tuned the QA learning using the MLQA Hindi dataset (Lewis et al. 2020). The rationale behind this step is to strike a balance between capitalizing on the model's existing downstream task knowledge gained through English QA learning (step (2)) and refining its linguistic proficiency in answering questions using the family language L2.
In all QA learning steps, the embedding layer was kept frozen; that is, all transformer layer parameters were updated except those of the embedding layer during this task learning phase. Hence, the word relatedness learned during step (1) is not affected by the task learning phase.
At the end of step (3), in the zero-shot setup, we have (i) an embedding layer aligned with the language relatedness of L1 and L2 and (ii) a transformer model fine-tuned for the downstream task using L3 and L2. Both are combined to measure the performance on the target LRL. So, during the evaluation phase, in step (4), the embedding layer of the model trained on the QA task is replaced with the embedding layer trained in the first step. The new transferred architecture is evaluated on the Hindi test dataset.
In all our experiments, the special transformer symbols ([CLS], [SEP], [UNK], and [MASK]) are shared among all languages.
3.1 Models
We conducted our experiments on XLM-R, mBERT, and IndicBERT to determine the impact of Bengali and Telugu on Hindi. (All three languages belong to the Indian subcontinent and are categorized as LRLs.) The mBERT model is pre-trained on 104 languages, whereas XLM-R is pre-trained on 100 languages; the training sets of both include the Hindi, Bengali, and Telugu languages. IndicBERT is a multilingual ALBERT model trained on a total of twelve Indian languages, including Hindi, Bengali, and Telugu. The reason for including the IndicBERT model in our study is that it has already been trained on a variety of Indian languages; consequently, the influence of Indian-subcontinent languages on IndicBERT appears higher than the influence of RRLs, and the model's lexicon is dominated by Indian languages. We have trained the following models for the zero-shot setup using XLM-R $_{base}$ , XLM-R $_{Large}$ , mBERT, and IndicBERT:
- MODEL-xxMLM-SQuAD: The model is trained on a Wikipedia dump of the Hindi and Bengali/Telugu languages with the MLM objective, followed by task learning using the SQuAD dataset.
- MODEL-xxMLM-SQuAD-TyDi: The MODEL-xxMLM-SQuAD model is further trained using one of the LRLs (L2), as described in the proposed approach.
- MODEL-xxMLM-SQuAD-TyDi-NotFreeze: The training and parameters are the same as for MODEL-xxMLM-SQuAD-TyDi, except that the embedding parameters are not frozen during TyDiQA training. The rationale for this paradigm is that the LRL training focuses on a member of the target language's family, so combining task learning with embedding training could enhance performance. However, we did not observe a significant impact from this change, so we excluded this variant from the training of the few-shot learning models.
For few-shot learning, we used the Hindi dataset from MLQA (Lewis et al. 2020) in addition to the datasets used in the zero-shot setup. We have trained the following models for the few-shot setup using XLM-R $_{Large}$ , mBERT, and IndicBERT:
- MODEL-xxMLM-SQuAD-MLQA: The model is trained on a Wikipedia dump of the Hindi and Bengali/Telugu languages with the MLM objective, followed by task learning using the SQuAD dataset and few-shot task learning on the MLQA dataset.
- MODEL-xxMLM-SQuAD-MLQA-TyDi: The MODEL-xxMLM-SQuAD model is further trained using one of the LRLs (L2), as described in the proposed approach.
- MODEL-xxMLM-SQuAD-TyDi-MLQA: In this training setup, the order of few-shot learning and language family learning is the reverse of the above model.
4. Experimental setup
4.1 Model setup
For training our model, we use the Adam optimizer with a learning rate of 2e-5 and adam_epsilon = 1e-8. All fine-tuning is performed with a single epoch and batch size 4. Values common to all the transformer architectures include warmup_proportion = 0.1, weight_decay = 0.01, initializer_range = 0.02, max_position_embeddings = 512, hidden_act = gelu, position_embedding_type = absolute, max_seq_length for MLM = 128, max_seq_length for QA = 384, doc_stride = 128, n_best_size = 20, and max_answer_length = 30. In addition to these hyperparameters, Table 2 lists the architecture-specific values of the other hyperparameters.
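For reference, the shared settings above can be collected into a single configuration mapping, in the style of a HuggingFace-like config dict. The key names are our own shorthand; only the values come from this section.

```python
# Shared fine-tuning settings from Section 4.1, gathered into one mapping.
# Key names are illustrative shorthand; values are as reported in the text.
common_config = {
    "optimizer": "Adam",
    "learning_rate": 2e-5,
    "adam_epsilon": 1e-8,
    "num_train_epochs": 1,
    "train_batch_size": 4,
    "warmup_proportion": 0.1,
    "weight_decay": 0.01,
    "initializer_range": 0.02,
    "max_position_embeddings": 512,
    "hidden_act": "gelu",
    "position_embedding_type": "absolute",
    "max_seq_length_mlm": 128,
    "max_seq_length_qa": 384,
    "doc_stride": 128,
    "n_best_size": 20,
    "max_answer_length": 30,
}
```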
Additionally, following the standard QA training setup, we truncate the context only when the combined length of the question and context exceeds the model's maximum input size. The question, paired with the remaining context tokens, is given to the model as the next training sample. Along with all predictions, we return an offset mapping to map token indices, which lets us find the end-token position from the predicted start token and the length of the answer.
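This sliding-window treatment of long contexts can be sketched as follows, assuming token lists in place of subword IDs and using the max_seq_length = 384 and doc_stride = 128 values above. The helper name and the budget arithmetic are simplifications (real tokenizers also reserve room for special tokens):

```python
# Sketch of QA feature creation with a sliding window: when question+context
# exceed max_len, the context is split into overlapping chunks of stride
# doc_stride, each paired with the full question and an offset back into
# the original context.
def make_qa_features(question, context, max_len=384, doc_stride=128):
    budget = max_len - len(question)          # room left for context tokens
    features, start = [], 0
    while True:
        chunk = context[start:start + budget]
        features.append({"question": question, "context": chunk,
                         "offset": start})    # maps chunk back to the source
        if start + budget >= len(context):
            break
        start += doc_stride
    return features

q = ["when", "was", "it", "built", "?"]
ctx = [f"tok{i}" for i in range(700)]         # a 700-token context
feats = make_qa_features(q, ctx)
```

Because consecutive chunks overlap by `budget - doc_stride` tokens, an answer span falling near a chunk boundary is still fully contained in at least one feature.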
4.2 Datasets
The datasets used in our experiments fall into two categories: (1) unlabelled text data for the MLM objective and (2) question-answering data in the SQuAD format. This section details our datasets for both categories:
4.2.1 Unlabelled data for MLM objective
To perform embedding learning for the low-resource languages (Hindi, Bengali, and Telugu in our case), we combined the Wikipedia dump, IndicCorp (Kakwani et al. 2020), and the LRL portions of the Samanantar parallel Indic corpora collection (Ramesh et al. 2021). Samanantar provides parallel data between Indic languages and English, as well as among the Indic languages themselves. From that collection, we used the individual (non-parallel) Hindi, Bengali, and Telugu data and combined it with the Wikipedia dump and IndicCorp for each language. The total number of sentences per LRL used in our experiments is shown in Table 3.
4.2.2 Task-specific data for question-answering objective
Initial task-specific training is performed on the RRL (English) SQuAD 1.1 dataset (Rajpurkar et al. 2016). To train the model further on the LRL downstream task, we used the TyDiQA secondary task dataset (Clark et al. 2020) for the Bengali (bn) or Telugu (te) language, depending on the chosen family-LRL. In the few-shot setup, the MLQA dataset (Lewis et al. 2020) is used to train the model on the Hindi (hi) QA task. All of our models are evaluated on the XQuAD (Artetxe, Ruder, and Yogatama 2020) Hindi dataset, which contains 240 paragraphs and 1,190 question-answer pairs.
Note: "xx" indicates the fine-tuning language (bn/te), as specified in the column header. Models are trained jointly on Hindi+Bengali or Hindi+Telugu, followed by QA task learning. Results with $\dagger$ are taken from the XQuAD paper (Artetxe et al. 2020) for comparison purposes.
5. Observations and analysis
This section presents our empirical investigation of the lexical representations of parallel data across the chosen LRLs (Hindi, Bengali, and Telugu) and the RRL (English), accompanied by a discussion of the outcomes obtained in both the few-shot and zero-shot setups with the family-LRL (Bengali/Telugu).
5.1 Analysis of the embedding representation of corresponding terms
To analyze the embedding representations of parallel words in Hindi, Bengali, Telugu, and English, we randomly chose a parallel Hindi and English context paragraph from the XQuAD dataset. To generate the Telugu and Bengali parallel statements, we translated the Hindi context paragraph. Abnar and Zuidema (2020) showed that information originating from different tokens gets increasingly mixed across the layers of a transformer. Hence, instead of looking only at the raw attention in a particular layer, we considered the weighted flow of information from the input embedding to the particular hidden output.
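The weighted information flow referred to here can be approximated by attention rollout in the sense of Abnar and Zuidema (2020): each layer's attention matrix is averaged with the identity to account for the residual connection, and the resulting per-layer matrices are multiplied together. The sketch below, with toy 2x2 attention matrices, is our own illustration of that idea, not the paper's code.

```python
# Attention rollout sketch: mix each layer's attention with the identity
# (residual path), then compose the layers by matrix multiplication.
def rollout(attentions):
    n = len(attentions[0])
    # start from the identity: each token initially attends only to itself
    joint = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for attn in attentions:
        # average with the identity to model the residual connection
        mixed = [[0.5 * attn[i][j] + 0.5 * (i == j) for j in range(n)]
                 for i in range(n)]
        joint = [[sum(mixed[i][k] * joint[k][j] for k in range(n))
                  for j in range(n)] for i in range(n)]
    return joint

layers = [[[0.9, 0.1], [0.2, 0.8]],   # layer-1 attention (rows sum to 1)
          [[0.6, 0.4], [0.5, 0.5]]]   # layer-2 attention
flow = rollout(layers)                # flow[i][j]: weight of input j at output i
```

Each row of the rolled-out matrix remains a distribution over input tokens, so `flow[i][j]` can be read as the total weight that input token j contributes to output position i across all layers.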
To interpret the effect of specific word representations, we extracted the output vectors of the first, second, penultimate, and last layers. Additionally, we calculated the average and the concatenation of the vectors from the last four layers to obtain combined vectors. Next, we calculated the cosine distance between the vectors in the high-dimensional space to observe their proximity. Finally, using the multidimensional scaling technique, we plotted the distance matrix. Figure 3 illustrates our data representation approach.
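The distance computation step can be sketched as follows. The three-dimensional vectors and word labels below are made-up toys (real vectors come from the transformer layers just described); only the cosine-distance formula and the matrix construction reflect the actual procedure.

```python
# Cosine-distance matrix over (toy) word vectors, as fed to multidimensional
# scaling for plotting. The vectors and labels are hypothetical.
import math

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

words = {                      # hypothetical 3-d embeddings of parallel words
    "hi:paani":  [1.0, 0.1, 0.0],
    "bn:paani":  [0.9, 0.2, 0.1],
    "en:water":  [0.1, 1.0, 0.8],
}
keys = list(words)
dist = [[cosine_distance(words[a], words[b]) for b in keys] for a in keys]
```

In this toy setup the Hindi-Bengali distance comes out much smaller than the Hindi-English one, mirroring the qualitative pattern reported in Figure 4.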
Figure 4 shows the representation results obtained with the mBERT model trained on the unsupervised Hindi+Bengali dataset with an MLM objective. Here, the averaged and concatenated results show that, due to the joint training and language proximity, the distance of Bengali words from Hindi is smaller than that of Telugu and English words. Moreover, even though Telugu was not included in the MLM training, the representations of Telugu words are nearer to Hindi than those of English words. This result further attests to the Indic language family relatedness concept.
Note: "xx" indicates the MLM fine-tuning language (bn/te), as specified in the column header. Models undergo MLM fine-tuning jointly on Hindi+Bengali or Hindi+Telugu, followed by QA task learning. Results with $\dagger$ are taken from the original XQuAD paper (Artetxe et al. 2020) for comparison purposes.
Note: "xx" indicates the fine-tuning language (bn/te), as specified in the column header. Models are trained jointly on Hindi+Bengali or Hindi+Telugu, followed by QA task learning.
5.2 Performance observation in the context of utilizing Bengali as the family language
Our observations for the few-shot setup are shown in Table 4. With Bengali as the family language, we note that few-shot learning followed by language family learning improves the benchmark F1 score of mBERT by 6.42 percent. However, swapping the order of language family learning and few-shot learning yields an F1 improvement of 10.86 percent, which is 4.44 percent better than the preceding setup. Similar results are obtained for XLM-R $_{Large}$ , where language family learning followed by few-shot learning improves performance by 5.04 percent compared to the reverse learning setup.
Note: "xx" indicates fine-tuning language, as specified in the column header. Models are trained jointly on Hindi+Bengali or Hindi+Telugu followed by QA task learning.
5.3 Performance observation in the context of utilizing Telugu as the family language
With Telugu as the family language, the sequence of family learning followed by few-shot learning improves the mBERT result by 1.76 percent and the XLM-R $_{Large}$ result by 5.38 percent. All these analyses show the importance of keeping the few-shot setup at the end of fine-tuning, as it better tunes the parameters toward the target language. We also observed that, in the zero-shot setup, the joint MLM training helps achieve performance comparable to the benchmark, even when the task training uses only a single epoch of the SQuAD dataset. However, the additional task learning using the family-language dataset from TyDiQA degrades the performance of the architecture. Table 5 shows our results on the IndicBERT, mBERT, and XLM-R models using the Bengali and Telugu languages.
Table 6 shows the results obtained on the MLQA dataset in the few-shot learning setup. Similar to the previous case, our combined model shows promising results in the target language. Table 7 shows the performance of our models on the TyDiQA Bengali and Telugu test sets. The results show the positive influence of Hindi+Bengali and Hindi+Telugu MLM training on the performance of the Bengali/Telugu languages.
Figure 5 shows the training loss of our IndicBERT, mBERT, and XLM-R $_{Large}$ models. For all three models, the training loss reflects the few-shot learning impact of each step of our proposed approach from Section 3.
6. Conclusion
In this article, we proposed an approach to train a model in an LRL using supervised data from English and another language belonging to the same family, particularly for the QA task. Our custom learning approach has shown enhanced performance for the Hindi LRL when QA fine-tuning is carried out in the Bengali and Telugu languages in addition to task learning using English RRL supervised data. Moreover, the results of our experiments on the XLM-R $_{Large}$ , mBERT, and IndicBERT transformer models show that, when parallel data for the LRL is unavailable, joint MLM training with another LRL family language followed by task learning with the RRL and family-LRL improves few-shot performance for the LRL. However, task learning using only the family-LRL appears inefficient due to the limited availability of task-specific data in the family-LRL.
In the few-shot setup, our best-performing XLM-R $_{Large}$ model achieved 80.54/64.12 (F1 score/EM) when Bengali is used as the family language, whereas 79.74/62.44 (F1 score/EM) is observed with Telugu training. For the zero-shot setup, 76.08/60.18 (F1 score/EM) and 75.81/60.17 (F1 score/EM) are the results obtained with Bengali and Telugu, respectively. Using the mBERT model in the few-shot setup, 70.06/53.87 (F1 score/EM) is the best score when Bengali is used as the family language, whereas 69.62/53.36 (F1 score/EM) is observed with Telugu training. For mBERT in the zero-shot setup, 57.87/43.53 (F1 score/EM) and 56.10/43.61 (F1 score/EM) are the results with Bengali and Telugu, respectively. The improvement in the results shows that, with the proposed custom learning approach, learning from the language family is indeed helpful.
Although it is not explored in this paper, we believe that the concept of learning from a language family can be applied to other LRLs as well. The direction of adopting the family-learning technique to other downstream tasks is an avenue for future research.