1. Introduction
Deep neural networks (DNNs) have been extensively adopted in thousands of applications in the past few years (Zhang et al. Reference Zhang, Sheng, Alhazmi and Li2020c); however, several studies have shown that DNNs are highly sensitive to intentional, yet imperceptible perturbations (Szegedy et al. Reference Szegedy, Zaremba, Sutskever, Bruna, Erhan, Goodfellow and Fergus2013; Goodfellow, Shlens, and Szegedy Reference Goodfellow, Shlens and Szegedy2014). In the computer vision (CV) domain, generating adversarial examples usually revolves around slightly perturbing the input images so that the perturbations remain imperceptible to the human eye. In the natural language processing (NLP) domain, on the other hand, the task is trickier due to the discrete nature of natural languages (Zhang et al. Reference Zhang, Sheng, Alhazmi and Li2020c). This poses a barrier to adversarial attack (AA) research in NLP, as the methods applied in CV cannot be seamlessly transferred to textual tasks. For example, if term frequency and inverse document frequency are used to represent tokens, whether they are words, sub-words, or characters, using the backpropagated gradients to perturb these representations would potentially lead to invalid letter or word sequences (Zhao, Dua, and Singh Reference Zhao, Dua and Singh2017), which, in turn, would make it almost impossible even for humans to predict the correct output. The same problem emerges when word embeddings are used as inputs, as the perturbed vectors cannot be matched with any vector in the embedding space.
There are several taxonomies for AA strategies, the most popular of which is based on model knowledge, i.e. access to the architecture, weights, training data, loss function, etc. Based on model knowledge, attacks can be classified into white-box attacks, such as Liang et al. (Reference Liang, Li, Su, Bian, Li and Shi2017); Samanta and Mehta (Reference Samanta and Mehta2018); Rosenberg et al. (Reference Rosenberg, Shabtai, Rokach and Elovici2018); Al-Dujaili et al. (Reference Al-Dujaili, Huang, Hemberg and O’Reilly2018); Cheng et al. (Reference Cheng, Jiang and Macherey2019); Papernot et al. (Reference Papernot, McDaniel, Swami and Harang2016); Sun et al. (Reference Sun, Tang, Yi, Wang and Zhou2018); Ebrahimi et al. (Reference Ebrahimi, Rao, Lowd and Dou2017); Blohm et al. (Reference Blohm, Jagfeld, Sood, Yu and Vu2018), and black-box attacks, such as Jia and Liang (Reference Jia and Liang2017); Wang and Bansal (Reference Wang and Bansal2018); Belinkov and Bisk (Reference Belinkov and Bisk2017); Iyyer et al. (Reference Iyyer, Wieting, Gimpel and Zettlemoyer2018). White-box attack algorithms depend mainly on the gradients backpropagated by a model with respect to its inputs, where perturbations are crafted to generate the worst possible examples within a feasible range by moving in the direction of the gradients (Goodfellow et al. Reference Goodfellow, Shlens and Szegedy2014). Black-box attacks, in contrast, depend on heuristics such as concatenating to, editing, substituting, or paraphrasing the inputs, and are useful when no access to the model’s parameters is available.
Adversarial training (AT), on the other hand, first introduced by Szegedy et al. (Reference Szegedy, Zaremba, Sutskever, Bruna, Erhan, Goodfellow and Fergus2013), aims at building neural networks that are robust to AAs. This technique has proven to be the most effective at developing networks resistant to attacks (Pang et al. Reference Pang, Yang, Dong, Su and Zhu2020; Maini, Wong, and Kolter Reference Maini, Wong and Kolter2020; Schott et al. Reference Schott, Rauber, Bethge and Brendel2018). The first work in this domain (Szegedy et al. Reference Szegedy, Zaremba, Sutskever, Bruna, Erhan, Goodfellow and Fergus2013) simply trained the neural network on a mixture of benign (clean) and malignant (adversarial) examples. This work opened the doors for a revolution in the field, and several frameworks were subsequently proposed to enhance the robustness of models against attacks. Goodfellow et al. (Reference Goodfellow, Shlens and Szegedy2014) were the first to propose exploiting the gradients obtained by differentiating the loss function with respect to the inputs to generate perturbed examples on the fly, via a method called the fast gradient sign method (FGSM). However, one of the weaknesses of FGSM is its linear approximation of the loss function, which does not improve the model’s resilience to iterative attacks (Tramèr et al. Reference Tramèr, Kurakin, Papernot, Goodfellow, Boneh and McDaniel2017). This approximation leads to a sharp curvature, known as gradient masking (Papernot et al. Reference Papernot, McDaniel, Goodfellow, Jha, Celik and Swami2017), in the vicinity of data points on the decision surface of the trained models. This issue encouraged further work on AA frameworks that simulate the worst possible realistic adversaries, as will be discussed in the Related Works section.
With regard to employing AT in Arabic NLP, there is an apparent shortage of research aiming at building robust deep learning (DL) frameworks. Most of the studies in the literature targeted the development of multi-lingual and cross-lingual DL models that utilize the concept of deep domain adaptation (DDA) proposed by Ganin and Lempitsky (Reference Ganin and Lempitsky2015) to enable knowledge transfer from models trained on high-resource languages to Arabic (Joty et al. Reference Joty, Nakov, Màrquez and Jaradat2017; Zalmout and Habash Reference Zalmout and Habash2019; Gupta Reference Gupta2021; Goyal, Singh, and Kumar Reference Goyal, Singh and Kumar2021).
One of the NLP tasks that have not been covered by AT studies in the Arabic language is sentiment analysis (SA). SA is the task of determining the affective states expressed in a given text (Darwish et al. Reference Darwish, Habash, Abbas, Al-Khalifa, Al-Natsheh, Bouamor, Bouzoubaa, Cavalli-Sforza, El-Beltagy and El-Hajj2021). SA has received great attention in the NLP community as one of the most important applications of the field and has been incorporated in several areas including business analysis (Han et al. Reference Han, Liu, Yang and Jiang2019; Reference Han, Xiao, Wu, Guo, Xu and Wang2021), review analysis (Bose et al. Reference Bose, Dey, Roy and Sarddar2020), healthcare (Clark et al. Reference Clark, James, Jones, Alapati, Ukandu, Danforth and Dodds2018), and stock market analysis (Xing, Cambria, and Welsch Reference Xing, Cambria and Welsch2018). The significance of SA is mainly attributed to its aim of automatically analyzing the psychological state, satisfaction, or impression of the user based on their responses (Wankhade, Rao, and Kulkarni Reference Wankhade, Rao and Kulkarni2022). This, in turn, enables organizations and entities to apply improvements in the areas that matter to their customers rather than spreading expenses over a wide range of potential, yet uncertain, areas of improvement.
In this work, we aim to investigate the robustness of models trained on a standard basis against the worst synonym replacement attack. After proving the effectiveness of this attack, we design three scenarios to train robust models. The three scenarios are as follows: applying perturbations to the inputs and training the models on adversarial samples only, applying perturbations on both the inputs and the models’ trainable weights and training based on these perturbations, and training on clean and perturbed inputs in conjunction with the weight perturbation.
The remainder of this paper is organized as follows. Section 2 reviews previous work in four main areas, namely adversarial attacks in NLP, adversarial training, the use of adversarial training in Arabic NLP, and Arabic sentiment analysis. Section 3 describes the materials and methods, including the dataset, the data preprocessing steps, sequence tokenization and word embeddings, the deep learning models and training settings, the evaluation metrics, the adversarial attack and its evaluation, and adversarial training. Section 4 presents the results and discussion for both the standard and adversarial training settings. Finally, Section 5 contains the conclusion and future directions.
2. Related works
2.1 Adversarial attack in NLP
In the DL field, AA is the strategy of systematically adding slight perturbations to the inputs of a pre-trained neural network so that the network is driven to generate incorrect predictions for the perturbed inputs. This area of research has emerged due to the difficulty of interpreting the outputs of neural networks and of defining the kind of knowledge each neuron has learned, which, in turn, makes the process of evaluating the robustness of a network harder (Zhang et al. Reference Zhang, Sheng, Alhazmi and Li2020c).
The first work that investigated the robustness of a state-of-the-art image classification DL model was the one proposed by Szegedy et al. (Reference Szegedy, Zaremba, Sutskever, Bruna, Erhan, Goodfellow and Fergus2013). The study demonstrated the over-sensitivity of this model to pixel values: the perturbations added to the images were undetectable to humans, yet the network confidently predicted the wrong label for the perturbed images. In order to reduce the cost of generating adversarial examples, Goodfellow et al. (Reference Goodfellow, Shlens and Szegedy2014) proposed FGSM, which depends on the fast generation of adversarial images based on the backpropagated gradients with respect to the inputs.
With regard to the NLP domain, Jia and Liang (Reference Jia and Liang2017) were the first to evaluate DL models on adversarial examples. Their method depended on concatenating a distracting, yet meaningless, collection of words to the end of a given text that is used to train a model to solve a question-answering task. One main restriction used to filter out the appended sentences is that the new fooling collection of words must not change the semantics of the main text or alter the answers to the questions. The concatenated sentence can be either a carefully crafted sentence or an arbitrary sequence of tokens randomly selected from a pool of 20 random frequently used tokens. Wang and Bansal (Reference Wang and Bansal2018) proposed expanding the number of fake answers and changing the positions where distracting sentences are added in order to build more robust DL models.
Another black-box word-level attack that is based on sentence concatenation is the one proposed by Blohm et al. (Reference Blohm, Jagfeld, Sood, Yu and Vu2018). In this work, the generated sentence was formed by randomly selecting words from all the words in the incorrect answers in conjunction with a pool of 10 random common words and all the words in the question.
Editing the input sequences is another black-box technique used for generating AA. In their work, Belinkov and Bisk (Reference Belinkov and Bisk2017) targeted neural machine translation models. The inputs to these models were perturbed in one of three ways: leveraging typos that already exist in textual datasets, altering the order of all the characters of a word except for the first and last characters, or fully replacing the characters of words with random ones. Niu and Bansal (Reference Niu and Bansal2018) approached the dialogue generation task and used multiple perturbation methods including the following: random swapping by randomly replacing neighboring words, paraphrasing, randomly dropping stop-words, intentionally using the wrong tense form of randomly selected verbs, and using the negations and the antonyms. Gao et al. (Reference Gao, Lanchantin, Soffa and Qi2018) used four strategies in generating adversarial examples, namely addition, deletion, replacement, and swapping. The altered words or characters were selected based on their importance which was measured by means of a scoring function that leverages the classifier’s output. A probabilistic method for generating adversarial examples has been proposed by Ren et al. (Reference Ren, Deng, He and Che2019). They first started by creating a dictionary that contains the synonyms of all the words in an existing corpus. Then, a set of proposed words are substituted with the synonyms that most fool the network such that the replacements make the most impact on the classification probability. The replacement order is defined by means of word saliency. Zhang et al. (Reference Zhang, Zhou, Miao and Li2020a) proposed the Metropolis-Hastings (M-H) Attack. They targeted language models in their work and employed M-H sampling to generate the replacing words and the random words for both the replacement and insertion operations, respectively.
Other works in AA were based on white-box techniques, which require access to information about the model’s architecture, loss function, inputs, and outputs in order to extract the gradients that can later be used to generate the perturbations. TextFool (Liang et al. Reference Liang, Li, Su, Bian, Li and Shi2017) is one of the leading techniques in the context of textual AA and is inspired by FGSM (Goodfellow et al. Reference Goodfellow, Shlens and Szegedy2014). However, in this work the authors use the gradient magnitudes rather than their signs to create the adversaries and propose three methods for AA generation: modification, deletion, and insertion. Based on the absolute backpropagated gradients, the authors define the hot characters, which are the characters with the highest magnitudes and hence the highest probability of impacting the model. They then define the hot training phrases (HTPs) as phrases that contain enough hot characters with a sufficient frequency of occurrence. During the insertion process, the AA is created by inserting a small proportion of HTPs of the wrong class targeted by the AA (C$^{\prime}$) in the vicinity of the hot characters that contribute the most to the ground-truth class C. For the modification version, the authors replace the characters of HTPs with randomly selected characters or characters that are visually similar.
Samanta and Mehta (Reference Samanta and Mehta2018) have shown better results compared to TextFool (Liang et al. Reference Liang, Li, Su, Bian, Li and Shi2017) by performing some modifications to the algorithm. They first start by removing the adverb ( $w_i$ ) that contributes the most to the predicted label. If the generated text contains grammatical mistakes due to the removal operation, a word ( $p_i$ ) is inserted before ( $w_i$ ) where ( $p_i$ ) is selected from a predefined pool that contains synonyms, genre keywords retrieved by means of term frequency, and typos. Finally, if the inserted ( $p_i$ ) fails to increase the value of the loss function, ( $w_i$ ) is replaced with ( $p_i$ ).
In their work, Al-Dujaili et al. (Reference Al-Dujaili, Huang, Hemberg and O’Reilly2018) proposed the generation of binary-encoded AAs. Four bounding methods were used in this work to generate the perturbations, two of which depended on FGSM$^k$ (Kurakin, Goodfellow, and Bengio Reference Kurakin, Goodfellow and Bengio2016) with the deterministic rounding (dFGSM$^k$) and the randomized rounding (rFGSM$^k$) versions. These two methods are identical to the $L_\infty$-ball method (Goodfellow et al. Reference Goodfellow, Shlens and Szegedy2014) used to constrain the perturbations for CV tasks. The third and fourth methods employ multi-step bit gradient ascent and multi-step bit coordinate ascent, respectively.
2.2 Adversarial training
The goal of using AT is to produce neural networks robust to attacks. Several techniques have been proposed in this area of study and most of these techniques focus on training the neural network on adversarial examples. As mentioned in Subsection “Adversarial Attack in NLP,” Szegedy et al. (Reference Szegedy, Zaremba, Sutskever, Bruna, Erhan, Goodfellow and Fergus2013) were the first to employ training on both clean and adversarial samples. This work was followed by the emergence of FGSM (Goodfellow et al. Reference Goodfellow, Shlens and Szegedy2014) that proposed an efficient and fast way to generate adversaries and train the model on these crafted examples. The linear approximation of the loss function was one of the disadvantages of this method that becomes more apparent when testing the model on iterative attacks (Tramèr et al. Reference Tramèr, Kurakin, Papernot, Goodfellow, Boneh and McDaniel2017).
Apart from the previous works where models were trained on a mixture of clean and perturbed data, a stream of research has taken the path of training on adversarial examples only, such as Huang et al. (Reference Huang, Xu, Schuurmans and Szepesvári2015) and Shaham et al. (Reference Shaham, Yamada and Negahban2018). However, training on adversarial examples only renders the model vulnerable to overfitting these examples (Zhang and Xu Reference Zhang and Xu2019), especially with relatively strong adversaries that generate samples which cross the decision boundaries and resemble natural samples, yet belong to different classes.
Several frameworks were proposed to address the problem of overfitting to adversarial examples, including Curriculum AT (Cai et al. Reference Cai, Du, Liu and Song2018), which gradually increases the number of projected gradient descent (PGD) (Madry et al. Reference Madry, Makelov, Schmidt, Tsipras and Vladu2017) steps to generate stronger attacks during the course of training until the model learns weights sufficient to overcome the attacks. Zhang et al. (Reference Zhang, Xu, Han, Niu, Cui, Sugiyama and Kankanhalli2020b) proposed Friendly AT, which employs early stopping when searching for adversarial examples using PGD so as to generate friendly adversarial data that minimizes the adversarial loss rather than maximizing it. They concluded that adversarial robustness does not contradict natural generalization.
Other AT methods rely on adding a regularization term to the loss function that minimizes the distance between the clean and perturbed examples. Zhang et al. (Reference Zhang, Yu, Jiao, Xing, El Ghaoui and Jordan2019) proposed TRadeoff-inspired Adversarial DEfense via Surrogate-loss minimization (TRADES), a regularized loss function that penalizes adversarial examples whose distribution has a large Kullback-Leibler divergence (Kullback and Leibler, Reference Kullback and Leibler1951) from the distribution of the clean examples. This regularized loss does not take into consideration whether or not the benign samples have been classified correctly by the model. Wang et al. (Reference Wang, Zou, Yi, Bailey, Ma and Gu2019) proposed Misclassification Aware adveRsarial Training (MART), which incorporates the probability of the ground-truth label generated by the model so as to place greater emphasis on samples that are misclassified in the clean setting.
One of the most commonly used AT techniques in the natural language processing domain is DDA, proposed by Ganin and Lempitsky (Reference Ganin and Lempitsky2015), which aims at learning domain-invariant, yet discriminative, representations. In this algorithm, the model is trained on two datasets, one labeled and one unlabeled, and is equipped with two heads: a predictor for the main task and a discriminator that distinguishes between the sources of the samples. One of the critical factors for the success of this algorithm is the gradient reversal layer, which simply multiplies the gradients derived from the domain classification head by a negative scalar to encourage the feature extractor to learn domain-invariant features.
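To illustrate the mechanism, the following is a minimal sketch of a gradient reversal layer in TensorFlow/Keras; it is our own illustrative implementation rather than code from any of the cited works, and the scaling factor `lam` stands in for the usual adaptation coefficient.

```python
import tensorflow as tf

class GradientReversalLayer(tf.keras.layers.Layer):
    """Identity in the forward pass; multiplies incoming gradients by -lam
    in the backward pass, as described by Ganin and Lempitsky (2015)."""
    def __init__(self, lam=1.0, **kwargs):
        super().__init__(**kwargs)
        self.lam = lam

    def call(self, inputs):
        @tf.custom_gradient
        def _reverse(x):
            def grad(dy):
                return -self.lam * dy
            return tf.identity(x), grad
        return _reverse(inputs)
```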
Finally, in this work, we use one of the most successful AT techniques, which has proven its effectiveness on several benchmark datasets, namely Adversarial Weight Perturbation-based AT (AT-AWP) proposed by Wu et al. (Reference Wu, Xia and Wang2020). The algorithm produces a double-perturbation effect on the network by iteratively perturbing both the input samples and the model weights through PGD.
2.3 Using adversarial training in Arabic NLP
There is a severe shortage in the literature of works employing AT in Arabic NLP. Most of these works use DDA to approach multi-lingual or cross-lingual tasks by enabling knowledge transfer from a high-resource language with plenty of labeled datasets to a low-resource language with mostly unlabeled datasets (Joty et al. Reference Joty, Nakov, Màrquez and Jaradat2017; Zalmout and Habash Reference Zalmout and Habash2019; Gupta Reference Gupta2021; Goyal et al. Reference Goyal, Singh and Kumar2021).
Among those works, Chen et al. (Reference Chen, Sun, Athiwaratkun, Cardie and Weinberger2018) exploited the availability of abundant annotated resources for sentiment classification in the English language by employing the AT algorithm proposed by Miyato et al. (Reference Miyato, Dai and Goodfellow2016) to learn discriminative, language-invariant features that can transfer to other low-resource languages. The model performance was evaluated using two unlabeled Arabic and Chinese datasets, where the language discriminator was trained to distinguish between the labeled and unlabeled samples; meanwhile, the sentiment classification head was trained to predict the polarity of the input samples.
Qin et al. (Reference Qin, Chen, Tian and Song2021) proposed the employment of a regularized decoder and AT to approach the Arabic diacritization problem. Previous works treated the auto-generated diacritics as gold labels, whereas a high fraction of these labels is inaccurate and would potentially lead to a model producing deficient representations. In this work, the authors developed a model that employed AT inspired by Ganin and Lempitsky (Reference Ganin and Lempitsky2015) to balance the information learned from both the main tagger and the regularized decoder.
Joty et al. (Reference Joty, Nakov, Màrquez and Jaradat2017) designed an adversarial training-based framework inspired by Ganin and Lempitsky (Reference Ganin and Lempitsky2015) to learn discriminative, language-invariant representations for the task of question-question similarity re-ranking. The proposed model receives a batch of labeled and unlabeled inputs from two different languages and is encouraged, through the label classifier network, to discriminate the labeled samples from the unlabeled ones. At the same time, a gradient reversal layer is adopted to encourage the model to learn language-invariant features that can fool the language discriminator.
Zalmout and Habash (Reference Zalmout and Habash2019) employed both adversarial training and multitask learning in addressing the data sparsity problem in the Arabic language that stems from both morphological richness and dialectical variations. This paper was the first to use adversarial domain adaptation (Ganin and Lempitsky Reference Ganin and Lempitsky2015; Ganin et al. Reference Ganin, Ustinova, Ajakan, Germain, Larochelle, Laviolette, Marchand and Lempitsky2016) in the field of dialectical morphological adaptation. The proposed framework was evaluated on two datasets including Modern Standard Arabic (MSA) and the Egyptian dialect.
Gupta (Reference Gupta2021) investigated the possibility of leveraging unlabeled data from different languages to improve performance on the multi-label emotion recognition task. They formulated a semi-supervised Virtual Adversarial Training (VAT) (Miyato et al. Reference Miyato, Maeda, Koyama and Ishii2018) problem and investigated the improvement in the performance of a target-language classification task obtained by leveraging unlabeled datasets of other low-resource languages.
Goyal et al. (Reference Goyal, Singh and Kumar2021) approached the task of word in context disambiguation proposed by SemEval-2021 Task 2 (Martelli et al. Reference Martelli, Kalach, Tola and Navigli2021) in both multi-lingual and cross-lingual settings via a single XLM RoBERTa (Conneau et al. Reference Conneau, Khandelwal, Goyal, Chaudhary, Wenzek, Guzmán, Grave, Ott, Zettlemoyer and Stoyanov2019) model. The authors reported a significant boost in performance of the model when employing the adversarial training stage proposed by Miyato et al. (Reference Miyato, Dai and Goodfellow2016), which simply perturbs the word embeddings during training using the calculated gradients with respect to the embedding vectors. The authors added a simple modification to the adversarial training algorithm by skipping the embedding normalization process which they believed would affect the semantic meaning of the pre-trained word embeddings.
Based on this literature review, we note the lack of use of AT techniques in the Arabic NLP domain to build robust frameworks and the need to thoroughly investigate the applicability of such techniques in this domain.
2.4 Arabic sentiment analysis
Sentiment analysis (SA) is the task of determining the affective states expressed in a given text (Darwish et al. Reference Darwish, Habash, Abbas, Al-Khalifa, Al-Natsheh, Bouamor, Bouzoubaa, Cavalli-Sforza, El-Beltagy and El-Hajj2021). Due to its important applications, SA has been studied extensively in the literature. These applications include business analysis (Han et al. Reference Han, Liu, Yang and Jiang2019; Bose et al. Reference Bose, Dey, Roy and Sarddar2020), review analysis (Mackey, Miner, and Cuomo Reference Mackey, Miner and Cuomo2015), healthcare (Clark et al. Reference Clark, James, Jones, Alapati, Ukandu, Danforth and Dodds2018), and stock market analysis (Xing et al. Reference Xing, Cambria and Welsch2018). The significance of SA is mainly attributed to its aim of automatically analyzing the psychological state, satisfaction, or impression of the user based on their responses (Wankhade et al. Reference Wankhade, Rao and Kulkarni2022).
Several works have approached the task of SA in the Arabic language domain; however, these studies still report challenges arising from the nature of the language. These challenges, including morphological richness, orthographic ambiguity, orthographic inconsistency, and resource poverty, hinder research progress in Arabic NLP (Darwish et al. Reference Darwish, Habash, Abbas, Al-Khalifa, Al-Natsheh, Bouamor, Bouzoubaa, Cavalli-Sforza, El-Beltagy and El-Hajj2021). For example, Khalifa et al. (Reference Khalifa, Habash, Abdulrahim and Hassan2016) reported a drop in the performance of the state-of-the-art tool for part-of-speech tagging and lemmatization from 96% on MSA on both tasks (Pasha et al. Reference Pasha, Al-Badrashiny, Diab, El Kholy, Eskander, Habash, Pooleery, Rambow and Roth2014) to around 72% and 64%, respectively, on the Arabic Gulf dialect. Orthographic inconsistency, on the other hand, plays a major role in impeding progress in Arabic NLP. The great variation in spelling the same words makes the task of indexing and tokenization much harder, and researchers need to spend tremendous amounts of time on spell checking before applying the typical text processing stages. According to Zaghouani et al. (Reference Zaghouani, Mohit, Habash, Obeid, Tomeh, Rozovskaya, Farra, Alkuhlani and Oflazer2014), one-third of all MSA words available online are misspelled.
Early efforts in Arabic SA revolved around the collection and creation of the resources required to approach this task including the labeled datasets, sentiment treebanks, and sentiment lexicons (Abdul-Mageed and Diab Reference Abdul-Mageed and Diab2012; Mourad and Darwish Reference Mourad and Darwish2013; Badaro et al. Reference Badaro, Baly, Hajj, Habash and El-Hajj2014; Refaee and Rieser Reference Refaee and Rieser2014; ElSahar and El-Beltagy Reference ElSahar and El-Beltagy2015; Eskander and Rambow Reference Eskander and Rambow2015; Khalil et al. Reference Khalil, Halaby, Hammad and El-Beltagy2015; Salameh, Mohammad, and Kiritchenko Reference Salameh, Mohammad and Kiritchenko2015; Shoukry and Rafea Reference Shoukry and Rafea2015; El-Beltagy Reference El-Beltagy2016; Baly et al. Reference Baly, Hajj, Habash, Shaban and El-Hajj2017). Moreover, some efforts have been put to create benchmarks against which different approaches can be fairly compared (Mohammad et al. Reference Mohammad, Bravo-Marquez, Salameh and Kiritchenko2018; Rosenthal, Farra, and Nakov Reference Rosenthal, Farra and Nakov2019).
Arabic SA has been approached using three main methods: hand-crafted rules and lexicons (El-Beltagy and Ali Reference El-Beltagy and Ali2013; Abdul-Mageed and Diab Reference Abdul-Mageed and Diab2014; Badaro et al. Reference Badaro, Baly, Hajj, Habash and El-Hajj2014), machine learning algorithms (Baly et al. Reference Baly, Hajj, Habash, Shaban and El-Hajj2017; Badaro et al. Reference Badaro, El Jundi, Khaddaj, Maarouf, Kain, Hajj and El-Hajj2018; Farha and Magdy Reference Farha and Magdy2019), and hybrid frameworks that combine the first two approaches (Al-Smadi et al. Reference Al-Smadi, Al-Ayyoub, Jararweh and Qawasmeh2019a). More recent approaches (Abdul-Mageed et al. Reference Abdul-Mageed, Zhang, Elmadany and Ungar2020; Antoun, Baly, and Hajj Reference Antoun, Baly and Hajj2020), which fine-tune large pre-trained models such as AraBERT (Antoun et al. Reference Antoun, Baly and Hajj2020), have reported state-of-the-art results in Arabic SA.
DL has been extensively used to solve SA tasks in the English language. One of the main advantages of DL models is the possibility of building an end-to-end, fully automated framework that can infer the important features needed to achieve a task like SA, which is, at its core, a conventional classification task. This removes the need for domain experts to handcraft the rules by which the SA task is accomplished (Badaro et al. Reference Badaro, Baly, Hajj, El-Hajj, Shaban, Habash, Al-Sallab and Hamdi2019). As a result, different DL architectures have been used in the English NLP domain, spanning a wide range of sophisticated trainable layers, including recursive neural networks (RNNs) (Socher et al. Reference Socher, Perelygin, Wu, Chuang, Manning, Ng and Potts2013), convolutional neural networks (CNNs) (Kalchbrenner, Grefenstette, and Blunsom Reference Kalchbrenner, Grefenstette and Blunsom2014), gated recurrent neural networks (GRNNs) (Tang, Qin, and Liu Reference Tang, Qin and Liu2015), dynamic memory networks (DMNs) (Kumar et al. Reference Kumar, Irsoy, Ondruska, Iyyer, Bradbury, Gulrajani, Zhong, Paulus and Socher2016), and the human reading for sentiment framework (Baly et al. Reference Baly, Hobeica, Hajj, El-Hajj, Shaban and Al-Sallab2016).
Inspired by this literature on English SA, several Arabic SA studies have emerged, starting with Al Sallab et al. (Reference Al Sallab, Hajj, Badaro, Baly, El-Hajj and Shaban2015), who evaluated different DL models, including DNNs, deep belief networks (DBNs), and deep autoencoders (DAEs), trained on word n-grams. Their work outperformed the previous Arabic state-of-the-art benchmark SVM models; however, the models severely suffered from the data sparsity problem, i.e. the vast majority of the corpus tokens occur too rarely for the model to acquire a thorough understanding of the language. This motivated the use of Arabic word embeddings.
Al-Sallab et al. (Reference Al-Sallab, Baly, Hajj, Shaban, El-Hajj and Badaro2017) proposed a recursive deep learning model for opinion mining in Arabic (AROMA), which was introduced as a solution for the problems that the authors believed were the causes of the gap between the performance of models on Arabic and English SA tasks. The authors noted a number of limitations of the Arabic language as compared to English, including lexical sparsity and non-standardized dialects, which cause different spellings for the same word sense.
CNNs have been exploited for Arabic SA in many works (Dahou et al. Reference Dahou, Xiong, Zhou, Haddoud and Duan2016; Gridach, Haddad, and Mulki Reference Gridach, Haddad and Mulki2017; Alayba et al. Reference Alayba, Palade, England and Iqbal2018b) utilizing the concept of learnable word embeddings. Alayba et al. (Reference Alayba, Palade, England and Iqbal2017, Reference Alayba, Palade, England and Iqbal2018a) compared the performance of CNN and long short-term memory (LSTM) layers by feeding them the features extracted from trainable word, character, and character n-gram embeddings. The accuracy of CNNs was reported to be significantly better than that of LSTMs: 90% compared to 85%, respectively.
Al-Azani and El-Alfy (Reference Al-Azani and El-Alfy2018) compared the performance of LSTMs and GRNNs in both the unidirectional and bidirectional settings. They reported that bidirectional LSTMs outperform traditional classification methods when emojis are inserted into the inputs. Al-Azani and El-Alfy (Reference Al-Azani and El-Alfy2017) also studied and evaluated the performance of LSTMs, GRNNs, CNNs, and their combinations on the Arabic SA task. A hybrid system of LSTMs and CNNs outperformed the other architectures on two datasets.
Others prefer to process the whole sequence as a single unit by generating paragraph-based embeddings rather than token-based embeddings (Barhoumi et al. Reference Barhoumi, Estève, Aloulou and Belguith2017; Abdullah and Shaikh Reference Abdullah and Shaikh2018). Barhoumi et al. (Reference Barhoumi, Estève, Aloulou and Belguith2017) employed the sentiment-annotated large-scale Arabic book reviews (LABR) corpus (Aly and Atiya Reference Aly and Atiya2013) and Doc2Vec (Le and Mikolov Reference Le and Mikolov2014) to generate paragraph embeddings for Arabic content and then built a classifier (MLP or logistic regression) on top of these feature vectors.
LSTMs (unidirectional and bidirectional), GRNNs, and CNNs were popular choices for Arabic SA before the advent of transformers. González et al. (Reference González, Pla and Hurtado2017); Al-Smadi et al. (Reference Al-Smadi, Talafha, Al-Ayyoub and Jararweh2019b); Barhoumi et al. (Reference Barhoumi, Estève, Aloulou and Belguith2017); El-Beltagy et al. (Reference El-Beltagy, Khalil, Halaby and Hammad2017, Reference El-Beltagy, Kalamawy and Soliman2016) used CNNs, LSTMs, or a hybrid of both to generate the features used by the end classifier for Arabic SA.
Several techniques have been proposed to improve models’ performance on SA tasks, such as the one suggested by Duwairi and Abushaqra (Reference Duwairi and Abushaqra2021). In this paper, the authors designed a novel framework to augment the dataset used in Arabic SA by taking advantage of the morphological richness of the language. The authors predefined 23 transformation rules based on which the dataset was augmented. The algorithm was reported to increase the size of the initial seed dataset tenfold with a significant increase in the model’s accuracy.
Based on this literature review, and to the best of our knowledge, very few works have employed AT in the Arabic language domain. These works were designed to develop domain-invariant models capable of transferring knowledge from one domain to another rather than employing AT to develop models robust to adversarial attacks. In this paper, we aim, in the first stage, to build several DL models, train them under standard conditions to solve an Arabic SA task, and then evaluate their performance on previously unseen data, both clean and perturbed. The kind of perturbation used in this work is the fast gradient projection method proposed by Wang et al. (Reference Wang, Yang, Deng and He2021), with some modifications to cope with the scarcity of Arabic resources. In the second stage, we train the same models adversarially and evaluate their performance on both clean and adversarial examples that have not been passed to the model during training. The method selected for AT is adversarial weight perturbation proposed by Wu et al. (Reference Wu, Xia and Wang2020), which has recently proved its effectiveness in several Kaggle NLP competitions.
3. Materials and methods
3.1 Dataset
The dataset used in this work is the LABR dataset collected by Aly and Atiya (Reference Aly and Atiya2013). The dataset consists of over 63,000 Arabic book reviews, where each textual review is associated with a rating between 1 and 5. The dataset has been shuffled and split into training and test sets by the authors to form a benchmark against which different approaches can be fairly evaluated. Moreover, the authors established additional criteria for splitting the dataset, including whether the target labels are balanced and the number of classes to consider.
In this work, we select the balanced partitioning, which contains 16,448 samples. The authors balanced the dataset by downsampling, which limits the number of samples in each class to that of the class with the fewest samples. For the number of classes, we choose to treat the task as binary classification, where each sequence is classified as either positive or negative. In this case, ratings 4 and 5 are merged into class 1 (positive), ratings 1 and 2 are merged into class 0 (negative), and rating 3, which indicates the neutral class, is omitted.
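A minimal sketch of this label mapping is shown below, assuming the reviews are loaded into a pandas DataFrame with a `rating` column (a hypothetical column name).

```python
import pandas as pd

def binarize_ratings(df: pd.DataFrame, rating_col: str = "rating") -> pd.DataFrame:
    """Map 5-point ratings to binary sentiment: 4-5 -> 1 (positive),
    1-2 -> 0 (negative); neutral reviews (rating 3) are dropped."""
    df = df[df[rating_col] != 3].copy()
    df["label"] = (df[rating_col] >= 4).astype(int)
    return df
```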
After preprocessing the data as explained in the Data Preprocessing subsection, the clean dataset available for training the DL models was reduced to 14,309 samples. Besides the original splitting of the dataset into training and test sets, we extract a validation split from the training data to allow fair hyperparameter tuning on this split and unbiased evaluation on the test split. Eventually, we ended up with training, validation, and test splits of 9,981, 1,462, and 2,902 samples, respectively. The validation set was randomly selected after shuffling the training data, with label balance in the validation split secured via stratified K-fold cross-validation.
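The validation split can be carved out as in the following sketch; the fold count shown is only illustrative of holding out roughly the reported 1,462 samples.

```python
from sklearn.model_selection import StratifiedKFold

def train_val_split(texts, labels, n_splits=8, seed=42):
    """Hold out one stratified fold (about 1/n_splits of the shuffled training data)
    as the validation split, preserving the label balance."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    train_idx, val_idx = next(skf.split(texts, labels))
    return train_idx, val_idx
```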
3.2 Data preprocessing
First of all, we remove long sequences that contain over 128 tokens after tokenization, which is performed using the natural language toolkit (NLTK) (Bird, Klein, and Loper Reference Bird, Klein and Loper2009). Removing long sequences reduces the computational load and allows for faster iteration. Figure 1 shows the percentages of long and short sequences in both the training and test splits. As explained in the Dataset subsection, we end up with 11,443 examples in the training set (before splitting it into train and validation splits) and 2,902 examples in the test set.
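A minimal sketch of this filtering step, using the NLTK word tokenizer (the 128-token threshold is the one used in this work; the function and variable names are ours):

```python
from nltk.tokenize import word_tokenize  # requires nltk.download("punkt")

MAX_LEN = 128

def filter_long_reviews(reviews, labels, max_len=MAX_LEN):
    """Drop reviews whose NLTK token count exceeds max_len."""
    kept_reviews, kept_labels = [], []
    for text, label in zip(reviews, labels):
        if len(word_tokenize(text)) <= max_len:
            kept_reviews.append(text)
            kept_labels.append(label)
    return kept_reviews, kept_labels
```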
After that, we remove the Arabic diacritics from the reviews, as these tokens are not represented in AraVec (Soliman, Eissa, and El-Beltagy Reference Soliman, Eissa and El-Beltagy2017), which contains the pre-trained word embeddings as will be explained later. Then, all the unique characters in the training and test splits were retrieved, and we ended up with 93 and 92 unique characters in the two splits, respectively. A list containing all 45 Arabic characters was created, and all non-Arabic characters in both splits were detected based on this list and replaced with a space, which represents the separator token. There are actually 28 base letters in the Arabic language; however, some characters exist in different forms, such as the alef “ا”–“A”, which can also be found as “أ”–“>” and “إ”–“<”.
Any character that appears more than twice consecutively in the same word was replaced with a single occurrence, as this cannot take place in the Arabic language; this process included spaces. Furthermore, some characters in the dataset were written in forms different from the ones in the pre-collected character list, for example, presentation forms of “ت”–“t” and “ع”–“E”, which would eventually lead to out-of-vocabulary (OOV) tokens when tokenizing and embedding. Moreover, for the sake of normalization, AraVec replaces a set of characters with similar ones: “ة”–“p” is always replaced with “ه”–“h”, “أ”–“>” and “إ”–“<” are always replaced with “ا”–“A”, and “ؤ”–“&” and “ئ”–“}” are always replaced with “ء”–“'”. Hence, we perform all these normalizations in order to minimize OOV tokens when embedding our sequences.
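The preprocessing described above can be sketched as follows; the Unicode ranges and the mapping below are our approximation of this AraVec-style normalization, not the authors' exact lists.

```python
import re

DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")                 # tashkeel marks
CHAR_MAP = {"أ": "ا", "إ": "ا", "آ": "ا", "ة": "ه", "ؤ": "ء", "ئ": "ء"}
ARABIC_OR_SPACE = re.compile(r"[\u0621-\u064A ]")

def normalize(text: str) -> str:
    text = DIACRITICS.sub("", text)                               # strip diacritics
    text = "".join(CHAR_MAP.get(ch, ch) for ch in text)           # unify letter variants
    text = "".join(ch if ARABIC_OR_SPACE.fullmatch(ch) else " " for ch in text)
    text = re.sub(r"(.)\1{2,}", r"\1", text)                      # collapse 3+ repeats (incl. spaces)
    return re.sub(r" {2,}", " ", text).strip()
```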
3.3 Sequence tokenization and word embeddings
First, all input sequences were tokenized using the NLTK tokenizer (Bird et al. Reference Bird, Klein and Loper2009). We then added two special tokens to the dictionary, namely a padding token and an unknown (OOV) token. After that, each token was given an index, starting from 0 and ending with the number of tokens in AraVec (Soliman et al. Reference Soliman, Eissa and El-Beltagy2017) plus the two special tokens. AraVec contains Arabic word embeddings pre-trained on Arabic corpora using the word2vec algorithm (Mikolov et al. Reference Mikolov, Chen, Corrado and Dean2013). There are several versions of AraVec; in this work, we employ the Twitter-CBOW version, where each token is represented by a 300-d vector.
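A sketch of this step is shown below; it assumes the Twitter-CBOW AraVec model is available as a gensim Word2Vec file (the file name and the choice of indices 0 and 1 for the special tokens are illustrative assumptions).

```python
import numpy as np
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize

aravec = Word2Vec.load("full_grams_cbow_300_twitter.mdl")   # illustrative file name

PAD, UNK = "<pad>", "<unk>"
vocab = {PAD: 0, UNK: 1}
vocab.update({tok: i + 2 for i, tok in enumerate(aravec.wv.index_to_key)})

# Embedding matrix: rows 0 (pad) and 1 (unk) stay zero, the rest are the 300-d AraVec vectors.
emb_matrix = np.zeros((len(vocab), 300), dtype=np.float32)
emb_matrix[2:] = aravec.wv.vectors

def encode(text, max_len=128):
    """Tokenize with NLTK, map tokens to indices, truncate/pad to max_len."""
    ids = [vocab.get(tok, vocab[UNK]) for tok in word_tokenize(text)][:max_len]
    return ids + [vocab[PAD]] * (max_len - len(ids))
```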
3.4 Deep learning models and training settings
Seven different DL models are used in this study: LSTM-based (Hochreiter and Schmidhuber Reference Hochreiter and Schmidhuber1997), gated recurrent unit (GRU)-based (Chung et al. Reference Chung, Gulcehre, Cho and Bengio2014), CNN-based (LeCun et al. Reference LeCun, Bottou, Bengio and Haffner1998), multi-headed self-attention (Vaswani et al. Reference Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser and Polosukhin2017) (MHA)-based, LSTM-CNN-based, GRU-CNN-based, and MHA-CNN-based models. Figure 2 illustrates the model architectures used in this work. The architectures, in general, begin with the embedding layer, and AraVec’s pre-trained weights are used to initialize the weights of this layer.
For the first architecture, shown in Figure 2 (left), we pass these embeddings to one of three layers: LSTM, GRU, or CNN. For the LSTM and GRU variants, we use the bidirectional setting with two sequential layers of 128 and 64 units, respectively. These are followed by a global average pooling layer and a global max pooling layer, which reduce the dimensionality of the previous output and generate a 1D representation for each input sequence; the pooled representations are then concatenated. Using both pooling layers and concatenating their outputs, rather than relying on one of them, gives the model a better chance to learn features that one of the two layers might not easily extract on its own. The outputs are then passed to a fully connected layer with 64 units and ReLU activation. Finally, a head classifier with two units and softmax activation is used to predict the probability of each class given the input sample.
For the CNN-based model, the same architecture is used with the LSTM/GRU layers replaced by two sequential 1D CNN layers. The two layers have 128 kernels of size 3 and ReLU activation. The remaining part of the architecture is the same as in the LSTM and GRU variants. For the MHA-based model, we replace the LSTM layers with two MHA layers, the first of which receives its inputs from the summation of the embedded vectors and sinusoidal positional encodings generated in the same way as in Vaswani et al. (Reference Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser and Polosukhin2017). The first and second MHA layers both have eight heads, with key dimensions of 128 and 64, respectively. Finally, three additional architectures are used that augment the LSTM-based, GRU-based, and MHA-based models with an additional 1D CNN layer between the embedding layer and these layers. This 1D CNN layer has 128 kernels of size 3 and is ReLU-activated.
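As an example, the LSTM-based variant can be sketched in Keras as follows; this is a minimal reading of the description above, and details not specified in the text are our assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_lstm_model(emb_matrix, max_len=128):
    """LSTM-based variant: frozen AraVec embeddings, two BiLSTM layers (128 and 64 units),
    concatenated global average/max pooling, a 64-unit ReLU layer, and a 2-way softmax head."""
    inp = layers.Input(shape=(max_len,), dtype="int32")
    x = layers.Embedding(emb_matrix.shape[0], emb_matrix.shape[1],
                         embeddings_initializer=tf.keras.initializers.Constant(emb_matrix),
                         trainable=False)(inp)
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
    x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
    pooled = layers.Concatenate()([layers.GlobalAveragePooling1D()(x),
                                   layers.GlobalMaxPooling1D()(x)])
    h = layers.Dense(64, activation="relu")(pooled)
    out = layers.Dense(2, activation="softmax")(h)
    return tf.keras.Model(inp, out)
```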
For the training process, we first freeze the weights of the embedding layer in order to preserve the semantics embedded in the vectors. Freezing the embedding weights is essential for the AA process, where the attack depends mainly on synonym replacement, which requires this preservation of vector semantics, as will be explained in the Adversarial Attack and Evaluation subsection. Each model is trained on the training set for 10 epochs with an initial learning rate of 1e-3 that decays linearly over the batches until reaching a minimum value of 3e-4 at the last batch of the final epoch. The training data is continuously shuffled with a buffer size of 256. The batch size is also set to 256. The Adam optimizer with weight decay (Loshchilov and Hutter Reference Loshchilov and Hutter2017) is used to optimize the weights of the models. The weight decay is set to 1e-4, which helps remedy overfitting.
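The training configuration can be reproduced roughly as in the sketch below. It assumes recent TensorFlow, where AdamW is available as `tf.keras.optimizers.AdamW`, and trains the two-unit softmax output with sparse categorical cross-entropy, which for two classes matches the binary cross-entropy reported next; the function and argument names are ours.

```python
import tensorflow as tf

def compile_and_train(model, train_ds, val_ds, steps_per_epoch, epochs=10):
    """Linear learning-rate decay from 1e-3 to 3e-4 over all batches, AdamW with 1e-4 weight decay."""
    lr = tf.keras.optimizers.schedules.PolynomialDecay(
        initial_learning_rate=1e-3,
        decay_steps=epochs * steps_per_epoch,
        end_learning_rate=3e-4,
        power=1.0)                                   # power=1.0 gives a linear schedule
    opt = tf.keras.optimizers.AdamW(learning_rate=lr, weight_decay=1e-4)
    model.compile(optimizer=opt,
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    # train_ds is assumed to be built as, e.g.:
    # tf.data.Dataset.from_tensor_slices((x_train, y_train)).shuffle(256).batch(256)
    return model.fit(train_ds, validation_data=val_ds, epochs=epochs)
```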
3.5 Evaluation metrics
The loss function, alongside both the weighted F1-score and the accuracy score, is used to monitor the training progress. The loss function used is the binary cross-entropy (BCE) loss; its mathematical formulation is presented in equation (1). We also maintain both the weighted F1-score, formulated in equation (2), and the accuracy score, formulated in equation (3), as evaluation metrics in order to produce results comparable to the LABR benchmarks, which used both of these metrics.

$$\mathrm{BCE} = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log \widehat{y_i} + \left(1-y_i\right)\log\left(1-\widehat{y_i}\right)\right] \qquad (1)$$

$$\mathrm{Weighted\ F1} = \sum_{c=1}^{C}\frac{n_c}{n}\cdot\frac{2\,TP_c}{2\,TP_c + FP_c + FN_c} \qquad (2)$$

$$\mathrm{Accuracy} = \frac{1}{n}\sum_{i=1}^{n}\mathbb{1}\left[\widehat{y_i} = y_i\right] \qquad (3)$$

where $n$ is the number of samples, $y_i$ are the actual values, $\widehat{y_i}$ are the predicted values, $C$ is the number of classes, $n_c$ is the number of samples in class $c$, $TP$ are the true positives, $FP$ are the false positives, and $FN$ are the false negatives.
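These metrics can be computed with scikit-learn as in the short sketch below (our own convenience wrapper; `y_prob` is assumed to be the two-column softmax output).

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, log_loss

def evaluate(y_true, y_prob):
    """Return BCE loss, weighted F1, and accuracy with the standard 0.5 threshold."""
    y_pred = (np.asarray(y_prob)[:, 1] >= 0.5).astype(int)
    return {"bce": log_loss(y_true, y_prob),
            "weighted_f1": f1_score(y_true, y_pred, average="weighted"),
            "accuracy": accuracy_score(y_true, y_pred)}
```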
3.6 Adversarial attack and evaluation
In this work, we employ an automatic synonym replacement technique for the AA. Before launching the attack, we pre-define the 10 nearest neighbors of each token in the dictionary based on the cosine similarity between the representations of these tokens in the AraVec embeddings. Constructing this dictionary takes place as follows. First, we create an empty dictionary and populate its keys with the unique tokens of AraVec. We then iterate over the tokens and compute the cosine similarity between the current token's embedding and all other embeddings. For each token, we thus obtain a vector whose length is the number of unique tokens in AraVec excluding the current token, representing the cosine similarities between the current token's embedding and the others. Based on this vector, we select the ten tokens with the highest cosine similarity to the current token, and these ten tokens are added to the dictionary as the values corresponding to the token's key.
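A minimal sketch of this pre-computation over the embedding matrix is shown below; in practice the full vocabulary-by-vocabulary similarity computation would be batched or approximated, and the function and variable names are ours.

```python
import numpy as np

def build_synonym_dict(emb_matrix, k=10):
    """For every token id, pre-compute the ids of its k nearest neighbours by cosine similarity."""
    normed = emb_matrix / (np.linalg.norm(emb_matrix, axis=1, keepdims=True) + 1e-8)
    synonyms = {}
    for i in range(normed.shape[0]):
        sims = normed @ normed[i]            # cosine similarity with every token
        sims[i] = -np.inf                    # exclude the token itself
        synonyms[i] = np.argpartition(-sims, k)[:k].tolist()
    return synonyms
```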
During the AA, we perform the synonym replacement, which is based on PGD (Madry et al. Reference Madry, Makelov, Schmidt, Tsipras and Vladu2017) with an additional nearest-neighbor post-searching step. First, the model makes a forward pass on the input sequences. Then, the loss is computed and the gradients are derived. We use the gradients of the loss with respect to the input embeddings to create the perturbations. The representation of the inputs has the dimensions [batch size, 128, 300], where 128 is the sequence length and 300 is the length of the representation of each token, and the gradients have the same dimensions. We first project the gradients so that they have the same L$_2$-norm as the corresponding inputs. This projection takes place according to equation (4).

$$r = \frac{\nabla _x \mathcal{L}(\theta,x,y)}{\left\| \nabla _x \mathcal{L}(\theta,x,y) \right\|}\,\left\| x \right\| \qquad (4)$$

where $\nabla _x \mathcal{L}(\theta,x,y)$ are the gradients of the loss with respect to input(s) $x$ given network $f_{\theta }$, where $\theta$ are the network’s weights, and label(s) $y$; $\|\cdot\|$ is the L$_2$-norm.
We then add these projected perturbations to the input embeddings, moving in the direction that maximizes the loss. The newly generated adversarial embeddings do not correspond to any word in the embedding space, and hence we approximate each adversarial example by searching for its nearest neighbors. This search relies on the nearest-neighbor dictionary pre-computed before the AA stage and is based on the cosine similarity metric. After generating these adversarial examples, they are forward propagated through the network, and all the metrics are computed to evaluate the model’s performance adversarially. Algorithm 1 shows the steps of the proposed AA technique in detail. Table 1 exhibits some of the adversarial examples, and their translations, generated from clean ones based on the LSTM model after standard training.
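The sketch below outlines one attack step under a few simplifying assumptions of ours: the network is split into an embedding lookup and a `body` model that maps embedding sequences to class probabilities, `syn_ids` is the pre-computed (vocabulary size, 10) neighbour table, and padding/special tokens are not treated specially (in practice they would be skipped).

```python
import numpy as np
import tensorflow as tf

def attack_batch(body, emb_matrix, syn_ids, x_ids, y):
    """Perturb token embeddings along the projected loss gradient (equation (4)),
    then snap each perturbed vector to the most cosine-similar of its 10 candidate synonyms."""
    x_emb = tf.nn.embedding_lookup(emb_matrix, x_ids)                    # (B, 128, 300)
    with tf.GradientTape() as tape:
        tape.watch(x_emb)
        probs = body(x_emb, training=False)
        loss = tf.keras.losses.sparse_categorical_crossentropy(y, probs)
    g = tape.gradient(loss, x_emb)
    # project the gradient onto the L2 norm of the corresponding embedding (per token here)
    r = g / (tf.norm(g, axis=-1, keepdims=True) + 1e-8) * tf.norm(x_emb, axis=-1, keepdims=True)
    x_adv = (x_emb + r).numpy()
    adv_ids = np.array(x_ids)
    for b in range(adv_ids.shape[0]):
        for t in range(adv_ids.shape[1]):
            cands = syn_ids[adv_ids[b, t]]                               # candidate token ids
            cand_vecs = emb_matrix[cands]
            sims = cand_vecs @ x_adv[b, t] / (
                np.linalg.norm(cand_vecs, axis=1) * np.linalg.norm(x_adv[b, t]) + 1e-8)
            adv_ids[b, t] = cands[int(np.argmax(sims))]
    return adv_ids
```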
3.7 Adversarial training
This stage is inspired by the AT-AWP proposed by Wu et al. (Reference Wu, Xia and Wang2020). This adversarial training algorithm generates perturbations based on the input examples as well as the network’s trainable weights, causing a double-perturbation effect. It tries to flatten the loss landscapes of both the inputs and the weights resulting in networks much more robust to attacks.
As this stage of AT involves two steps, namely input perturbation and weight perturbation, we first perturb the inputs in the same way as explained in the Adversarial Attack and Evaluation subsection. Then, we perturb the weights by means of the adversarial examples generated in the first step, as follows. The perturbed inputs are forward propagated through the network and the loss is computed. After deriving the gradients with respect to the trainable weights, the weights are perturbed by these gradients once they have been projected based on the weights of the corresponding layers. The projection is performed as formulated in equation (4); however, this time the projection is applied to the gradients of the weights rather than those of the network's inputs. After perturbing the weights, we record the layer-wise difference between the original and the perturbed weights. Then, we train the network either on the perturbed examples alone (in one setting) or on both the clean and perturbed examples (in another setting). Finally, we add back the recorded differences between the original and perturbed weights in order to undo the perturbation and give the next batch almost the same contribution to the perturbations. The advantage of perturbing the weights besides the input samples is that these attacks are generated based on all the samples in the input batch rather than individual samples, which approximates the worst AA much better than perturbing the inputs alone. This, in turn, forces the network to learn weights that make it more robust to attacks.
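A sketch of one such training step (as used in the second scenario) is given below; it reflects our reading of the procedure, and the perturbation scale `gamma` as well as the per-variable projection are assumptions, since the text only states that the weight gradients are projected onto the scale of the corresponding layer weights.

```python
import tensorflow as tf

def awp_train_step(model, x_adv, y, loss_fn, optimizer, gamma=1e-2):
    """Perturb the trainable weights along their loss gradient (scaled to each weight's norm),
    take an optimization step on the adversarial batch, then undo the weight perturbation."""
    # 1) weight perturbation computed from the adversarial batch
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(loss_fn(y, model(x_adv, training=True)))
    grads = tape.gradient(loss, model.trainable_variables)
    diffs = []
    for w, g in zip(model.trainable_variables, grads):
        if g is None:                                      # variables unused in this pass
            diffs.append(tf.zeros_like(w))
            continue
        d = gamma * g / (tf.norm(g) + 1e-8) * tf.norm(w)   # project onto the weight's scale
        w.assign_add(d)                                    # ascend in weight space
        diffs.append(d)
    # 2) gradient step on the doubly perturbed network
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(loss_fn(y, model(x_adv, training=True)))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    # 3) remove the weight perturbation before the next batch
    for w, d in zip(model.trainable_variables, diffs):
        w.assign_sub(d)
    return loss
```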
When performing AT, we experiment with three different scenarios. The first scenario trains the network on perturbed samples only, without perturbing the weights (Algorithm 2). The second scenario trains the network on perturbed samples with perturbed weights (Algorithm 3). The third scenario uses both clean and perturbed samples to train the network with perturbed weights (Algorithm 4). For the third scenario, we chose to build on the second scenario rather than the first, i.e. to combine clean and adversarial samples with weight perturbation, because of its higher performance.
4. Results and discussions
4.1 Background
In this section, we display and discuss the results of the LSTM-based model, which achieved the best performance among the architectures after being adversarially trained. The patterns for the other models are similar, and hence there is no need to repeat similar discussions; however, we supply the figures of the other models in Appendix One for the readers’ reference. Figure 3 illustrates the results of LSTM standard training, and Figures 4, 5, and 6 show the results of the three adversarial training scenarios. Both the weighted F1-score and the accuracy score, alongside the binary cross-entropy loss value, are displayed. During standard training, we monitor the weighted F1-score on the validation set and save the model’s weights based on the best performance on this metric. For the AT case, we monitor an engineered metric that helps balance the model’s performance on both types of data (clean and adversarial); this metric is explained in more detail later in this section.
4.2 Standard training & adversarial attack
Starting with the standard training process displayed in the three plots of Figure 3, we can see a consistent decrease in the loss value accompanied by an increase in the weighted F1 and accuracy scores on the training set. The model reaches its best performance on the validation set at the eighth epoch, where the loss value, the weighted F1-score, and the accuracy score on the validation and test splits are, respectively, (0.414, 0.402), (0.812, 0.824), and (0.812, 0.824). After attacking this model, we observe a huge drop in performance, where the loss value on the validation and test sets almost doubles to 0.874 and 0.830, respectively. Both the weighted F1-score and the accuracy score are also negatively affected, dropping on the validation and test sets, respectively, to (0.561, 0.576) and (0.561, 0.576). This indicates that we have successfully established an AA method that can fool a model trained to solve an Arabic SA task.
4.3 Adversarial training
Next comes the AT step, where the task is to build a model robust to the AA. As explained in Section 3.7, we experiment with three different scenarios of AT. During the AT stage, we create a new metric, the weighted mean of the weighted F1-scores (WMWF1-score) on the clean and adversarial examples of the validation set. The mathematical formulation of this metric is presented in equation (5). The weights are given to the two contributors heuristically, based on experimentation. For the first and second scenarios, we use a clean-to-adversarial weight ratio of 1:2. This is because the model in these two modes is trained on the adversarial examples only, and as training progresses it starts to overfit the adversarial examples at the expense of the clean ones. In this case, if we give full credit to the adversarial examples, the model will be significantly biased towards them, which would eventually lead to a significant drop in the model’s performance on the clean data. We have experimented with other values for this ratio, such as 2:1, for these two scenarios; however, we found that the latter prevents the model from achieving a significant boost in performance on the adversarial data in exchange for only a slight improvement on the clean data. For the third scenario, on the other hand, we reverse the values and use a clean-to-adversarial ratio of 2:1, as we note that the model quickly achieves better performance on the adversarial data while suffering on the clean data. Our goal was to find a model that performs almost equally on both types of data, and the 2:1 ratio has proven to work best for this purpose among the ratios tested.
$$\mathrm{WMWF1} = \frac{1}{\sum _{t=1}^{2} w_t}\sum _{t=1}^{2} w_t \sum _{c=1}^{C}\frac{n_c}{n}\cdot\frac{2\,TP_c^{(t)}}{2\,TP_c^{(t)} + FP_c^{(t)} + FN_c^{(t)}} \qquad (5)$$

where the outer sum runs over the $2$ input types (clean and adversarial), $w_t$ is the chosen weight for input type $t$, $C$ is the number of classes, $n_c$ is the number of samples in class $c$, $TP$ are the true positives, $FP$ are the false positives, and $FN$ are the false negatives.
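Equivalently, computed from the per-type weighted F1-scores (a small sketch; the argument names are ours):

```python
from sklearn.metrics import f1_score

def wmwf1(y_clean, pred_clean, y_adv, pred_adv, w_clean=1.0, w_adv=2.0):
    """Weighted mean of the weighted F1-scores on clean and adversarial validation data
    (clean-to-adversarial ratio 1:2 in scenarios 1 and 2, reversed to 2:1 in scenario 3)."""
    f1_c = f1_score(y_clean, pred_clean, average="weighted")
    f1_a = f1_score(y_adv, pred_adv, average="weighted")
    return (w_clean * f1_c + w_adv * f1_a) / (w_clean + w_adv)
```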
4.3.1 Scenario 1
Regarding the first scenario, where perturbations are applied to the inputs only and the model is trained on adversarial examples only, the model fails to improve after the first epoch in terms of the WMWF1-score; refer to Figure 4. Although there is some improvement in the performance on the adversarial data of the validation set, this improvement is accompanied by overfitting to this data and a steep drop on the clean data. This behavior is not common among the other architectures, as shown in the figures in Appendix One. Those models continue to improve on the WMWF1-score until after the third epoch in the worst case, i.e. the case of the GRU-based model, Figure A1. However, five out of the seven models fail to improve on the clean validation data after the fifth epoch, which, in turn, indicates the tendency of these models to overfit the adversarial data as well. This tendency has been mentioned in several research papers, including Rice et al. (Reference Rice, Wong and Kolter2020); Cai et al. (Reference Cai, Du, Liu and Song2018); Zhang et al. (Reference Zhang, Xu, Han, Niu, Cui, Sugiyama and Kankanhalli2020b). In order to remedy the problem of overfitting to the adversarial examples, we save the weights of the model that performs best on the WMWF1-score. This approach is, at its core, similar to early stopping, which was proposed as a solution for this problem by Zhang et al. (Reference Zhang, Xu, Han, Niu, Cui, Sugiyama and Kankanhalli2020b). However, in our work, we continue the training process until the tenth epoch, and during this period, the weights of the best-performing model on the custom metric are saved and then used for evaluation and comparison. For this AT mode, the best-performing model is the GRU-based model with a WMWF1-score of 0.780, where the weighted F1-scores on the clean and adversarial examples of the validation set are 0.732 and 0.875, respectively. The performance on the test set is comparable, with weighted F1-scores of 0.745 and 0.867 on the clean and adversarial data, respectively.
4.3.2 Scenario 2
For the second scenario, there is a noticeable improvement in the training process, as the models get the chance to train longer before starting to overfit the adversarial samples. This is evident from the fact that all the models need at least five epochs to converge, while some models, such as the MHA-based model illustrated in Figure A2, keep improving even after the tenth epoch. Overall, the models need one to eight additional epochs compared to scenario 1 to reach their convergence weights. This, in turn, indicates the effectiveness of employing weight perturbations alongside the input perturbations. Moreover, not only does the training process run for more epochs before overfitting, but the overall performance of the models also improves. For the LSTM-based model, the weighted F1-scores on the clean and adversarial examples of the test set are 0.732 and 0.886, respectively, compared to 0.706 and 0.836 in scenario 1; refer to Figure 5.
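To illustrate the idea of perturbing the weights alongside the inputs, the sketch below shows one way such a training step could look in PyTorch. It is an illustrative approximation, not the exact procedure of Section 3.7: the step sizes `eps_x` and `eps_w`, the sign-based perturbation, and the assumption that the model consumes embedding tensors directly are all simplifications.

```python
import torch

def perturbed_training_step(model, optimizer, loss_fn, embeddings, labels,
                            eps_x=1e-2, eps_w=1e-3):
    """Sketch: perturb the input embeddings and the trainable weights in the
    gradient direction, compute the adversarial gradients at the perturbed
    point, restore the original weights, and then apply the update."""
    # 1. Gradients of the clean loss w.r.t. the inputs and the weights.
    embeddings = embeddings.clone().detach().requires_grad_(True)
    loss_fn(model(embeddings), labels).backward()
    adv_embeddings = (embeddings + eps_x * embeddings.grad.sign()).detach()

    # 2. Temporarily nudge every weight along its gradient sign.
    backup = [p.detach().clone() for p in model.parameters()]
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                p.add_(eps_w * p.grad.sign())

    # 3. Gradients of the adversarial loss under the perturbed weights.
    optimizer.zero_grad()
    adv_loss = loss_fn(model(adv_embeddings), labels)
    adv_loss.backward()

    # 4. Restore the original weights, then update with the adversarial gradients.
    with torch.no_grad():
        for p, b in zip(model.parameters(), backup):
            p.copy_(b)
    optimizer.step()
    return adv_loss.item()
```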
4.3.3 Scenario 3
Finally, for the third scenario, where the perturbations are applied to both the inputs and the weights and the model is trained on both clean and adversarial data, we can see that the overall performance of the models improves, especially on the clean data. Besides the improvement in performance, the models also converge to their optimal weights noticeably faster in most cases (5 out of 7 models) compared to scenario 2. The LSTM-based model in this scenario shows high and comparable performance on the clean and adversarial data, with weighted F1-scores of 0.815 and 0.843, respectively, on the validation set, and 0.816 and 0.847, respectively, on the test set; refer to Figure 6.
4.4 General results
Tables 2 and 3 present the results of the standard and adversarial training modes, respectively. Based on the results of the standard training setting, the best-performing model on the clean test set is the LSTM-based model. However, its score on the adversarial data is much lower than on the clean data: 0.576 compared to 0.824 on both metrics. Moreover, we notice that the addition of a 1D CNN layer to the GRU and MHA models helped them perform better on the clean data but had a negative impact on the adversarial data. The LSTM model, on the other hand, was negatively affected by the addition of the CNN layer on both data types. With regard to AT, the additional CNN layer generally has a noticeably positive impact on the clean-data performance of the LSTM and GRU models in the first two scenarios. In contrast, the pure models outperformed the CNN-equipped models on the adversarial data, which indicates their tendency to overfit the data type they are trained on and highlights the impact of the additional 1D CNN layer on clean/adversarial generalization. Finally, for scenario 3, the best-performing model on the clean validation set is the LSTM-based model, while the LSTM-CNN-based model outperforms it on the clean test set.
4.5 Remarks
Based on these results, several points need to be discussed and clarified. The first is that some models perform better according to the value of the loss function, yet are outperformed by other models in terms of both the weighted F1-score and the accuracy metric. For instance, the best-performing model under standard training conditions with respect to the loss function is the LSTM model, with a loss value of 0.414 on the clean validation set; refer to Table 2. Nevertheless, this model lags behind on the two metrics, with weighted F1 and accuracy scores of 0.812, compared to the GRU-based model, which scores 0.815 and 0.816 on the two metrics, respectively, despite its higher loss value of 0.589. This can be attributed to the imperfect correlation between the loss function and the evaluation metrics, since the metrics rely on thresholding. Standard thresholding in binary or multi-class classification means mapping the output probability of the corresponding class to 1 if it is higher than 0.5 and to 0 otherwise. Some applications require different threshold values depending on their point of interest; our study follows the standard case. The cross-entropy loss, on the other hand, rewards or penalizes models based on their confidence in the predicted class label. For example, if a model predicts class 1 with probabilities of 1.0 and 0.51 for two different samples whose ground-truth label is 1, this produces loss values of 0 and 0.673, respectively. However, the F1 and accuracy scores in both cases will be 1 due to thresholding. Thus, the cross-entropy loss is sensitive to the probability values, whereas the evaluation metrics used in this study only consider the thresholded predictions.
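The following small snippet reproduces this example and makes the contrast explicit; it is only a numerical illustration.

```python
import math

# Cross-entropy penalizes low confidence even when the thresholded prediction
# is correct, whereas accuracy and F1 only see the mapped label.
for p in (1.0, 0.51):                        # probability assigned to the true class (1)
    cross_entropy = abs(math.log(p))         # -log(p): 0 for p = 1.0, ~0.673 for p = 0.51
    prediction = 1 if p > 0.5 else 0         # standard 0.5 thresholding: both correct
    print(f"p = {p:<4}  cross-entropy = {cross_entropy:.3f}  prediction = {prediction}")
```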
The second point worth attention is that, after adversarial training, the models predict the adversarial samples, on both the validation and test splits, better than they do the clean ones. There are three potential hypotheses that can explain this phenomenon. First, applying perturbations to the input representations alone is not guaranteed to generate the worst-case attack: the backpropagated gradients are derived through the chain rule, so they describe the worst case only when the input perturbations are applied in conjunction with perturbations of the model's trainable weights. This chain of derivatives (gradients) must be taken as a whole to guarantee the worst-case attack; hence, perturbing the inputs alone is only an optimistic approximation of the strongest attack. Second, the nearest-neighbor search replaces the newly perturbed sample with its nearest neighbor from a predefined dictionary, which forms a second approximation on top of the already approximated attack. Finally, the adversarial examples generated during training vary continuously with the updated trainable weights and the gradients backpropagated to the input representations. It is therefore possible that the model is trained on adversarial samples similar to the ones generated for the validation and test sets after applying the attack, which ultimately means overfitting to the adversarial samples. Due to the dynamic nature of generating these adversarial samples during training, tracking them and comparing them to the ones generated for the validation and test sets is complex enough to warrant a separate study, and hence we leave it for future work.
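As an illustration of the second hypothesis, the sketch below shows how such a projection step could look: a perturbed embedding is mapped back to the closest entry of a precomputed candidate list by cosine similarity. The function and variable names are illustrative, not the implementation used in this work.

```python
import numpy as np

def nearest_token(perturbed_vec, candidate_vecs, candidate_tokens):
    """Project a perturbed embedding back onto the predefined dictionary by
    returning the candidate token with the highest cosine similarity.
    `candidate_vecs` is a (k, d) matrix of the token's precomputed neighbors."""
    v = perturbed_vec / np.linalg.norm(perturbed_vec)
    c = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    sims = c @ v                                  # cosine similarity to each candidate
    return candidate_tokens[int(np.argmax(sims))]

# Example with random vectors standing in for pre-trained embeddings:
# nearest_token(np.random.rand(300), np.random.rand(10, 300), [f"w{i}" for i in range(10)])
```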
Third, we notice that the LSTM- and GRU-based models outperform the MHA-based models in most scenarios. This can be attributed to the relatively small amount of data used for training. Previous work has noted the tendency of transformer-based models to overfit when trained on smaller datasets, which hinders their generalization (Zeyer et al. Reference Zeyer, Bahar, Irie, Schlüter and Ney2019). We believe that training on a larger amount of data has the potential to yield better results for the transformer-based models, which can be investigated in future work.
Fourth, we notice a drop in the performance of the LSTM and GRU models when the 1D CNN layer is added. The 1D CNN layer works by mapping multiple inputs within a frame (kernel) into a single value through the convolution operation. This mapping enables the model to attend to multiple positions (token features) with one kernel; however, this kind of pooling may have caused the loss of the polarity signal in some sequences, which negatively affected performance. Further investigation of this phenomenon is left for future work.
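The following minimal example illustrates this pooling effect over the token axis; the shapes and hyperparameters are arbitrary assumptions, not the configuration used in this work.

```python
import torch
import torch.nn as nn

# A 1D convolution slides a kernel over the token axis, mapping every window of
# `kernel_size` token vectors to a single value per filter.
embeddings = torch.randn(8, 50, 300)            # (batch, tokens, embedding dim)
conv = nn.Conv1d(in_channels=300, out_channels=128, kernel_size=3, padding=1)
features = conv(embeddings.transpose(1, 2))     # Conv1d expects (batch, channels, tokens)
print(features.shape)                           # torch.Size([8, 128, 50])
```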
Finally, to the best of our knowledge, this is the first work to employ AT to improve the robustness of Arabic NLP models. Comparable works on other languages have reported lower robustness than ours. Xu et al. (Reference Xu, Li, Zhang, Zheng, Chang, Hsieh and Huang2022), for example, employed PGD-AWP for AT; however, the accuracy of their model under attack did not exceed 57.60%.
5. Conclusion
In this work, we have designed a framework to attack trained models by means of synonym replacement. The synonyms are automatically selected from a predefined dictionary containing the 10 nearest neighbors of each token, based on the AraVec pre-trained embeddings and the cosine similarity metric. After demonstrating the effectiveness of this framework in attacking the trained models, we experimented with three different scenarios to reinforce these models and make them robust to such attacks. The scenarios revolved around training the models on perturbed samples and with perturbed weights so that the models learn weights that make them robust. The best scenario proved to be the one that employed both clean and adversarial examples in training, in conjunction with weight perturbation. To our knowledge, this is the first work in the Arabic NLP domain that aims to build a robust deep learning framework via AT.
For future work, we propose investigating the point raised in the Remarks subsection, namely that after adversarial training the models appear to predict the adversarial examples better than the clean ones. We presented three hypotheses, including the suspicion that the distribution of the adversarial examples in the validation and test sets becomes closer to that of the adversarial examples in the training set. However, because these generated samples change continuously during training, as the models' weights are updated and the gradients change accordingly, tracking them in the training set requires a separate study. In addition, this research was based on the LABR dataset, which was collected from book readers' reviews. Such reviews are written by a relatively educated group of people who tend to use MSA in their writing. Hence, the effectiveness of the proposed framework could be evaluated more thoroughly by testing it on dialectal content.
Competing interests and funding
The authors of this manuscript declare that they have no competing interests with any party.
Appendices
A. Remaining models’ progress plots