1. Introduction
The exponential increase in textual content due to the widespread use of social media platforms renders human moderation of such information untenable (Cao, Lee, and Hoang 2020). Governments, media organizations, and researchers now view the prevalence of hate speech on online social media platforms as a major problem, particularly given how quickly it spreads and encourages harm to both individuals and society. Hate speech (Nockleby 1994) is any communication that intends to attack the dignity of a group based on characteristics such as race, gender, ethnicity, sexual orientation, nationality, religion, or other features. With the advancement of natural language processing (NLP), numerous studies have suggested methods to detect hate speech automatically using traditional machine learning (ML) (Dinakar, Reichart, and Lieberman 2011; Reynolds, Kontostathis, and Edwards 2011; Dadvar, Trieschnigg, and Jong 2014) and deep learning approaches (Waseem and Hovy 2016; Badjatiya et al. 2017; Agrawal and Awekar 2018). However, it is crucial for artificial intelligence (AI) tools not only to identify hate speech automatically but also to generate the implicit bias present in a post in order to explain why it is hateful. The advent of explainable AI (Gunning et al. 2019) has necessitated the provision of explanations and interpretations for decisions made by ML algorithms. This requirement is crucial for establishing trust and confidence in the deployment of AI models. Additionally, recent legislation in Europe, such as the General Data Protection Regulation (Regulation 2016), has introduced a “right to explanation,” further emphasizing the need for interpretable models. Consequently, there is a growing emphasis on developing models that prioritize interpretability rather than solely improving performance through increased model complexity.
Stereotypical bias (Cuddy et al. 2009), a common unintentional bias, can be based on specific aspects such as skin tone, gender, ethnicity, demography, disability, Arab-Muslim origin, etc. Stereotyping is a cognitive bias that permeates all aspects of daily life and is firmly ingrained in human nature. Social stereotypes have a detrimental influence on people’s opinions of other groups and may play a crucial role in how people interpret words aimed toward minority social groups (Sap et al. 2019a). For example, earlier studies have demonstrated that toxicity detection models associate texts containing African-American English traits with greater offensiveness than texts lacking such traits (Davidson, Bhattacharya, and Weber 2019).
In the past decade, extensive research has been conducted to develop datasets and models for the automatic detection of online hate speech in the English language (Waseem and Hovy 2016; Badjatiya et al. 2017; Agrawal and Awekar 2018). However, there is a noticeable scarcity of hate speech detection work in the Hindi language, despite its status as the fourth-most-spoken language globally and its wide use in South Asia. Existing studies in this domain have primarily focused on enhancing the performance of hate speech detection using various models, often neglecting the crucial aspect of explainability, which the emergence of explainable AI has made a critical requirement in this field. For instance, debiasing techniques that incorporate knowledge of toxic language may benefit from the extra information provided by in-depth toxicity analyses of the text (Ma et al. 2020). Furthermore, thorough descriptions of toxicity can make it easier for people to interact with toxicity detection systems (Rosenfeld and Richardson 2019).
To fill this research gap, we create a benchmark Hindi hate speech explanation (HHES) dataset that contains the stereotypical bias and target group category of a toxic post. To create the HHES dataset, we manually translate the existing English Social Bias Inference Corpus (SBIC) (Sap et al. 2020a). We then develop an efficient multitask (MT) framework that can solve two different categories of tasks simultaneously, that is, (i) a sequence generation task (generate the stereotypical bias as an explanation) and (ii) a classification task (identify the target group category).
Humans have the ability to learn multiple tasks simultaneously and apply the knowledge learned from one task to another. To mimic this quality of human intelligence, researchers have been working on multitask learning (MTL) (Caruana 1997), a training paradigm in which a model is trained with data from different closely related tasks in an attempt to efficiently learn the mapping and connection between these tasks. Many works have shown that solving a closely related auxiliary task along with the main task increases the performance of the primary task, for example in cyberbullying detection (Maity and Saha 2021b), complaint identification (Singh et al. 2022), and tweet act classification (Saha et al. 2022). A typical MT model consists of a shared encoder that contains representations from data of different tasks and several task-specific layers or heads attached to that encoder. However, this approach has several drawbacks, such as negative transfer (Crawshaw 2020) (where multiple tasks, instead of optimizing the learning process, start to hurt training), model capacity (Wu 2019) (if the size of the shared encoder becomes too large, then there will be no transfer of information across different tasks), and the optimization scheme (Wu 2019) (how to assign weights to different tasks during training). This approach also has scalability issues, such as adding a task-specific head every time a new task is introduced or changing the complete model architecture whenever a new combination of tasks is introduced.
To overcome the challenges of MTL, we propose the use of a generative model to solve two different categories of tasks: classification (target group category) and generation (stereotypical bias). Rather than employing two separate models to address these tasks, we present a commonsense-aware unified generative MT framework that can solve both tasks simultaneously in a text-to-text generation manner. We converted the classification task into a generation task, where the target output sentence is the concatenation of the classification task’s output tokens. In our proposed model, the input is text, such as a social media post, and the output is also text, representing the concatenation of stereotypes and target groups separated by a special character. For instance, given the input post “Bitches love Miley Cyrus and Rihanna because they speak to every girl’s inner ho,” the corresponding output or target sequence is “ $\lt$ Women are sexually promiscuous $\gt$ $\lt$ Gender $\gt$ .” In this example, “Women are sexually promiscuous” represents the stereotypical bias, and “Gender” is the target group category. As sentient beings, we use our common sense to establish connections between what is explicitly said and inferred. We employed ConceptNet to generate commonsense knowledge to capture and apply common patterns of real-world knowledge in order to draw conclusions or make decisions about a given post. For example, if the input sentence is “I was just pretending to be retarded!,” then some of the generated commonsense reasonings by ConceptNet are (i) “pretend requires imagination” and (ii) “retard is similar in meaning to an idiot”.
To sum up, our contributions are twofold:
1. HHES, a new benchmark dataset for explainable hate speech detection with target group category identification in the Hindi language, has been developed.
2. To simultaneously solve two tasks, that is, stereotypical bias/explanation (generation task) and identifying target group (classification task), a commonsense-aware unified generative framework ( $CGenEx$ ) with reinforcement learning-based training has been proposed.Footnote a
The organization of this article is as follows. Section 2 surveys previous work in this domain. Section 3 describes the process of dataset creation in detail. Section 4 explains the proposed methodology, and Section 5 describes the experimental settings and results, including a detailed error analysis.
2. Related works
Hate speech is highly dependent on linguistic subtlety. Researchers have recently devoted considerable attention to automatically identifying hate speech in social media. In this section, we review recent works on detecting and explaining hate speech.
2.1. Hate speech detection
Kamble and Joshi (2018) explored hate speech detection in code-mixed Hindi-English tweets. By employing three deep learning models with domain-specific embeddings, they achieved an improvement of 12 percent in the F1 score compared to previous work that used statistical classifiers. The authors emphasized the ability of their models to capture the semantic and contextual aspects of hate speech, highlighting the value of domain-specific word embeddings. Kumar et al. (2018) address the increasing incidents of aggression and related behaviors on social media platforms by developing an aggression-annotated dataset of Hindi-English code-mixed data from Twitter and Facebook. The dataset, consisting of approximately 18k tweets and 21k Facebook comments, is annotated with a hierarchical tagset of aggression levels and types. This annotated dataset serves as a valuable resource for understanding and automatically identifying aggression, trolling, and cyberbullying on social media platforms. Maity and Saha (2021a) introduce a benchmark corpus specifically designed for detecting cyberbullying targeted at children and women in Hindi-English code-mixed language. By combining BERT, CNN, GRU, and Capsule networks, the authors develop a classification model that surpasses both conventional ML and deep neural network baselines. The model achieves an accuracy of 79.28 percent, highlighting its effectiveness in identifying cyberbullying instances. By leveraging the power of the BERT language model, Paul and Saha (2020) developed a transformer-based method for identifying hate speech across multiple social media platforms.
The approach involves fine-tuning BERT and implementing a straightforward classification model, leading to state-of-the-art performance on real-world datasets from Formspring, Twitter, and Wikipedia. Badjatiya et al. (2017) address the task of hate speech detection on Twitter by leveraging deep learning architectures and semantic word embeddings. Through extensive experimentation on a benchmark dataset of 16K annotated tweets, the study demonstrates that the proposed deep learning methods outperform state-of-the-art character n-gram and word term frequency-inverse document frequency (TF-IDF) methods by a margin of approximately 18 F1 points. The authors also highlight the superiority of certain combinations, such as LSTM with random embedding and GBDT, and provide evidence of the task-specific nature of the learned embeddings through word similarity comparisons. Watanabe et al. (2018) present an approach for detecting hate speech on Twitter by leveraging patterns and unigrams collected from the training set as features for ML algorithms. The proposed method achieves accuracies of 87.4 percent for binary classification (offensive vs. non-offensive) and 78.4 percent for ternary classification (hateful, offensive, or clean). The study highlights the importance of automatically identifying hate speech to filter out offensive content and proposes future work to expand the dictionary of hate speech patterns and analyze the presence of hate speech across different demographics. Davidson et al. (2017) address the challenge of distinguishing hate speech from other types of offensive language in social media. Using a crowd-sourced hate speech lexicon and a multi-class classification model, the study categorizes tweets into hate speech, offensive language, or neutral content.
The findings highlight the importance of precise classification, uncover insights into different types of hate speech, and emphasize the need to address social biases in hate speech detection algorithms.
2.2. Explainability/bias
Zaidan, Eisner, and Piatko (2007) proposed the concept of rationales, in which human annotators highlight a section of text that supports their tagging decision. The authors showed that using these rationales improved sentiment classification performance. Mathew et al. (2020) introduce HateXplain, a comprehensive benchmark dataset that includes annotations from multiple perspectives, such as classification labels (hate, offensive, normal), target communities, and rationales on which labeling decisions are based. The study evaluates state-of-the-art models on this dataset and highlights the limitations of high-performing classification models in terms of explainability. Furthermore, the findings demonstrate the importance of incorporating human rationales in training models to mitigate unintended bias and improve performance in hate speech detection. Sridhar and Yang (2022) developed the MixGEN model, based on expert, explicit, and implicit knowledge, to explain toxic text by generating the stereotype of the post. They experimented on the SBIC dataset, which was collected from social media platforms such as Twitter, Reddit, and Gab. The study highlights the strengths and weaknesses of different knowledge types and emphasizes the effectiveness of mixture and ensemble methods in leveraging diverse knowledge sources to produce high-quality generations. To remove stereotypical bias in the hate speech detection task, Badjatiya et al. (2019) propose a two-stage framework that includes heuristics to identify bias-sensitive words and novel strategies based on knowledge generalization for replacing these words.
Experimental results using real-world datasets (WikiDetox and Twitter) demonstrate the effectiveness of the proposed methods in reducing bias without compromising overall model performance. The study highlights the potential of data correction techniques and provides qualitative analysis and examples to support the findings. Karim et al. (2021) developed DeepHateExplainer, an explainable approach for hate speech detection in the under-resourced Bengali language. The authors preprocess Bengali texts and employ a neural ensemble method using transformer-based neural architectures to classify hate speech into political, personal, geopolitical, and religious categories. They utilize sensitivity analysis and layer-wise relevance propagation to identify important terms and generate human-interpretable explanations. Evaluations against ML and neural network baselines demonstrate the superior performance of DeepHateExplainer. The study acknowledges potential limitations due to limited labeled data and proposes future directions for improvement and expansion.
2.3. Text generation
Models such as GPT-2 (Radford et al. 2019) and GPT-3 are decoder-only transformer models that have been pre-trained on a large amount of text data and can generate fluent, coherent, and consistent text. Encoder-decoder transformers such as BART (Lewis et al. 2020) and T5 (Raffel et al. 2020) have shown large improvements in many NLP tasks such as summarization and translation. Recently, there have been many attempts to use these generative models to solve non-generative tasks. Yan et al. (2021) used the BART model for aspect-based sentiment analysis. They proposed converting all the aspect-based sentiment analysis subtasks into a unified generation task, in which BART generates the target sequence end to end. Similarly, Wang et al. (2022) used the T5 model to solve named entity recognition as a generative problem. Their method enriches source sentences with task-specific instructions and answer options and then infers the entities and types in natural language. The T5 model is further trained on tasks such as entity extraction and entity typing.
After an in-depth literature review, we can conclude that most of the works on hate speech detection are in English, and there is no such work on the explainability of hate speech by generating the internal stereotypical bias in the Hindi language. In this work, we attempt to bridge this research gap.
3. Dataset creation
This section discusses the developed benchmark HHES (stereotypes) dataset. To begin, we reviewed the literature for existing hate speech datasets that contain stereotypical bias and target groups. To our knowledge, the only such standard dataset is the English SBIC developed by Sap et al. (2020a). The lack of any other publicly available dataset related to our work and the good structure of this dataset make it a suitable choice for our purpose.
Technological advancements have revolutionized the way people express their opinions, particularly in low-resource languages. India, a country with a massive internet user base of 1,010 million,Footnote b exhibits significant linguistic diversity. Among the numerous languages spoken in India, Hindi holds a prominent position as one of the official languages,Footnote c with over 691 million speakers.Footnote d Consequently, a substantial portion of text conversations on social media platforms in India occurs in the Hindi language. This phenomenon highlights the significance of Hindi as the primary medium of communication for the majority of users in the country.
We have manually annotated the existing English SBIC dataset to create the Hindi HHES dataset. The annotation process was overseen by two proficient professors who have extensive expertise in hate speech and offensive content detection. The execution of the annotation task was carried out by a group of ten undergraduate students who were proficient in both Hindi and English. These students were recruited voluntarily through the department email list and were provided compensation in the form of gift vouchers and an honorarium for their participation. To ensure consistency and accuracy in the translation process, we initiated the annotation training phase with a set of gold-standard translated samples. Our expert annotators randomly selected 300 samples and manually translated them from English to Hindi. Through collaborative discussions, any differences or discrepancies in the translations were resolved, resulting in the creation of 300 gold-standard manually annotated samples encompassing toxic posts and their corresponding stereotypes. To facilitate the training of novice annotators, these annotated examples were divided into three sets, each containing one hundred samples. This division allowed for a three-phase training procedure in which novice annotators received guidance and feedback from the expert annotators. After the completion of each training phase, the expert annotators collaborated with the novice annotators to rectify any incorrect annotations and provide further guidance. Upon the conclusion of the three-phase training process, the top ten annotators were selected based on their performance. These annotators were chosen to annotate the entire dataset, and the workload was evenly divided among them. Therefore, each post was translated by one of the selected annotators. 
However, we acknowledge that despite our diligent efforts, there may be cases where the translation does not precisely replicate the original post due to the inherent difficulties of cross-lingual translation and the complexities of social media language.
The numbers of training, validation, and test samples in the HHES dataset are 12,110, 1,806, and 1,924, respectively. The detailed distribution of target group category classes is shown in Table 1.
Further, we have engaged three senior annotators (master’s students in linguistics) to verify the translation quality in terms of fluency (F) and adequacy (A) as mentioned in Ghosh, Ekbal, and Bhattacharyya (2022). Fluency evaluates whether the translation is syntactically correct or not, whereas adequacy checks the semantic quality. Each annotator marked every translated sentence with an ordinal value from a scale of 1 to 5Footnote e for both F and A. We attain high average F and A scores of 4.23 and 4.58, respectively, illustrating that the translations are of good quality. In Table 2, some examples of the HHES dataset are shown.
4. Methodology
In this work, we have proposed CGenEx (shown in Figure 1), a commonsense-aware unified generative framework for generating the stereotypical bias that explains why an input post is hateful and for identifying the target group category. The proposed models are described in detail below.
4.1. Commonsense-aware generative framework (CGenEx)
We propose a text-to-text generation paradigm for solving hate speech explanation and target group category identification in a unified manner. To transform this problem into a text generation problem, we first construct a natural language target sequence, $Y_i$, for each input sentence, $X_i$, for training purposes by concatenating the explanation (stereotypical bias) and the target group. The target sequence $Y_i$ is represented as

$$Y_i = \langle St \rangle \, \langle Tg \rangle \tag{1}$$
where St and Tg represent the corresponding stereotypical bias and target group of an input post $X_i$ , respectively.
We have added special characters after each task’s prediction as shown in equation (1) so that we can extract task-specific predictions during testing or inference. Now, both the input sentence and the target are in the form of natural language to leverage large pre-trained sequence-to-sequence models for solving this task of text-to-text generation.
The problem can now be reformulated as follows: given an input sequence $X$, the task is to generate an output sequence, $Y^{\prime}$, containing all the predictions defined in equation (1) using a generative model defined in equation (2):

$$Y^{\prime} = G(X) \tag{2}$$
where $G$ is a generation model. We divide our approach into three steps: (1) commonsense extraction module, (2) commonsense-aware transformer model, and (3) reinforcement learning-based training.
4.1.1. Sequence-to-sequence learning (Seq2Seq)
The text-to-text generation problem defined in equation (2) can be solved with a sequence-to-sequence model, which consists of two modules: (1) an encoder and (2) a decoder. We employed the pre-trained BART (Lewis et al. 2020) and T5 (Raffel et al. 2020) models as the sequence-to-sequence models in our proposed model (CGenEx).
BART: BART is an encoder-decoder-based transformer model that is well suited to text generation tasks such as summarization and translation. BART is pre-trained with various denoising objectives such as token masking, token deletion, text infilling, sentence permutation, and document rotation.
T5: T5 is also an encoder-decoder-based transformer model which aims to solve all the text-to-text generation problems. The main difference between BART and T5 is the pre-training objective. In T5, the transformer is pre-trained with a denoising objective where 15 percent of the input tokens are randomly masked and the decoder tries to predict all these masked tokens, whereas during pre-training of BART, the decoder generates the complete input sequence.
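The difference between the two pre-training targets can be illustrated with a small sketch. It is a toy illustration only, not the actual pre-training code of either model: the function names are hypothetical, the masked positions are fixed by hand rather than sampled at a 15 percent rate, and only BART's token-masking noise is shown.

```python
def t5_span_corruption(tokens, masked_positions):
    """Toy T5-style corruption: each run of masked tokens becomes one sentinel.

    Returns (corrupted_input, target). The target lists only the dropped
    tokens, each run introduced by its sentinel, plus a closing sentinel.
    """
    masked = set(masked_positions)
    corrupted, target = [], []
    sentinel = 0
    prev_masked = False
    for i, tok in enumerate(tokens):
        if i in masked:
            if not prev_masked:
                mark = f"<extra_id_{sentinel}>"
                corrupted.append(mark)
                target.append(mark)
                sentinel += 1
            target.append(tok)
            prev_masked = True
        else:
            corrupted.append(tok)
            prev_masked = False
    target.append(f"<extra_id_{sentinel}>")
    return corrupted, target


def bart_denoising_pair(tokens, masked_positions):
    """Toy BART-style token masking: the target is the full original sequence."""
    masked = set(masked_positions)
    corrupted = ["<mask>" if i in masked else t for i, t in enumerate(tokens)]
    return corrupted, list(tokens)


tokens = "Thank you for inviting me to your party last week".split()
t5_input, t5_target = t5_span_corruption(tokens, [2, 3, 8])
bart_input, bart_target = bart_denoising_pair(tokens, [2, 3, 8])
print(" ".join(t5_input), "->", " ".join(t5_target))
print(" ".join(bart_input), "->", " ".join(bart_target))
```

In the T5-style pair the decoder only has to produce the dropped tokens, whereas in the BART-style pair it reproduces the whole sentence.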
4.1.2. Commonsense extraction module
Commonsense reasoning in NLP models is the ability of the model to capture and apply common patterns of real-world knowledge in order to draw conclusions or make decisions about a particular text or dataset (Sap et al. 2020b). This type of reasoning allows the model to draw inferences. Incorporating commonsense reasoning in language models can help to capture the underlying intentions and context behind the speech more accurately. We employ a commonsense extraction module to provide more context in the form of commonsense reasoning to the input text so that the model can incorporate knowledge regarding the social entities and events involved in the input text. We use ConceptNet (Speer, Chin, and Havasi 2017) as the knowledge base for the commonsense extraction module. First, we feed the input text, $X_i$, to the commonsense extraction module to extract the top five commonsense reasoning triplets using the same strategy as Sridhar and Yang (2022), where a triplet consists of two entities and a relation between them and is then converted into a single sentence. Formally, to get the top five triplets from ConceptNet, we take the nouns, verbs, and adjectives from the input and search for related triplets in ConceptNet. We then sort the triplets by a combination of their IDF score and edge weight and select the top five. To obtain the final commonsense reasoning $CS$ for each input text, $X_i$, we concatenate these five commonsense reasonings together.
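The ranking step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the text does not say how the IDF score and edge weight are combined, so the sketch simply adds them; the `Triplet` class, the relation templates, and all numeric values are hypothetical, and the triplets are supplied by hand rather than retrieved from ConceptNet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    head: str           # content word taken from the input post
    relation: str       # relation label, e.g. "HasPrerequisite"
    tail: str           # related concept
    edge_weight: float  # weight of the knowledge-base edge


# Hypothetical templates that turn a triplet into a single sentence.
TEMPLATES = {
    "HasPrerequisite": "{head} requires {tail}",
    "SimilarTo": "{head} is similar in meaning to {tail}",
    "RelatedTo": "{head} is related to {tail}",
}


def verbalize(t):
    """Render one triplet as a short natural-language sentence."""
    template = TEMPLATES.get(t.relation, "{head} " + t.relation + " {tail}")
    return template.format(head=t.head, tail=t.tail)


def top_k_commonsense(triplets, idf, k=5):
    """Rank triplets by IDF(head) + edge weight (assumed combination),
    keep the top k, and concatenate their sentences into one string."""
    def score(t):
        return idf.get(t.head, 0.0) + t.edge_weight

    ranked = sorted(triplets, key=score, reverse=True)
    return ". ".join(verbalize(t) for t in ranked[:k])


candidates = [
    Triplet("pretend", "HasPrerequisite", "imagination", 2.0),
    Triplet("music", "RelatedTo", "sound", 1.0),
    Triplet("be", "RelatedTo", "exist", 0.5),
]
idf_scores = {"pretend": 3.0, "music": 5.0, "be": 0.1}
print(top_k_commonsense(candidates, idf_scores, k=2))
```

Rare content words (high IDF) with strongly weighted edges rise to the top, while frequent words such as "be" are pushed out of the selection.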
4.1.3. Commonsense-aware transformer
To leverage the commonsense reasoning $CS$ obtained from the commonsense extraction module, we have proposed two variations of a commonsense-aware encoder-decoder architecture (CGenEx-con and CGenEx-fuse) that are capable of incorporating $CS$ in their sequence-to-sequence learning process. We employed the pre-trained BART (Lewis et al. 2020) and T5 (Raffel et al. 2020) models as the base sequence-to-sequence models.
4.1.4. CGenEx-con (concatenation-based CGenEx)
Given an input text $X_i$ and the corresponding commonsense reasoning $CS$, the task of generating the target sequence, $Y_{i}^{\prime}$, can be modeled as the following conditional text generation model: $P_{\theta }(Y_{i}^{\prime}|X_i,CS)$, where $\theta$ is a set of model parameters. CGenEx-con models this conditional probability as follows:
We first concatenate the tokens of the input text, $X_i$, and the commonsense reasoning, $CS$, to obtain the final input sequence: $T_i=X_i\oplus CS$. Now, given a pair of input and target sequences, $(T_i,Y_i)$, the first step is to feed $T_i$ to the encoder module to obtain the hidden representation of the input, defined as

$$H_{EN} = G_{Encoder}(T_i) \tag{3}$$
where $G_{Encoder}$ represents encoder computation.
After obtaining the hidden representation, $H_{EN}$, we feed $H_{EN}$ and all the output tokens up to time step $t-1$, represented as $Y_{\lt t}$, to the decoder module to obtain the hidden state at time step $t$:

$$H^t_{DEC} = G_{Decoder}(H_{EN}, Y_{\lt t}) \tag{4}$$
where $G_{Decoder}$ denotes the decoder computations.
The conditional probability of the predicted output token at the $t^{th}$ time step, given the input and the previous $t-1$ predicted tokens, is calculated by applying the softmax function over the hidden state, $H^t_{DEC}$, as follows:

$$P_{\theta}(Y^{\prime}_{t} \mid X_i, CS, Y_{\lt t}) = F_{softmax}(W_{Gen} H^t_{DEC}) \tag{5}$$
where $F_{softmax}$ represents softmax computation and $W_{Gen}$ denotes weights of our model.
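The final projection step can be sketched in a few lines of plain Python. This is a toy illustration that assumes $W_{Gen}$ acts as a vocabulary-by-hidden-size projection matrix; the function names, the three-token vocabulary, and all numbers are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_distribution(w_gen, h_dec):
    """Project the decoder state h_dec with w_gen (one row per vocabulary
    entry) and normalize the scores into a probability distribution."""
    logits = [sum(w * h for w, h in zip(row, h_dec)) for row in w_gen]
    return softmax(logits)


# Toy setup: vocabulary of 3 tokens, hidden size 2.
w_gen = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h_dec = [0.5, 2.0]
probs = next_token_distribution(w_gen, h_dec)
greedy_token = probs.index(max(probs))  # index of the most likely token
print(probs, greedy_token)
```

At inference time, choosing `greedy_token` at every step gives greedy decoding; sampling from `probs` instead gives a sampled sequence.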
4.1.5. CGenEx-fuse (fusion-based CGenEx)
To fuse the information from both commonsense and input text, we have proposed a commonsense-aware encoder (shown in Figure 2), an extension of the original transformer encoder (Vaswani et al. 2017). First, the input text, $X_i$, is tokenized and converted into a sequence of embeddings. Positional encodings are then added to these token embeddings to retain their positional information before the input is fed to the proposed commonsense-aware encoder. Our commonsense-aware encoder is composed of three sub-layers: (1) multi-head self-attention (MSA), (2) feedforward network (FFN), and (3) commonsense fusion (CSF). MSA and FFN are the standard sub-layers used in the original transformer encoder (Vaswani et al. 2017). We have added a CSF sub-layer as a means to fuse the commonsense knowledge in our model, which works as follows:
After obtaining the encoded representation $H_{EN}$ from the first two sub-layers (MSA and FFN), we feed $H_{EN}$ and the commonsense feature vector $G_{CS}$ to the CSF sub-layer. Unlike the standard transformer encoder, where the same input is projected as query, key, and value, in CGenEx-fuse we implement a context-aware self-attention mechanism inside CSF to facilitate the exchange of information between $H_{EN}$ and $G_{CS}$, motivated by Yang et al. (2019). We create two triplets of query, key, and value matrices corresponding to $H_{EN}$ and $G_{CS}$, respectively: ( $Q_x$ , $K_x$ , $V_x$ ) and ( $Q_{cs}$ , $K_{cs}$ , $V_{cs}$ ). The triplet ( $Q_x$ , $K_x$ , $V_x$ ) is generated by linearly projecting the input text representation, $H_{EN}$, whereas the triplet ( $Q_{cs}$ , $K_{cs}$ , $V_{cs}$ ) is obtained through a gating mechanism as given in Yang et al. (2019), which works as follows: to balance fusing information from the commonsense representation, $G_{CS}$, against retaining the original information from the text representation, $H_{EN}$, we learn matrices $\lambda _{K}$ and $\lambda _{V}$ to create context-aware $K_{cs}$ and $V_{cs}$ (equation (6)):
where $U_K$ and $U_V$ are learnable parameters and matrices $\lambda _{K}$ and $\lambda _{V}$ are computed as follows:
where $W_K^X$ , $W_V^X$ , $W_K^{CS}$ , and $W_V^{CS}$ all are learnable parameters and $\sigma$ represents the sigmoid function computation.
After obtaining $K_{cs}$ and $V_{cs}$ , we apply the dot product attention-based fusion method over $Q_{x}$ , $K_{cs}$ , and $V_{cs}$ to obtain the final commonsense-aware input representation, $Z$ , computed as
Finally, we feed this commonsense-aware input representation, $Z$ , to an autoregressive decoder that follows the same decoder computations defined in equation (4).
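The displayed forms of equations (6) to (8) are not reproduced in this text. Following the gating formulation of Yang et al. (2019) that the description above cites, one plausible reading is sketched below. The exact arrangement of terms, the element-wise products $\odot$ , the reuse of $U_K$ and $U_V$ inside the gates, and the scaling factor $\sqrt{d_k}$ (the key dimension) are assumptions, not confirmed definitions from this work.

```latex
\begin{align}
K_{cs} &= (1-\lambda_{K}) \odot K_x + \lambda_{K} \odot (G_{CS} U_K), \nonumber\\
V_{cs} &= (1-\lambda_{V}) \odot V_x + \lambda_{V} \odot (G_{CS} U_V) \tag{6}\\
\lambda_{K} &= \sigma\left(K_x W_K^{X} + G_{CS} U_K W_K^{CS}\right), \nonumber\\
\lambda_{V} &= \sigma\left(V_x W_V^{X} + G_{CS} U_V W_V^{CS}\right) \tag{7}\\
Z &= \operatorname{softmax}\left(\frac{Q_x K_{cs}^{\top}}{\sqrt{d_k}}\right) V_{cs} \tag{8}
\end{align}
```

Under this reading, a gate value near 0 keeps the text-derived key or value, while a value near 1 replaces it with the projected commonsense representation.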
4.1.6. Reinforcement learning-based training
We initialize our model’s weights $\theta$ with weights of a pre-trained sequence-to-sequence generative model. We then fine-tune the model with the following two training objective functions: (1) negative log-likelihood, that is, the maximum likelihood estimation (MLE) objective function, which works in a supervised manner to optimize the weights, $\theta$ , as defined in equation (9):
(2) On top of the MLE objective function, we also employ a reward-based training objective function. Inspired by Sancheti et al. (Reference Sancheti, Krishna, Srinivasan and Natarajan2020), we use a BLEU (Papineni et al. Reference Papineni, Roukos, Ward and Zhu2002) based reward function. We define the BLEU-based reward $R_{BLEU}$ in equation (10):
where $Y^{\prime}_i$ denotes the output sequence sampled from the conditional probability distribution at each decoding time step and $Y^g_i$ denotes the output sequence obtained by greedily maximizing the conditional probability distribution at each time step. To maximize the expected reward, $R_{BLEU}$ , of $Y^{\prime}_{i}$ , we use the policy gradient technique, which is defined in equation (11).
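The reward-based objective can be illustrated with a small sketch. It assumes the common self-critical form in which the reward is the BLEU of the sampled sequence minus the BLEU of the greedy sequence, both measured against the reference, and a REINFORCE-style surrogate loss. The function names, the add-one smoothing variant, and the toy tokens are illustrative assumptions, not the paper's implementation.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def smoothed_bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU with add-one smoothing for n > 1 (assumed variant)."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if n > 1:  # smoothing keeps higher-order precisions non-zero
            overlap, total = overlap + 1, total + 1
        if overlap == 0:  # no unigram overlap at all
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty for candidates shorter than the reference.
    if len(candidate) > len(reference):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(reference) / len(candidate))
    return brevity * math.exp(sum(log_precisions) / max_n)


def self_critical_reward(sampled, greedy, reference):
    """Assumed reward: BLEU(sampled) - BLEU(greedy), both against the reference."""
    return smoothed_bleu(sampled, reference) - smoothed_bleu(greedy, reference)


def policy_gradient_loss(sampled_log_probs, reward):
    """REINFORCE surrogate: minimizing it raises the log-probability of
    sampled sequences whose reward is positive."""
    return -reward * sum(sampled_log_probs)


reference = "a b c d e".split()  # toy tokens, not real data
sampled = "a b c d e".split()
greedy = "a b x d e".split()
reward = self_critical_reward(sampled, greedy, reference)
loss = policy_gradient_loss([-0.1, -0.2, -0.1, -0.3, -0.2], reward)
```

Using the greedy output as a baseline means the model is rewarded only when sampling beats its own deterministic prediction, which reduces the variance of the gradient estimate.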
4.1.7. Inference
During the training process, we have access to both the input sentence ( $X_{i}$ ) and the target sequence ( $Y_i$ ). Thus, we train the model using the teacher forcing approach, that is, using the target sequence as the decoder input instead of the tokens predicted at prior time steps. However, inference must be done in an autoregressive manner, as we do not have access to target sequences to guide the decoding process. After obtaining the predicted sequence $Y^{\prime}_{i}$ , we split that sequence around the special character ( $\lt \gt$ ) to get the corresponding predictions for the two tasks, stereotypical bias and target group, as described in equation (1).
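The post-processing step described above can be sketched as follows. The ordering (stereotype first, then target group), the helper name `split_prediction`, and the placeholder strings are assumptions made for illustration; the actual output layout is fixed by equation (1).

```python
def split_prediction(sequence, delimiter="<>"):
    """Split a generated sequence into (assumed) stereotype and target group."""
    parts = [part.strip() for part in sequence.split(delimiter)]
    stereotype = parts[0]
    # If the model failed to emit the delimiter, no target group is recovered.
    target_group = parts[1] if len(parts) > 1 else ""
    return {"stereotype": stereotype, "target_group": target_group}


# Placeholder strings, not real model output.
prediction = split_prediction("example stereotype text <> religion")
malformed = split_prediction("example stereotype text")
```

Handling the missing-delimiter case explicitly matters in practice, because a generative model is not guaranteed to emit the separator.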
5. Experiments and results
This section contains a detailed explanation of the experimental settings and the corresponding results. Certain standard baseline models are also mentioned for evaluating our results. The final part of this section is the ablation study and error analysis.
5.1. Experimental settings
In this section, we detail the hyperparameters and experimental settings used in our work. We performed all experiments on a Tyrone machine with an Intel Xeon W-2155 processor, 196 GB of DDR4 RAM, and an 11 GB Nvidia 1080Ti GPU. We executed each model five times and report the average results. We used mBART and mT5 as the base models for both CGenEx-con and CGenEx-fuse. Both models are trained for a maximum of 110,000 epochs with a batch size of 16. The Adam optimizer is used to train the models with an epsilon value of $10^{-8}$ . All the models are implemented using scikit-learn and PyTorch as a backend. For the target category detection task, accuracy and macro-F1 metrics are used to evaluate predictive performance. For the stereotype generation task, we used BLEU (Papineni et al. 2002b), ROUGE-L (ROUGE, 2004), and BERTScore (Zhang et al. Reference Zhang, Kishore, Wu, Weinberger and Artzi2019).
(i) BLEU: One of the earliest metrics to be used to measure the similarity between two phrases is BLEU. It was first proposed for machine translation and is described as the geometric mean of n-gram precision scores times a brevity penalty for short sentences. We apply the smoothed BLEU in our experiments as defined in Lin and Och (Reference Lin and Och2004).
(ii) ROUGE-L: ROUGE was first presented for the assessment of summarization systems, and this evaluation is carried out by comparing overlapping n-grams, word sequences, and word pairs. In this work, we employ the ROUGE-L version, which measures the longest common subsequences between a pair of phrases.
(iii) BERTScore: It is a similarity metric for text generation tasks based on pre-trained BERT contextual embeddings. BERTScore uses a weighted aggregate of cosine similarities between two phrases’ tokens to determine how similar they are.
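As an illustration of the ROUGE-L idea in item (ii), the following minimal sketch computes the longest common subsequence between two token lists and turns it into an F-score. Weighting precision and recall equally is an assumption, since the text does not state the weighting used, and the example sentences are toy inputs.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L F-score with equal weight on precision and recall (assumed)."""
    if not candidate or not reference:
        return 0.0
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)


candidate = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()
score = rouge_l(candidate, reference)
```

Here the longest common subsequence is "the cat on the mat" (five tokens), so precision and recall are both 5/6.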
5.2. Standard baselines
We have developed the following standard baselines for a fair comparison with our proposed model.
Classification baselines: We have experimented with four standard baselines as proposed in Mathew et al. (Reference Mathew, Saha, Yimam, Biemann, Goyal and Mukherjee2020) for the target group identification task. BERT (Devlin et al. Reference Devlin, Chang, Lee and Toutanova2018) is a language model based on a bidirectional transformer encoder with a multi-head self-attention mechanism. We selected mBERT, which has been trained on 104 different languages, including Hindi. The sequence output generated by mBERT is used as the input embedding for the first three baselines.
1. CNN-GRU: The sequence output from BERT, with dimensions 128 $\times$ 768, is passed through 1D CNN layers. These layers consist of three kernel sizes (1, 2, 3) and one hundred filters for each size. The resulting convoluted features are then fed into a GRU layer. The hidden output from the GRU layer is passed to a fully connected (FC) layer with one hundred neurons, followed by an output softmax layer.
2. BiRNN: The input is fed into a bidirectional GRU (Bi-GRU) with 128 hidden units, generating a 256-dimensional hidden vector. This hidden vector is then passed to an FC layer, followed by output layers for the final class prediction.
3. BiRNN-attention: Similar to the previous baseline model, but with the addition of an attention layer between the Bi-GRU and FC layers.
4. BERT-finetune: In this approach, the mBERT model is fine-tuned by adding an output softmax layer on top of the “CLS” output.
Generation baselines: We use mBART (Liu et al. Reference Liu, Gu, Goyal, Li, Edunov, Ghazvininejad, Lewis and Zettlemoyer2020) and T5 (Raffel et al. Reference Raffel, Shazeer, Roberts, Lee, Narang, Matena, Zhou, Li and Liu2020) as the baseline text-to-text generation models. We fine-tune these models on the proposed dataset with the training objective defined in equation (9). In a single-task setting, the output sequence is either the stereotype or the target group category, depending on the task being solved. In the multitask setting, the output sequence is the concatenation of the stereotype and the target group category.
5.3. Findings from experiments
Table 3 compares the results of our proposed model, CGenEx, with different baseline models on the stereotypical bias generation (SBG) and target group category identification (TI) tasks in both single-task (one task at a time) and multitask (MT) settings. From these results, we can conclude the following:
(1) It can be observed from Table 3 that BERT-finetune performs better in the TI task than the other standard baselines (CNN-GRU, BiRNN, BiRNN-attention). However, all the generative baselines based on mBART and our proposed models (CGenEx-con and CGenEx-fuse) outperform the BERT-finetune model by a huge margin, showing the superiority of pre-trained sequence-to-sequence language models.
(2) When we compare the performance of generative baselines, mBART always performs better than mT5 in both single-task (ST) and MT settings. For example, mBART-ST outperforms the mT5-ST model by a margin of (i) 18.80 percent and 21.57 percent in accuracy and F1 score for the TI task, respectively, and (ii) 5.09 percent, 5.85 percent, and 2.60 percent in BLEU, ROUGE-L, and BERTScore metrics for the SBG task, respectively. Similar trends are also observed for the proposed models, that is, any variant of our proposed model (CGenEx-con or CGenEx-fuse) performs better when built on mBART than when built on mT5. This finding suggests that mBART is significantly better at handling Hindi data than mT5.
(3) It is also evident from Table 3 that our proposed model (CGenEx-fuse) always outperforms the baselines by a significant margin for both tasks. mBART:CGenEx-fuse outperforms the best generative baseline, mBART-MT, with an improvement of (i) 1.98 percent and 2.09 percent in BLEU and ROUGE-L metrics for the SBG task, respectively, and (ii) 0.94 percent in F1 score for the TI task. However, the other variant of our proposed model (CGenEx-con) slightly underperforms the mBART baseline in single-task settings (mBART-ST: 41.87, 45.93; ST-mBART:CGenEx-con: 41.47, 45.72) and achieves almost comparable results (mBART-MT: 42.25, 46.33; MT-mBART:CGenEx-con: 42.32, 46.38) in multitask settings in terms of BLEU and ROUGE-L metrics. We discuss the possible reasons for this drop in performance by CGenEx-con in Section 5.5.1.
(4) Both CGenEx-con and CGenEx-fuse outperform the mBART-MT baseline by a margin of 1.65 percent and 0.80 percent in terms of accuracy and F1 score for TI task, respectively.
(5) When we compare the CGenEx-con and CGenEx-fuse models, we observe that CGenEx-fuse always outperforms CGenEx-con for both tasks in all settings. For example, mBART:CGenEx-fuse outperforms mBART:CGenEx-con with an improvement of 2.45 percent and 1.38 percent in ROUGE-L and BERTScore metrics for the SBG task, respectively. This observation establishes the efficacy of adding a commonsense-aware encoder module to our proposed model.
(6) From Table 3, we can conclude that multitasking always performs better than single-task settings in all the variants of our proposed model and standard generative baselines. This observation establishes the benefit of multitask learning, where two or more related tasks are solved simultaneously and help each other to improve individual performance. Table 4 shows classwise precision, recall, and F1 scores for the target identification task obtained by the single-task and multitask variants of our proposed model (mBART:CGenEx-fuse). From this table, we can observe that, except for the “culture” target class, the multitask model performs better than the single-task model for all classes. Confusion matrices of the single-task and multitask variants of the mBART:CGenEx-fuse model for the target identification task are shown in Figure 3.
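For readers who want to reproduce classwise scores of the kind reported in Table 4, the sketch below derives per-class precision, recall, and F1, plus macro-F1 and accuracy, from a confusion count. The function name and the toy labels are hypothetical and are not taken from the dataset; the sketch assumes non-empty label lists of equal length.

```python
from collections import Counter


def classwise_scores(gold, pred):
    """Per-class (precision, recall, F1), macro-F1, and accuracy."""
    labels = sorted(set(gold) | set(pred))
    confusion = Counter(zip(gold, pred))  # (gold label, predicted label) -> count
    scores = {}
    for label in labels:
        tp = confusion[(label, label)]
        fp = sum(confusion[(g, label)] for g in labels if g != label)
        fn = sum(confusion[(label, p)] for p in labels if p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        scores[label] = (precision, recall, f1)
    # Macro-F1 is the unweighted mean of the per-class F1 scores.
    macro_f1 = sum(f1 for _, _, f1 in scores.values()) / len(scores)
    accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    return scores, macro_f1, accuracy


# Toy labels for illustration only.
gold = ["race", "race", "culture", "gender"]
pred = ["race", "culture", "culture", "gender"]
scores, macro_f1, accuracy = classwise_scores(gold, pred)
```

Because macro-F1 weights every class equally, a weak minority class lowers it more than it lowers accuracy.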
We performed a statistical t-test on the values from five runs of the proposed models and baseline models and obtained a p-value of 0.005, which is less than 0.05, showing that the results are statistically significant. We employ the SciPy library function stats.ttest_ind for the t-test. We have highlighted (in gray) the results in Table 3 that are statistically significant.
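To make the test concrete, the sketch below computes the pooled-variance (Student's) two-sample t statistic in pure Python. This is the statistic that `scipy.stats.ttest_ind` returns under its default `equal_var=True` setting; whether the default was used here is an assumption. The scores are made-up numbers, not results from Table 3, and the p-value lookup is left to SciPy.

```python
import math
from statistics import mean, variance


def students_t(a, b):
    """Two-sample t statistic with pooled variance, and its degrees of freedom."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    # Pooled estimate of the common variance (sample variances, n - 1 denominator).
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, df


# Made-up scores from five hypothetical runs of two systems.
proposed_runs = [2.0, 4.0, 6.0, 8.0, 10.0]
baseline_runs = [1.0, 3.0, 5.0, 7.0, 9.0]
t_stat, dof = students_t(proposed_runs, baseline_runs)
```

With five runs per system there are 8 degrees of freedom, so the p-value is read from a t-distribution with 8 degrees of freedom.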
5.3.1. Ablation study
We performed an ablation study of our proposed models (CGenEx-con and CGenEx-fuse) to show the effect of reinforcement learning (RL) training (Table 5). It can be observed that removing RL training from either variant results in a drop in performance in both tasks (target classification and stereotype generation). Removing RL training from CGenEx-con results in a drop of 3.25 percent in the accuracy of the target classification task and 1.94 percent in the BERTScore of the stereotype generation task. Similarly, removing RL training from CGenEx-fuse results in a drop of 1.95 percent in the accuracy of the target classification task and 3.13 percent in the BERTScore of the stereotype generation task. This shows that RL training plays a vital role in improving the performance for both tasks, as the BLEU-based reward function (equation (10)) encourages the model to generate an output sequence close to the gold sequence.
5.4. Performance on English SBIC dataset
We have also evaluated our CGenEx model on the English SBIC dataset to assess its effectiveness on the English language. Table 6 shows that our proposed models, mBART:CGenEx-con and mBART:CGenEx-fuse, outperform the baseline models (GPT-1 and GPT-2) significantly in both single-task and MT settings. Interestingly, the MT variants of our models consistently outperform the single-task variants. This finding suggests that addressing both stereotype generation and target group identification together leads to improved performance in each individual task. It indicates a strong correlation between these tasks, where the performance on one task positively influences the performance on the other.
5.5. Error analysis
A detailed analysis of the results produced by the best-performing models on stereotype generation in both single-task and MT settings identified several instances where the model can falter, some of which are discussed in Table 7.
(1) Irrelevant stereotype generation: It can be seen from Table 7 that in a single-task setting, the CGenEx-con model generates a stereotype for the post, but it does not capture the underlying implicit hate speech or sarcasm of the post.
(2) Wrong target group prediction: It is also evident from Table 7 that the CGenEx-con model predicts the wrong target group for the post in both multitask and single-task settings. The true target group was culture, but the model predicts it as race, showing that the models fail to distinguish between such closely related target groups.
(3) Wrong keyword generated: The CGenEx-fuse model is able to generate a stereotype very similar to the true stereotype in the single-task setting. However, it replaces the keyword holocaust with mass destruction, which changes the context completely.
(4) Similar meaning but different tokens: In the MT setting, we can see that CGenEx-con generates a stereotype that has a semantic overlap with the true stereotype. This illustrates why BERTScore is consistently higher than the BLEU or ROUGE scores, as BERTScore measures the semantic overlap between two text embeddings.
(5) Multitask outperforms single-task model: Both CGenEx-fuse and CGenEx-con are able to generate the correct stereotype for the input post in the multitask setting but not in the single-task setting, showing that the additional task of target classification helps the model to better understand the underlying stereotype and bias.
5.5.1. Failure of CGenEx-con model in the Hindi language
It can be observed that incorporating commonsense reasoning through ConceptNet by simple concatenation to the input posts does not improve the model’s performance. This can be attributed to the fact that no multilingual commonsense database is available. To leverage ConceptNet for our dataset, we first translated the input post into English, then applied the commonsense extraction module, and finally translated the generated commonsense reasoning back into Hindi. As translation happens twice, there is a high chance of semantic loss during these two steps, which leads to ineffectual commonsense reasoning. To further support this argument, we conducted an error analysis of the effect of these translations to better understand the semantic loss that occurs while translating, which is shown in Table 8. It can be seen from the table that both translations are correct in the first two examples. However, in the third example, the first translation fails to translate the input Hindi post correctly, as it misses the English equivalent of a Hindi word, which completely changes the context of the input sentence. In the fourth example, the first translation is correct. However, the second translation (English commonsense to Hindi commonsense) fails, as it mistranslates the word ethnic, which can misguide the model rather than help it.
5.6. Limitations
In this work, we primarily focused on detecting and analyzing explicit hate speech in social media posts. Detecting sarcasm accurately in text is a complex task, as it often relies on contextual cues, tone, and understanding of cultural references. It goes beyond the scope of our current study, which primarily focuses on explicit and overt forms of hate speech. However, we acknowledge the significance of sarcasm as a potential element in hate speech and its impact on targeted groups. It is an important aspect to consider in future research and system development.
6. Conclusion and future works
As explainable AI systems help improve trustworthiness and confidence when deployed in real time, there is now a need to explain why a post is predicted as hateful by a model. To encourage more research on explainable hate speech detection in Hindi (the fourth-most-spoken language in the world), we introduced a Hindi HHES dataset that contains the stereotypical bias and target group category of a toxic post. In this work, a unified generative framework (CGenEx) based on commonsense knowledge and reinforcement learning has been proposed to simultaneously solve two tasks (stereotypical bias generation and target group category identification). We showed how a multitasking problem can be formulated as a text-to-text generation task to leverage the knowledge of large pre-trained sequence-to-sequence models in low-resource language settings. Our proposed model (CGenEx-fuse) outperforms the best baseline with an improved F1 score of 0.80 percent and ROUGE-L of 2.09 percent for the target group identification and bias generation tasks, respectively. We have also observed that the simple concatenation-based model (CGenEx-con) does not perform as expected due to the semantic loss during English to Hindi commonsense knowledge translation.
In our future work, we plan to investigate potential modifications to the CGenEx-con model to address the challenges associated with English to Hindi commonsense knowledge translation. These modifications may involve integrating language-specific semantic and syntactic rules, utilizing bilingual resources and pre-trained models, or exploring transfer learning techniques to enhance the quality of translation. Future attempts will be made to extend explainable hate speech detection in a multimodal setting considering image and text modality.
Acknowledgments
Dr Sriparna Saha gratefully acknowledges the Young Faculty Research Fellowship (YFRF) Award, supported by Visvesvaraya PhD Scheme for Electronics and IT, Ministry of Electronics and Information Technology (MeitY), Government of India, being implemented by Digital India Corporation (formerly Media Lab Asia) for carrying out this research.