1. Introduction
The Internet has led to a proliferation of hateful content (Suler, Reference Suler2004). However, what can be considered hate speech is subjective (Baucum, Cui, and John, Reference Baucum, Cui and John2020; Balayn et al. Reference Balayn, Yang, Szlavik and Bozzon2021). According to the United Nations,Footnote a hate speech is any form of discriminatory content that targets or stereotypes a group or an individual based on identity traits. In order to assist content moderators, practitioners are now looking into automated hate speech detection techniques. The currently adopted paradigm is to finetune a pretrained language model (PLM) for hate speech detection. Akin to any supervised classification task, the first step is the curation of hateful instances. While instances of online hate speech have increased, they still form a small part of the overall content on the Web. For example, on platforms like Twitter, the ratio of hate to non-hate posts curated from the data stream is 1:10 (Kulkarni et al. Reference Kulkarni, Masud, Goyal and Chakraborty2023). Thus, data curators often employ lexicons and identity slurs to increase the coverage of hateful content.Footnote b While this increases the number of explicit samples, it comes at the cost of capturing fewer instances of implied/non-explicit hatred (Davidson et al. Reference Davidson, Warmsley, Macy and Weber2017; Silva et al. Reference Silva, Mondal, Correa, Benevenuto and Weber2021). This skew toward explicit samples leaves models with less information from which to learn implicit hate. Among the myriad hate speech datasets in English (Vidgen and Derczynski, Reference Vidgen and Derczynski2020; Poletto et al. Reference Poletto, Basile, Sanguinetti, Bosco and Patti2021), only a few (Caselli et al. Reference Caselli, Basile, Mitrović, Kartoziya and Granitzer2020; ElSherief et al. Reference ElSherief, Ziems, Muchlinski, Anupindi, Seybolt, Choudhury and Yang2021; Kennedy et al. Reference Kennedy, Atari, Davani, Yeh, Omrani, Kim, Coombs, Havaldar, Portillo-Wightman, Gonzalez, Hoover, Azatian, Hussain, Lara, Cardenas, Omary, Park, Wang, Wijaya, Zhang, Meyerowitz and Dehghani2022) have annotations for “implicit” hate.
Why is implicit hate hard to detect?
It has been observed that classifiers can work effectively with direct markers of hate (Lin, Reference Lin2022; Muralikumar, Yang, and McDonald, Reference Muralikumar, Yang and McDonald2023), a.k.a. explicit hate. This behavior stems from the data distribution, since slurs are more likely to occur in hateful samples than in neutral ones. On the other hand, implicit hate on the surface appears lexically and semantically closer to statements that are non-hate/neutral. Inferring the underlying stereotype and implied hatefulness in an implicit post requires a combination of multi-hop reasoning with sufficient cultural reference and world knowledge. Existing research has established that even the most sophisticated systems like ChatGPT perform poorly in the case of implicit hate detection (Yadav et al. Reference Yadav, Masud, Goyal, Akhtar and Chakraborty2024).
At the distribution level, the aim is to bring the surface meaning closer to its implied meaning, that is, what is said versus what is intended (ElSherief et al. Reference ElSherief, Ziems, Muchlinski, Anupindi, Seybolt, Choudhury and Yang2021; Lin, Reference Lin2022). One way to reduce the misclassification of implicit hate is to manipulate the intercluster latent space via contrastive or exemplar sampling (Kim, Park, and Han, Reference Kim, Park and Han2022). Contrastive loss, like cross-entropy, operates in a per-sample setting (Chopra, Hadsell, and LeCun, Reference Chopra, Hadsell and LeCun2005), leading to suboptimal separation among classes (Liu et al. Reference Liu, Wen, Yu and Yang2016). Another technique is to infuse external knowledge. However, without explicit hate markers, providing external knowledge increases the noise in the input signal (Lin, Reference Lin2022; Yadav et al. Reference Yadav, Masud, Goyal, Akhtar and Chakraborty2024).
Proposed framework
In this work, we examine a framework for overcoming these two drawbacks. As an alternative to the per-sample contrastive approach in computer vision tasks, adaptive density discrimination (ADD), a.k.a. magnet loss (Rippel et al. Reference Rippel, Paluri, Dollár and Bourdev2016), has been proposed. Instead of relying on the single most similar positive and negative samples, ADD exploits the local neighborhood to balance interclass similarity and variability. Extensive literature (Rippel et al. Reference Rippel, Paluri, Dollár and Bourdev2016; Snell, Swersky, and Zemel, Reference Snell, Swersky and Zemel2017; Deng et al. Reference Deng, Guo, Xue and Zafeiriou2019) has established the efficacy and superiority of ADD over contrastive settings for computer vision tasks. We hypothesize that this advantage can be extended to natural language processing (NLP) and attempt to establish the same in this work.
For our use case, ADD can help improve the regional boundaries around implicit and non-hate samples that lie close to each other. However, simply employing ADD in a three-way classification of implicit, explicit, and non-hate will not yield the desired results due to the semantic and lexical similarity of implicit hate with non-hate. We, thus, introduce external context for implicit hate samples to bring them closer to their intended meaning (Kim et al. Reference Kim, Park and Han2022), allowing them to be sufficiently discriminated. To this end, we employ implied/descriptive phrases instead of knowledge tuples or Wikipedia summaries, based on empirical findings (Lin, Reference Lin2022; Yadav et al. Reference Yadav, Masud, Goyal, Akhtar and Chakraborty2024) that the latter tend to be noisy if the tuples are not directly aligned with the entities in the input statement. As outlined in Figure 1, our proposed pipeline, focused inferential adaptive density discrimination (FiADD), improves the detection of implicit hate by employing distance-metric learning to set apart the class distributions while reducing the latent distance between the implied context and implicit hate. The dual nature of the loss function is aided by nonuniform weightage, with a focus on penalizing samples near the discriminant boundary.
Through extensive experiments, we observe that FiADD variants improve overall as well as implicit class macro-F1 for LatentHatred (ElSherief et al. Reference ElSherief, Ziems, Muchlinski, Anupindi, Seybolt, Choudhury and Yang2021), ImpGab (Kennedy et al. Reference Kennedy, Atari, Davani, Yeh, Omrani, Kim, Coombs, Havaldar, Portillo-Wightman, Gonzalez, Hoover, Azatian, Hussain, Lara, Cardenas, Omary, Park, Wang, Wijaya, Zhang, Meyerowitz and Dehghani2022), and AbuseEval (Caselli et al. Reference Caselli, Basile, Mitrović, Kartoziya and Granitzer2020) datasets. Our experimental results further suggest that our framework can generalize to other tasks where surface and implied meanings differ, such as humor (Labadie Tamayo, Chulvi, and Rosso, Reference Labadie Tamayo, Chulvi and Rosso2023), sarcasm (Abu Farha et al. Reference Abu Farha, Oprea, Wilson and Magdy2022; Frenda, Patti, and Rosso, Reference Frenda, Patti and Rosso2023), irony (Van Hee, Lefever, and Hoste, Reference Van Hee, Lefever and Hoste2018), stance (Mohammad et al. Reference Mohammad, Kiritchenko, Sobhani, Zhu and Cherry2016), etc. To establish that our results are not BERT (Devlin et al. Reference Devlin, Chang, Lee and Toutanova2019) specific, we also experiment with HateBERT (Caselli et al. Reference Caselli, Basile, Mitrović and Granitzer2021a), XLM (Chi et al. Reference Chi, Huang, Dong, Ma, Zheng, Singhal, Bajaj, Song, Mao, Huang and Wei2022), and LSTM (Hochreiter and Schmidhuber, Reference Hochreiter and Schmidhuber1997).
Contributions
In short, we make the following contributions through this study:Footnote c
• We perform a thorough literature survey of implicit hate speech datasets. For the datasets employed in this study, we establish the closeness of implicitly hateful samples to non-hateful ones (Section 3) and use it to motivate our model design (Figure 1).
• We adopt ADD for the NLP setting and employ it to propose FiADD. The variants of the proposed setup allow it to be used as a pluggable unit in the PLM finetuning pipeline for hate speech detection as well as other implicit text-based tasks (Section 4).
• We manually generate implied explanations/descriptions for $798$ and $404$ implicit hate samples for AbuseEval and ImpGab, respectively. These annotations contribute to the corpora for unmasking implicit hate (Section 5).
• Our exhaustive experiments, analyses, and ablations highlight how FiADD compares with the cross-entropy loss on three hate speech datasets. We also extend our analysis to three other SemEval tasks to demonstrate the model’s generalizability (Section 6).
• We perform an analysis to assess how the latent space evolves under FiADD (Section 7).
Research scope and social impact
Early detection of implicit hate will help reduce the psychological burden on the target groups, prevent conversation threads from turning more intense, and also assist in counter-hate speech generation. It is imperative to note the limitations of PLMs in understanding implicit hate speech. We attempt to overcome this by incorporating latent space alignment of surface and implied context. However, PLMs cannot replace human content moderators and can only be assistive.
2. Related work
Given that this study proposes a distance-based objective function primarily for implicit hate detection, the literature survey focuses on three main aspects: (i) implicit hate datasets, (ii) implicit hate detection, and (iii) improvement in classification tasks via distance-based metrics. To determine the relevant literature for implicit hate within the vast hate speech literature, we make use of the up-to-date hate speech corpusFootnote d (Vidgen and Derczynski, Reference Vidgen and Derczynski2020) as well as the ACL Anthology. The keywords used to search for relevant literature on the two corpora were “implicit” and “implicit hate,” respectively.
Implicit hate datasets
The task of classifying hateful texts has led to an avalanche of hate detection datasets and models (Schmidt and Wiegand, Reference Schmidt and Wiegand2017; Tahmasbi and Rastegari, Reference Tahmasbi and Rastegari2018; Vidgen and Derczynski, Reference Vidgen and Derczynski2020). Before discussing the literature, it is imperative to point out that issues of generalizability (Yin and Zubiaga, Reference Yin and Zubiaga2021), bias (Balayn et al. Reference Balayn, Yang, Szlavik and Bozzon2021; Garg et al. Reference Garg, Masud, Suresh and Chakraborty2023), adversarial robustness (Masud et al. Reference Masud, Singh, Hangya, Fraser and Chakraborty2024b), and outdated benchmarks (Masud et al. Reference Masud, Khan, Goyal, Akhtar and Chakraborty2024a) are prevalent in hate speech detection at large and form an active area of research.
Focusing on implicit hate datasets, we searched the hate speech database (Vidgen and Derczynski, Reference Vidgen and Derczynski2020) with the keyword “implicit” as an indicator of whether the label set contains “implicit” labels and obtained $4$ results. DALC (Caselli et al. Reference Caselli, Schelhaas, Weultjes, Leistra, van der Veen, Timmerman and Nissim2021b) is a Dutch dataset consisting of $8k$ tweets curated from Twitter, labeled for the level of explicitness as well as the target of hate. Meanwhile, ConvAbuse consists of 4k English samples obtained from in-the-wild conversations with AI chatbots. Each conversation is marked for the degree of abuse (1 to -3) and directness (explicit or implicit). The other two datasets are also in English. AbuseEval (Caselli et al. Reference Caselli, Basile, Mitrović, Kartoziya and Granitzer2020) is a $14k$-sample Twitter dataset labeled for “abusiveness” and “explicitness.” On the other hand, ImpGab (Kennedy et al. Reference Kennedy, Atari, Davani, Yeh, Omrani, Kim, Coombs, Havaldar, Portillo-Wightman, Gonzalez, Hoover, Azatian, Hussain, Lara, Cardenas, Omary, Park, Wang, Wijaya, Zhang, Meyerowitz and Dehghani2022) consists of 27k posts from Gab, which contain a hierarchy of annotations for the type and target of hate.
Meanwhile, from the ACL Anthology (we looked at the results from the first two pages out of 10), we discovered four more datasets. LatentHatred is the most extensive and most widely used implicit hate speech dataset. It consists of $21k$ Twitter samples labeled for implicit hate as well as $6$ additional sub-categories of implicitness. It also contains free-text human annotations explaining the implied meaning behind the implicit posts. Along similar lines, SBIC (Ocampo et al. Reference Ocampo, Sviridova, Cabrio and Villata2023c) is a collection of $44k$ implicit posts curated from online platforms with human-annotated explanations. However, unlike the complete sentences in LatentHatred, SBIC focuses on single-phrase explanations. Further, SBIC does not have a direct marker for the explicitness of a post, and by default, all posts are implicit. For specific target groups and types of hate speech, such as sexism (Kirk et al. Reference Kirk, Yin, Vidgen and Röttger2023) or xenophobia against immigrants (Sánchez-Junquera et al. Reference Sánchez-Junquera, Chulvi, Rosso and Ponzetto2021), researchers have also explored employing multiple-level annotations as a means of obtaining granular label spans as explanations for the hateful instance. This serves as an alternative to free-text annotations, allowing for more structured linguistic analysis (Merlo et al. Reference Merlo, Chulvi, Ortega-Bueno and Rosso2023) of implicitness. Further, building upon the multimodal hate meme dataset MMHS150K (Gomez et al. Reference Gomez, Gibert, Gomez and Karatzas2020), Botelho, Hale, and Vidgen (Reference Botelho, Hale and Vidgen2021) proposed a multimodal implied hate dataset in which the different types of implicitness arise as a combination of the text and the image.
More recently, the ISHate (Ocampo et al. Reference Ocampo, Sviridova, Cabrio and Villata2023c) dataset has been curated by combining existing hate speech and counter-hate speech datasets and relabeling the samples with explicit–implicit markers; it consists of $30k$ samples labeled as explicit, implicit, subtle, or non-hate. It is interesting to note that in their analysis, the authors do not showcase how the different datasets interact with each other in the latent space. We hypothesize that the performance improvements in hate detection are obtained not as a result of modeling but due to the fact that these samples are obtained from distinct datasets, that is, distinct distributions. For example, counter-hate datasets do not contribute to the non-hate class. Meanwhile, the majority of implicit hate samples come from LatentHatred and ToxiGen (Hartvigsen et al. Reference Hartvigsen, Gabriel, Palangi, Sap, Ray and Kamar2022). The latter is a curation of around 1M toxic and implicit statements obtained via controlled generation.
Modeling implicit hate speech in NLP
Despite a large body of hate speech benchmarks, the majority of datasets fail to demarcate implicit hate. Even during the annotation process, fine-grained variants of offensiveness like abuse, provocation, and sexism (Founta et al. Reference Founta, Djouvas, Chatzakou, Leontiadis, Blackburn, Stringhini, Vakali, Sirivianos and Kourtellis2018; Kulkarni et al. Reference Kulkarni, Masud, Goyal and Chakraborty2023; Kirk et al. Reference Kirk, Yin, Vidgen and Röttger2023) are favored over the nature of hate, that is, explicit vs implicit. As annotation schemas have a direct impact on downstream tasks (Rottger et al. Reference Rottger, Vidgen, Hovy and Pierrehumbert2022), the common vogue of binary hate speech classification, while easier to annotate and model, focuses on explicit forms of hate. It also comes at the cost of not analyzing the erroneous cases where implicit hate is classified as neutral content. This further motivates us to examine the role of PLMs in three-way classification in this work.
Given the skewness in the number of implicit hate samples in a three-way classification setup, data augmentation techniques have been explored. For example, Ocampo et al. (Reference Ocampo, Sviridova, Cabrio and Villata2023c) employed multiple data augmentation techniques, such as substitution and back translation, and observed that only when multiple techniques were combined did they surpass finetuned HateBERT in performance. Adversarial data collection (Ocampo, Cabrio, and Villata, Reference Ocampo, Cabrio and Villata2023a) and LLM-prompting (Kim et al. Reference Kim, Park, Namgoong and Han2023) have also been explored for augmenting and improving implicit hate detection.
Language models are being employed not only to augment the implicit hate corpora but also to detect hate (Ghosh and Senapati, Reference Ghosh and Senapati2022; Plaza-del Arco, Nozza, and Hovy, Reference Plaza-del Arco, Nozza and Hovy2023). With the recent trend of prompting generative large language models (LLMs), hate speech detection is now being evaluated under zero-shot (Nozza, Reference Nozza2021; Plaza-del Arco et al. Reference Plaza-del Arco, Nozza and Hovy2023; Masud et al. Reference Masud, Singh, Hangya, Fraser and Chakraborty2024) and few-shot settings as well. An examination of the hate detection techniques under fine-grained hate speech detection has revealed that traditional models, either statistical (Waseem and Hovy, Reference Waseem and Hovy2016; Davidson et al. Reference Davidson, Warmsley, Macy and Weber2017) or deep learning-based (Badjatiya et al. Reference Badjatiya, Gupta, Gupta and Varma2017; Founta et al. Reference Founta, Chatzakou, Kourtellis, Blackburn, Vakali and Leontiadis2019), are characterized by a low recall for hateful samples (Kulkarni et al. Reference Kulkarni, Masud, Goyal and Chakraborty2023). To increase the information gained from the implicit samples, researchers are now leveraging external context.
Studies have mainly explored the infusion of external context in the form of knowledge entities, either in the form of knowledge-graph (KG) tuples (ElSherief et al. Reference ElSherief, Ziems, Muchlinski, Anupindi, Seybolt, Choudhury and Yang2021) or Wikipedia summaries (Lin, Reference Lin2022). However, both works have observed that knowledge infusion at the input level lowered the performance on fine-grained implicit categories. An examination of the quality of knowledge tuple (Yadav et al. Reference Yadav, Masud, Goyal, Akhtar and Chakraborty2024) infusion for implicit hate reveals that KG tuples fail to enlist information that directly connects with the implicit entities, acting more as noise than information. Apart from textual features, social media platform-specific features like user metadata, user network, and conversation thread/timeline can also be employed to improve the detection of hate and capture implicitness in long-range contexts (Ghosh et al. Reference Ghosh, Suri, Chiniya, Tyagi, Kumar and Manocha2023). However, such features are platform-specific, complex to curate, and resource-intensive to operate (in terms of storage and memory to train network embeddings). From the latent space perspective, researchers have explored how the infusion of a common target group can bring explicit and implicit samples closer (Ocampo, Cabrio, and Villata, Reference Ocampo, Cabrio and Villata2023b), aiding in the detection of the latter. While the idea is intuitive since implicit hate and explicit slurs are specific to a target group, here, the extent of overlap in the case of multiple target groups or intersectional identities is not adequately addressed.
Distance-metric learning
Akin to most supervised classification tasks in NLP, all the setups reviewed so far finetune an encoder-only BERT-based PLM with cross-entropy (CE) loss. Therefore, in our study, BERT + CE acts as a baseline. Despite its popularity, CE’s impact on the inter/intra-class clusters is suboptimal (Liu et al. Reference Liu, Wen, Yu and Yang2016). Since classification tasks can be modeled as obtaining distant clusters per class, one can exploit clustering and distance-metric approaches to enhance the boundary among the labels, leading to improved classification performance. Distance-metric learning-based methods employ either deep divergence via distribution (Cilingir, Manzelli, and Kulis, Reference Cilingir, Manzelli and Kulis2020) or point-wise norm (Chopra et al. Reference Chopra, Hadsell and LeCun2005). The most popular deep metric learning methods belong to the contrastive loss family (Chopra et al. Reference Chopra, Hadsell and LeCun2005; Schroff, Kalenichenko, and Philbin, Reference Schroff, Kalenichenko and Philbin2015; Chen et al. Reference Chen, Chen, Zhang and Huang2017). In order to improve upon the CE loss and benefit from the one-to-one mapping of implicit hate and its implied meaning, contrastive learning has been explored (Kim et al. Reference Kim, Park and Han2022), but it has provided only slight improvements.
However, like cross-entropy, contrastive loss operates on a per-sample basis; even when positive and negative exemplars are considered, they are curated per sample. Clustering-inspired methods (Rippel et al. Reference Rippel, Paluri, Dollár and Bourdev2016; Song et al. Reference Song, Jegelka, Rathod and Murphy2017) have sought to overcome this issue by focusing on subclusters per class. ADD, a.k.a. magnet loss (Rippel et al. Reference Rippel, Paluri, Dollár and Bourdev2016), in particular offers a good starting point for shifting intercluster distances in a way that extends to our use case. Given that ADD has surpassed contrastive losses in other tasks, we use ADD as a starting point and improve upon its formulation for implicit hate detection. As the current ADD setup fails to account for the implied meaning, we infuse external information into the latent space as an implied/inferential cluster.
3. Intuition and background
This section attempts to establish the need for distance-metric-based learning for the task of hate speech detection. Inspired by our initial experiments, we provide an intuition for ADD.
Hypothesis
A manual inspection of the hate speech datasets reveals that non-hate is closer to implicit hate than to explicit hate. We, thus, compare the intercluster distance between non-hate and implicit hate with that between non-hate and explicit hate.
Setup
For three implicit hate speech datasets, that is, LatentHatred, ImpGab, and AbuseEval, we embed all the samples of a dataset in the latent space using the $768$-dimensional CLS embedding from BERT. The embeddings are not finetuned on any dataset or task related to hate speech so as to reduce the impact of confounding variables. We then consider three clusters directly adopted from the implicit, explicit, and non-hate classes and record the pairwise average linkage distance (ALD) and average centroid linkage distance (ACLD) among these clusters.
As the name suggests, for ACLD, we first obtain the embedding for the center of each cluster as a central tendency (mean or median) of all its representative samples and then compute the distance between the centers. This distance indicates the overall closeness of the two centers, which, in our case, measures the extent of similarity between the two classes. We also assess the latent space more granularly via ALD. In ALD, the distance between two clusters is obtained as the average distance between all possible pairs of samples where each element of the pair comes from a distinct group. It allows for a more fine-grained evaluation of the latent space, as not all data points are equidistant from each other or their respective centers. Formally, consider a system with $E$ points in $\mathbb{R}^d$, where each point $e_i$ belongs to one of the $N$ clusters $c^n(e_i)$ and $\mu^n = \frac{1}{|c^n|}\sum_{e_i \in c^n}e_i$ is the cluster center. For clusters $a$ and $b$, $ACLD^{a,b}=dist(\mu^a,\mu^b)$. Meanwhile, $ALD^{a,b}=\frac{1}{|c^a| \cdot |c^b|}\sum_{e_i \in c^a,\, e_j \in c^b}dist(e_i,e_j)$, where $c^a(e_i)\neq c^b(e_j)$.
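To make the two measures concrete, the following is a minimal sketch of how ACLD and ALD can be computed from the frozen CLS embeddings; the function and variable names, the use of Euclidean distance, and the random stand-in arrays are our assumptions for illustration.

```python
import numpy as np
from scipy.spatial.distance import cdist

def acld(cluster_a: np.ndarray, cluster_b: np.ndarray) -> float:
    """Average centroid linkage distance: distance between the two cluster means."""
    return float(np.linalg.norm(cluster_a.mean(axis=0) - cluster_b.mean(axis=0)))

def ald(cluster_a: np.ndarray, cluster_b: np.ndarray) -> float:
    """Average linkage distance: mean pairwise distance across the two clusters."""
    return float(cdist(cluster_a, cluster_b).mean())

# Stand-ins for frozen BERT CLS embeddings (768-d), grouped by gold label.
rng = np.random.default_rng(0)
non_hate = rng.normal(size=(100, 768))
implicit = rng.normal(size=(60, 768))
explicit = rng.normal(size=(40, 768))

print("ACLD(non-hate, implicit):", acld(non_hate, implicit))
print("ALD(non-hate, implicit): ", ald(non_hate, implicit))
print("ACLD(non-hate, explicit):", acld(non_hate, explicit))
print("ALD(non-hate, explicit): ", ald(non_hate, explicit))
```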
The intuition behind using both ACLD and ALD stems from the fact that online hate speech is part of the larger discourse on the Web. Thus, at the level of an individual data point, labeling an isolated instance as hateful can be hard. Furthermore, some implicit samples may be closer to the explicit hate samples in terms of lexicon or semantics. On the other hand, it is also possible for some non-hate samples to contain slurs that are commonplace and context-specific but not objectionable within the community (Diaz et al. Reference Diaz, Amironesei, Weidinger and Gabriel2022; Röttger et al. Reference Röttger, Vidgen, Nguyen, Waseem, Margetts and Pierrehumbert2021). ACLD and ALD allow us to capture these dynamics at a macroscopic and a microscopic level, respectively.
Observation
From Table 1, we observe that under both ALD and ACLD, non-hate is closer to implicit samples. As expected, ALD shows more variability than ACLD. It follows from the fact that the mere presence of a keyword/lexicon does not render a sample as hateful.
Stemming from these observations, we see a clear advantage of employing a distance-metric approach that can exploit the granular variability in the latent space. Adaptive density discrimination (ADD) based clustering loss, which optimizes the inter and intra-clustering around the local neighborhood, directly maps to our problem of regional variability among the hateful and non-hateful samples. Further, our observations motivate the penalization of samples closer to the boundary responsible for increasing variability. The proposed model, as motivated by our empirical observations, is outlined in Figures 1 and 2.
3.1 Background on adaptive density discrimination
Here, we briefly outline ADD, which forms the backbone of our proposed framework. ADD is a clustering-based distance-metric. It evaluates the local neighborhood or clusters among the samples after each training iteration. At each epoch, after the training samples have been encoded into vector space, ADD clusters all data points within a class into $K$ representative local groups via K-means clustering. The subclusters within a class help capture the inter/intra-label similarity around the local neighborhood. If there are $N$ classes, then each training sample will belong to one of the $N*K$ subclusters.
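As an illustration, the following is a minimal sketch of this per-class subclustering step; the function and variable names are ours, and scikit-learn's KMeans stands in for whichever K-means implementation is used.

```python
import numpy as np
from sklearn.cluster import KMeans

def class_subclusters(embeddings: np.ndarray, labels: np.ndarray, k: int = 3):
    """Run K-means separately within each class, yielding N*K subclusters.

    Returns a dict mapping class id -> (fitted KMeans model, subcluster
    assignment for that class's samples).
    """
    subclusters = {}
    for c in np.unique(labels):
        class_points = embeddings[labels == c]
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(class_points)
        subclusters[c] = (km, km.labels_)
    return subclusters

# With N=3 classes (non-hate, implicit, explicit) and K=3, this yields 9 subclusters.
```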
Given that mapping and tracking distances among all $N*K$ groups are computationally expensive, ADD randomly selects a reference/seed cluster $I_s^c$ representing class $c$ and then picks $M$ imposter clusters $I_{s_1}^{c'}, \ldots, I_{s_M}^{c'}$ from the local neighborhood but from disparate classes ($c \neq c'$), based on their proximity to the seed cluster. To understand the concept of seed and imposter clusters better, consider the three-way hate speech classification task with implicit, explicit, and non-hate labels. As we aim to distinguish implicit hate speech better, we select one of the implicit hate subclusters as the seed. Consequently, the imposter clusters will be from explicit hate or non-hate, where implicit hate can be misclassified. ADD then samples $D$ points uniformly at random from each sampled cluster. For the $d^{th}$ data point in the $m^{th}$ cluster, $r_d^m$ is its encoded vector representation, with $C(.)$ representing the class of the sample under consideration. Subsequently, $\mu^m = \frac{1}{D}\sum_{d=1}^{D}r_d^m$ acts as the mean representation of the $m^{th}$ cluster. Here, ADD applies Equation 1 to discriminate the local distribution around a point:
Here, $\alpha$ is a scalar margin for the cluster separation gap. The variance of all samples away from their respective centers is approximated via $\sigma ^2 =\frac{1}{MD-1}\sum _{m=1}^{M}\sum _{d=1}^{D}\left \|r_d^m - \mu ^m \right \|_2^2$ .
After each iteration, as the embedding space gets updated, so do the subclusters; this lends a dynamic nature to ADD, allowing for the selection of random subclusters and data points after each iteration. The overall loss is computed via Equation 2.
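Since the display equations do not carry over into this rendering, we restate the standard magnet-loss formulation from Rippel et al. (Reference Rippel, Paluri, Dollár and Bourdev2016), which Equations 1 and 2 follow up to notation; here $\{\cdot\}_+$ denotes the hinge function and $C(\cdot)$ the class of a cluster center:

$p^{ADD}(r_d^m) = \frac{\exp\left(-\frac{1}{2\sigma^2}\lVert r_d^m - \mu^m \rVert_2^2 - \alpha\right)}{\sum_{\mu^{m'}:\, C(\mu^{m'}) \neq C(r_d^m)} \exp\left(-\frac{1}{2\sigma^2}\lVert r_d^m - \mu^{m'} \rVert_2^2\right)}$

$\ell^{ADD} = \frac{1}{MD}\sum_{m=1}^{M}\sum_{d=1}^{D}\left\{-\log p^{ADD}(r_d^m)\right\}_{+}$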
4. Proposed method
The proposed FiADD framework consists of a standard finetuning pipeline with an encoder-only PLM followed by a projection layer $R_h$ and a classification head (CH). To reduce the distance between the implicit hate (imp) and implied (inf) clusters, FiADD measures the average distance of implicit points from their implied meaning as a ratio of their distance to the explicit and non-hate subspaces. During PLM finetuning, our objective combines with the cross-entropy loss to improve the detection of hate. An overview of FiADD’s architecture is presented in Figure 2. For each training instance $(x_d, y_d) \in X$, with input $x_d$ and label $y_d$, $x_p=PLM(x_d)$ is the encoded representation obtained from the PLM. The encodings are projected to obtain $r_d = R_h(x_p)$. Here, $x_p \in \mathbb{R}^{768}$ and $r_d \in \mathbb{R}^{128}$; the lower dimensionality of $r_d$ allows for faster clustering.
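A minimal PyTorch-style sketch of this finetuning pipeline follows, assuming a Hugging Face encoder and CLS pooling as in Section 3; the class name and attribute names are ours for illustration.

```python
import torch.nn as nn
from transformers import AutoModel

class FiADDEncoder(nn.Module):
    """PLM encoder -> projection R_h (768 -> 128) -> classification head CH."""

    def __init__(self, plm_name: str = "bert-base-uncased", num_classes: int = 3):
        super().__init__()
        self.plm = AutoModel.from_pretrained(plm_name)
        self.projection = nn.Linear(self.plm.config.hidden_size, 128)  # R_h
        self.classifier = nn.Linear(128, num_classes)                  # CH

    def forward(self, input_ids, attention_mask):
        # x_p: CLS representation from the PLM
        x_p = self.plm(input_ids=input_ids,
                       attention_mask=attention_mask).last_hidden_state[:, 0]
        r_d = self.projection(x_p)   # used for both the clustering losses and CH
        logits = self.classifier(r_d)
        return r_d, logits
```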
Novel component: inferential infusion
As each output label $y_d$ belongs to one of the distinct classes ($c_i \in C$), we employ the respective embeddings $r_d$ and an offline K-means algorithm to obtain $K$ subclusters per class. For implicit hate samples, the latent representation of their implied/inferential counterparts $\tilde{x_d}$ is denoted as $\tilde{r_d} = R_h(\tilde{x_d})$. If $r_1^m, \ldots, r_D^m$ are the representations of the $D$ samples of the $m^{th}$ implicit cluster, then $\tilde{r}_1^m,\ldots, \tilde{r}_D^m$ represent their respective inferential forms. The updated inferential adaptive density discrimination ($ADD^{inf}$) helps reduce the distance between $(r_d,\tilde{r_d})$ for implicit hate samples via Equation 3.
Here, $\mu ^m$ ( $\sigma ^2$ ) and $\tilde{\mu ^m}$ ( $\tilde{\sigma ^2}$ ) are the mean (variance) representations of the implicit and inferential/implied form for $m^{th}$ implicit cluster, respectively.
The above equation can be broken into two parts. The first part is equivalent to ADD, thus focusing on reducing the intra-cluster distance within the implicit class. The second part brings the implicit class closer to its implied meaning. Meanwhile, in the case of explicit or non-hate clusters, there is no mapping to an inferential/implied cluster, and $ADD^{inf}$ in Equation 3 reduces to ADD in Equation 1.
Novel component: focal weight
Both $ADD^{inf}$ and ADD assign uniform weight to all samples under consideration. In contrast, we have established that some instances are closer to the boundary of the imposter clusters and harder to classify (i.e., contribute more to the loss). Inspired by the concept of focal cross-entropy (Lin et al. Reference Lin, Goyal, Girshick, He and Dollár2017), we improve the $ADD^{inf}$ objective by introducing $ADD^{inf + foc}$ . Under $ADD^{inf + foc}$ , the loss on each sample is multiplied by a factor called the focused term $(1-p^{ADD^{inf}}(r_d^m))^\gamma$ . $\gamma$ , a hyperparameter, acts as a magnifier. The formulation assigns uniform weight as $\gamma \rightarrow 0$ , reducing to $ADD^{inf}$ . Analogously, the focal term is paying “more attention” to specific data points. Even without inferential infusion, our novel focal term can be incorporated as $ADD^{foc}$ as enlisted in Equation 4.
Here, $\ell ^{ADD^{inf+foc}}$ ( $\ell ^{ADD^{foc}}$ ) captures the setup with (without) inferential objective. We utilize $p^{ADD^{inf}}$ (Equation 3) for the former and $p^{ADD}$ (Equation 1) for the latter. Despite $ADD^{foc}$ being a minor update on ADD, we empirically observe that focal infusion improves ADD.
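To make the focal reweighting concrete, the following is a minimal sketch; the function name is ours, `p_add` stands for the per-sample likelihood $p^{ADD}(r_d^m)$ (or $p^{ADD^{inf}}(r_d^m)$), we assume the per-sample loss is the negative log-likelihood as in ADD, and $\gamma = 2$ follows the hyperparameter analysis in Section 6.

```python
import torch

def focal_add_loss(p_add: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
    """Focal variant of the ADD objective over a batch of sampled points.

    The focal term (1 - p)^gamma down-weights well-separated samples and
    concentrates the loss on samples near the imposter boundary.
    """
    focal_weight = (1.0 - p_add).clamp(min=0.0) ** gamma
    per_sample = focal_weight * (-torch.log(p_add.clamp(min=1e-12)))
    return per_sample.mean()
```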
Training pipeline
It should be noted that selecting the seed cluster and its subsequent imposter clusters is a random process for the initial iterations. For later iterations, the class with the highest loss margin is assigned as the seed. Here, $ADD^{inf + foc}$ operates on implicit hate and overcomes the drawback of existing literature, where implicit hate detection fails to account for implied context. It is also essential to point out that this evaluation is carried out in the local neighborhood, aided by the focal loss (Equation 4).
Overall loss
Apart from being employed in $\ell^{ADD^{*}}$, $r_d$ is also passed through a classification head $CH(r_d)$. We combine CE with the focal inferential objective to obtain the final loss of FiADD, with $\beta$ controlling the contribution of the two losses, as given in Equation 5.
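Assuming Equation 5 interpolates the two terms linearly with $\beta$ (our reading; the exact combination may differ), a sketch of the combined objective is shown below, with $\beta = 0.5$ following the hyperparameter settings in Section 5.

```python
import torch
import torch.nn.functional as F

def fiadd_loss(logits: torch.Tensor, labels: torch.Tensor,
               add_inf_foc_term: torch.Tensor, beta: float = 0.5) -> torch.Tensor:
    """Combine cross-entropy from the classification head with the focal
    inferential ADD term; beta controls their relative contribution."""
    ce = F.cross_entropy(logits, labels)
    return beta * ce + (1.0 - beta) * add_inf_foc_term
```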
Inference
During inference, the system does not have access to implied meaning. Once the PLM is trained via FiADD, the CH performs classification similar to any finetuned PLM. Here, we rely on the latent space being modified so that the implicit statements are closer to their semantic or implied form and sufficiently separated from other classes.
Note on K-means
As a clustering algorithm, K-means is the most generic as it does not assume any dataset property (like hierarchy) except for the semantic similarity of the samples. Further, the K-means computation happens offline in each epoch, that is, it does not consume GPU resources. In the future, we aim to employ faster versions of K-means to improve training latency. Meanwhile, the computational complexity of FiADD during inference is the same as the finetuned PLM.
5. Experimental setup
FiADD provides an opportunity to improve the detection of implicit context. In the first set of experiments, we focus on the task of hate speech classification with datasets that consist of implicit hate labels. In the second set of experiments, we establish the generalizability of the proposed framework via SemEvalFootnote e datasets on three separate tasks. Table 2 provides both sets’ label-wise distribution. In all the tasks, the surface form of the text varies contextually from its semantic structure. Besides introducing the datasets and annotation schema, this section also outlines the hyperparameters and baselines curated for our evaluation.
Implicit hate classification datasets
Based on our literature survey of implicit hate datasets, we discard the ones that are either multilingual (DALC) or multimodal (ConvAbuse, MMHS150K), as modeling them is out of the scope of the current work. Further, SBIC and ToxiGen do not offer 3-way labels; hence, they are discarded, too. From among the English datasets left for assessment, we drop ISHate as it is an aggregated dataset, and its implicit samples are already covered by LatentHatred. Finally, we have LatentHatred, AbuseEval, and ImpGab as English text-only datasets with explicit, implicit, and non-hate labels that suit our task. For LatentHatred, we employ the first level of annotation and the existing manual annotations of implied hatred for implicit samples. Meanwhile, AbuseEval and ImpGab do not have implied descriptions. We manually annotate the implicit samples of these datasets with their implied meaning, generated as free text.
Annotation for implied hate
Implied contexts are succinct statements that make explicit the underlying stereotype. Note that the implied context cannot be considered a comprehensive explanation for implicit hate but rather a more explicit understanding of the underlying subtle connotations. For AbuseEval and ImpGab, two expert annotators (one male and one female social media expert; age range 29 to 35) perform the annotations based on the following guidelines:
• Implied meaning should consider the post author’s perspective.
• Implied meaning should emphasize the post’s content only.
• Annotations must be explicitly associated with the target entity.
• Annotations must contain a broader abusive context for the given post.
• Annotations should balance lexical diversity and uniformity w.r.t abuse toward a target group.
Annotation agreement
For our use case, annotation agreement scores help establish how well-aligned and coherent the explicit connotations are. To carry out the assessment, annotators A and B exchange a random sample of $30$ annotation pairs. They score the pairs on a 5-point Likert scale (Likert, Reference Likert1932), with 5 being the highest agreement. We obtain a mean agreement of $4.13 \pm 1.13$ for AbuseEval and $4.07 \pm 1.41$ for ImpGab. Table 3 lists some sample annotations and their agreement scores. Further, a third expert (a 24-year-old male) conducts an independent assessment using the above metric on another random set of $30$ samples. As per annotator C, we obtain a mean agreement of $4.55 \pm 1.09$ for AbuseEval and $4.41 \pm 1.15$ for ImpGab. This independent assessment corroborates the annotation process, as annotator C did not participate in the initial annotations yet observed similar alignment scores.
Generalizability testing
We further consider three SemEval tasks for our generalizability analysis. Sarcasm detection (Abu Farha et al. Reference Abu Farha, Oprea, Wilson and Magdy2022) and irony detection (Van Hee et al. Reference Van Hee, Lefever and Hoste2018) are two-way classification datasets. Meanwhile, stance detection (Mohammad et al. Reference Mohammad, Kiritchenko, Sobhani, Zhu and Cherry2016) is a three-way classification. While we have implied annotations for sarcasm, they are missing for the other two datasets. Here, no additional annotations are performed.
Hyperparameters
We run all experiments on two Nvidia V100 GPUs. Three random seeds (1, 4, 7) are used per setup. We report each setup’s best performance based on overall macro-F1 out of three random seeds, where the best seed for a setup may vary. We follow an 80-20 split for the dataset across experiments (specific to the seed). In initial experiments, we observe that $ADD^{inf + foc}$ has a stronger influence on the later iterations, whereas CE influences the initial ones. Thus, to balance them throughout the training process, we put equal weightage on both using $\beta = 0.5$ . We consider $K=3$ with $M=2$ imposters for all experiments. We leave the experiments for $\beta$ and $M$ for future work. We set $100$ as the maximum K-means iterations in each training step. During finetuning, each training cycle is executed for a maximum of $5000$ epochs with all layers of PLM frozen.
PLMs
We begin our assessment with BERT (Devlin et al. Reference Devlin, Chang, Lee and Toutanova2019). For hate speech detection, we also employ the domain-specific HateBERT (Caselli et al. Reference Caselli, Basile, Mitrović and Granitzer2021a) model to establish generalizability beyond BERT embeddings. HateBERT is built upon the concept of continued pretraining on top of BERT. Here, the corpus for performing another round of unsupervised masked language modeling is obtained from potentially offensive subreddits. For the SemEval tasks, we consider BERT and XLM (Chi et al. Reference Chi, Huang, Dong, Ma, Zheng, Singhal, Bajaj, Song, Mao, Huang and Wei2022) for evaluation based on their popularity in the SemEval tasks. The PLM variants are “bert-base-uncased” for BERT, “xlm-roberta-large” for XLM, and “GroNLP/hateBERT” for HateBERT.
Baselines
First, we assess the improvement in the performance of $ADD^{foc}$ over vanilla ADD (Equation 4) without the influence of cross-entropy. We follow the same prediction setup adopted in ADD (Rippel et al. Reference Rippel, Paluri, Dollár and Bourdev2016), where a sample gets assigned the label of the nearest cluster in the trained latent space during inference. We choose a simple long short-term memory (LSTM) based (Hochreiter and Schmidhuber, Reference Hochreiter and Schmidhuber1997) model for quicker experimentation and compare the original ADD formulation with class-weighted ADD ($\alpha$-ADD) and our proposed $ADD^{foc}$. Table 4(a) shows a significant performance improvement of $8.2$-$10.8$% in overall macro-F1 using $ADD^{foc}$ across all three hate datasets. We thus recommend using our $ADD^{foc}$ variant instead of vanilla ADD for future work. Interestingly, we note that $\alpha$-ADD does not outperform $ADD^{foc}$. Hence, it is not employed in further experiments. Further, we perform a three-way classification using BERT to compare standalone alpha cross-entropy (ACE) against standalone $ADD^{foc}$. The results are presented in Table 4(b). We observe that ACE outperforms standalone $ADD^{foc}$ by a substantial margin of $7.6\%$, $1.9\%$, and $2.6\%$ for LatentHatred, ImpGab, and AbuseEval, respectively. Based on the above two experiments, we employ ACE as our baseline. As the proposed model introduces an additional loss complementing ACE, we use ACE + $ADD^{foc}$ variants as comparative systems.
6. Results and ablations
In this section, we enlist the performance of FiADD for classifying implicit hate and discuss its robustness under different tasks and ablation setups. In both two- and three-way hate classifications, clustering is performed w.r.t. the three classes; however, the CH is determined by the specific setup, either two or three-way. For two-way hate classification, explicit (EXP) and implicit (IMP) labels are consolidated under the Hate class.
Two-way hate classification
From Table 5, we note that FiADD variants improve overall macro-F1 by $0.58$ ( $\uparrow 0.83$ %), $2.47$ ( $\uparrow 3.68$ %), and $0.56$ ( $\uparrow 0.79$ %) in LatentHatred, ImpGab, and AbuseEval, respectively, using BERT. However, except for maximizing hate macro-F1, the inferential objective does not significantly impact the final macro-F1 in the case of a two-way classification. It can be explained by the partially conflicting objectives between the final two-way result and $ADD^{inf + foc}$ ’s three-way objective, leading to higher misclassification.
Three-way hate classification
Inferential infusion reasonably impacts the outcome of the three-way classification task (Table 6). Overall, in three-way classification, $ADD^{inf + foc}$ provides an improvement of $0.09$ ($\uparrow 0.17$%), $0.47$ ($\uparrow 1.02$%), and $0.98$ ($\uparrow 1.85$%) in macro-F1 for LatentHatred, ImpGab, and AbuseEval, respectively, on BERT. It is noteworthy that we observe an even higher level of improvement for the implicit hate class than overall. Compared to ACE in three-way classification, $ADD^{foc}$ helps AbuseEval with an improvement of $0.26$ macro-F1 ($\uparrow 1.11$%) in implicit hate. Meanwhile, $ADD^{inf + foc}$ helps LatentHatred and ImpGab with improvements of $1.82$ ($\uparrow 3.26$%) and $0.39$ ($\uparrow 4.39$%) macro-F1, respectively, in implicit hate.
Generalizability test
The availability of implied annotations in the sarcasm dataset enables us to test FiADD’s $ADD^{inf+foc}$ variant. The unavailability of such annotations for the other two tasks limits our experiments to the $ADD^{foc}$ variant. Table 7 (a) and (b) present the results for sarcasm detection and the other two tasks (irony and stance detection), respectively. Barring one setup, we observe reasonable improvements in macro-F1 (0.41–2.37) across all three tasks using both PLMs. Further, considering the best of BERT and XLM, FiADD variants report an improvement of $6.06$ ($\uparrow 23.96$%), $1.35$ ($\uparrow 2.65$%), and $3.14$ ($\uparrow 5.42$%) for the respective minority class in sarcasm, stance, and irony detection.
Impact of domain-specific PLM
Under HateBERT, FiADD variants improve two-way classification by an overall $0.14$ ($\uparrow 0.20$%), $1.38$ ($\uparrow 2.00$%), and $0.13$ ($\uparrow 0.18$%) for LatentHatred, ImpGab, and AbuseEval, respectively. Similarly, FiADD variants improve three-way classification by an overall $0.7$ ($\uparrow 1.26$%), $0.16$ ($\uparrow 0.34$%), and $0.04$ ($\uparrow 0.08$%) for LatentHatred, ImpGab, and AbuseEval, respectively. However, the results with HateBERT show more variability. While all datasets benefit from FiADD in two-way classification via HateBERT, the implicit class of AbuseEval and ImpGab suffers under three-way classification. This variation can be attributed to the far larger number of offensive and slur terms in HateBERT’s pretraining corpus compared to BERT’s. Through this analysis, we are able to comment on domain-specific (HateBERT) vs general-purpose (BERT) systems and their role in finetuning. Interestingly, this has been noted in other hate speech research as well (Masud et al. Reference Masud, Khan, Goyal, Akhtar and Chakraborty2024a).
On the other hand, under generalization testing, which utilizes only general-purpose encoders (BERT and XLM), a high-performance improvement is observed in all minority classes.
Significance of hyperparameters
We further experiment with the hyperparameters of FiADD. The experiments are performed on the two-way hate classification task on the AbuseEval dataset using BERT. The limited range for the probe is heuristically defined based on the sample size of the categories. We recommend determining the values on a case-by-case basis for optimized performance. Figure 3(a) presents the significance of the number of subclusters per class ($K$) in the range [2-4]. We observe comparable performance for $K=3$ or $4$. For our experiments, since four of the six datasets contain three classes, we use $K=3$. The intuition is that within a class, one subcluster has a high affinity to the class itself while the other two lie closer to their imposter classes. For example, within the implicit hate class, we assume at least one subcluster is easy to label as implicit, while there will likely be at least one cluster each that is closer to the explicit and non-hate classes. Consequently, the setup leads to an imposter cluster value of $M=2$. Meanwhile, the significance of the $\gamma$ coefficient used in the focused objective is presented in Figure 3(b). The probe is limited to [1-5] with a unit interval, as followed in existing literature (Lin et al. Reference Lin, Goyal, Girshick, He and Dollár2017). We observe the best outcome with $\gamma =2$, which incidentally aligns with the best value identified by Lin et al. (Reference Lin, Goyal, Girshick, He and Dollár2017).
7. Does FiADD really improve implicit hate detection?
Given that the overall macro-F1 results on hate speech detection vary in a narrow range, significance testing would be inconclusive. We thus perform a granular analysis of the results across all seeds and assess how well FiADD modifies the latent space. We also conduct an error analysis of cases where implicit hate is easy and hard to classify.
Seed-wise analysis
Across three random seeds, two PLMs, and three datasets, we record the performance for 18 setups, each in two-way and three-way hate speech detection. We note from Tables 8 and 9 that out of the 36 combinations, only four instances register a drop in performance. It corroborates that FiADD’s improvements are not limited to a specific initialization setup. Interestingly, the setups that register failure are all under HateBERT. The results further contribute to the discussion on domain-specific PLMs in Section 6.
Error analysis
The motivation for FiADD is that implicit hate is closer to non-hate than to explicit hate. If this hypothesis holds, employing FiADD should correct the misclassified implicit labels. On the other hand, a false positive may occur if an example is already close to the explicit subspace; moving it further toward the explicit space can cause misclassification. We, thus, consider a positive/negative case where the predicted label for an implicit sample is correctly/incorrectly classified. To explain these two scenarios, we estimate the relative distance of the implicit sample from the explicit and non-hate clusters. First, we perform K-means clustering on the non-hate and explicit latent spaces to identify their centers. We then calculate the average Manhattan distance between the implicit sample and these local density centers. Finally, we obtain the relative score from the explicit space by normalizing the average explicit distance by the sum of the average distances from the non-hate and explicit spaces, yielding a value between 0 and 1. For example, if the sample has a distance of 3 from the explicit and 6 from the non-hate centers, then the normalized distance will be $3/(3+6)=0.33$.
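A minimal sketch of this relative-distance computation, matching the worked example above (the function name is ours):

```python
import numpy as np

def relative_explicit_distance(sample, explicit_centers, non_hate_centers):
    """Average Manhattan distance to the explicit centers, normalized by the sum
    of the average distances to the explicit and non-hate centers.
    Values range from 0 to 1; smaller means closer to the explicit space."""
    d_exp = np.mean([np.abs(sample - c).sum() for c in explicit_centers])
    d_non = np.mean([np.abs(sample - c).sum() for c in non_hate_centers])
    return d_exp / (d_exp + d_non)

# Worked example from the text: distance 3 from explicit, 6 from non-hate -> 0.33.
```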
We highlight a positive and a negative case in Figures 4 (a) and (b), respectively. In the positive case, the implicit sample is closer to the non-hate space (point A) under the ACE objective. After employing FiADD, its relative position moves away from non-hate and closer to explicit (point B). In contrast, for the negative case, where the implicit sample is initially close to explicit hate (point A), our objective leads to misclassification. In the future, this problem can be reduced by introducing a constraint that keeps the distance between the implicit and explicit spaces intact.
7.1 Latent space analysis
Building upon the cluster assessment in the error analysis, where we examined only a single positive and negative sample, we now perform an overall evaluation of how $ADD^{inf + foc}$ manipulates the embedding space. Inspired by the existing literature examining the latent space under hate speech datasets (Fortuna, Soler, and Wanner, Reference Fortuna, Soler and Wanner2020) and models (Kim et al. Reference Kim, Park and Han2022; Ocampo et al. Reference Ocampo, Sviridova, Cabrio and Villata2023c), we attempt to quantify the intercluster separation via Silhouette scores.
Silhouette score
It is a metric to measure the “goodness” of a clustering. It is calculated as a trade-off between within-cluster similarity and intercluster dissimilarity. Consider a system with $E$ points ($e_i$), each point belonging to one of the $N$ clusters $c^j(e_i)$. For $e_i \in c^a$, its Silhouette score is $SS_i=\frac{q_i-p_i}{\max(p_i,q_i)}$. $p_i$ captures the intra-cluster distance of $e_i$ to all the points within the cluster it belongs to; $p_i = \frac{1}{|c^a|-1}\sum _{e_j \in c^a}dist(e_i,e_j)$. $q_i$ captures the intercluster distance of $e_i \in c^a$ to all the points in the nearest cluster to $c^a$; $q_i = \frac{1}{|c^b|}\sum _{e_j \in c^b}dist(e_i,e_j)$. The Silhouette score of a setup is, thus, $SS=\frac{1}{|E|}\sum _{e_i \in E}SS_i$. Silhouette scores lie on a scale of -1 to 1, with -1 being the worst set of cluster assignments.
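In practice, the score can be computed directly over the projected representations, for example with scikit-learn; the random arrays below are stand-ins for the real embeddings and subcluster assignments.

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 128))      # stand-in for projected representations r_d
cluster_ids = rng.integers(0, 3, size=500)    # stand-in subcluster (or class) assignments

print("Silhouette score:", silhouette_score(embeddings, cluster_ids))
```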
Subclustering objective
After applying the $ADD^{inf + foc}$ objective, we expect not only the per-class clusters to be sufficiently separated but also the subclusters in each class to be better segregated to match their local neighborhood. Figure 5 shows the implicit embedding space of AbuseEval, ImpGab, and LatentHatred after applying K-means on the default BERT embedding (a, d, g), BERT finetuned with ACE (b, e, h), and FiADD (c, f, i) on three-way hate classification. The higher the Silhouette score, the better the subclusters are separated. $0.34$, $0.31$, and $0.51$ are the scores for cases (a), (b), and (c), respectively, in AbuseEval. $0.38$, $0.24$, and $0.52$ are the scores for cases (d), (e), and (f), respectively, in ImpGab. $0.32$, $0.29$, and $0.32$ are the scores for cases (g), (h), and (i), respectively, in LatentHatred.
Consequently, an increase of $0.20$ , $0.28$ , and $0.03$ scores is observed when comparing FiADD with ACE for AbuseEval, ImpGab, and LatentHatred, respectively. This increase in scores validates that the local densities within a class get further refined under $ADD^{inf + foc}$ objective. As expected, ACE suboptimally treats the implicit class as a single homogeneous cluster. Interestingly, for LatentHatred the score does not improve over the default BERT, even though it improves over ACE. A deeper analysis with multiple $K$ values might help here.
Inferential infusion
Given that $ADD^{inf + foc}$ brings the surface and semantic forms of implicit hate closer, we expect a significant drop in Silhouette scores between these clusters under FiADD. Figure 6 visualizes the embedding space of default BERT (a, d, g), BERT finetuned with the ACE (b, e, h), and FiADD (c, f, i) on three-way classification for AbuseEval, ImpGab, and LatentHatred. $0.18$ , $0.18$ , and $0.03$ are the scores for cases (a), (b), and (c), respectively, in AbuseEval. $0.18$ , $0.23$ , and $0.07$ are the scores for cases (d), (e), and (f), respectively, in ImpGab. $0.14$ , $0.13$ , and $0.01$ are the scores for cases (g), (h), and (i), respectively, in LatentHatred. It is important to highlight that for both BERT and BERT + ACE, there is no explicit objective to bring the implicit and implied clusters together. Hence, they act as a baseline for comparing how well the $ADD^{inf + foc}$ objective brings the two spaces closer.
A drop of $0.15$, $0.16$, and $0.12$ in the Silhouette score is observed when comparing BERT + ACE with FiADD for AbuseEval, ImpGab, and LatentHatred, respectively. It corroborates that the implicit and implied meaning representations are brought significantly closer to each other by our model. In addition to Tables 5 and 6, the latent space analysis also validates the utility of our manual annotations for AbuseEval and ImpGab, as the inferential infusion (supported by these annotations) improves the detection of implicit hate.
8. Conclusion
An increase in hate speech on the Web has necessitated the involvement of automated hate speech detection systems. To this end, we do not recommend completely removing human moderators; instead, we recommend employing machine learning-based systems to perform the first level of filtering. Following the rise of PLMs for text classification, they are now the de facto choice for hate speech detection, too. However, PLM-based systems still struggle to understand nuanced concepts, such as implicitness, and require external contextualization.
To this end, FiADD presents a generalized framework for semantic classification tasks in which the surface form of the source text differs from its inference form. For any system modeling this setup, the aim is to bring the two embedding spaces closer. In this work, the objective is achieved by optimizing for adaptive density discrimination via inferential infusion. Clustering accounts for variation in local neighborhoods beyond a single sample or a single positive/negative pairing; the inferential infusion ensures that while we look into the local neighborhood, the implicit clusters are mapped to the apt semantic latent spaces. Further, this work introduces the focal penalty, which pays more attention to samples near the classification boundary. Even by itself, the $ADD^{foc}$ objective provides a considerable improvement over a standard loss function and can be applied as a substitute.
Overall, our inferential-infused focal $ADD^{inf + foc}$ provides a novel augmentation to the PLM finetuning pipeline. The efficacy of FiADD’s variants is analyzed over three implicit hate detection datasets (two of them manually annotated by us for inferential context), three implicit semantic tasks (sarcasm, irony, and stance detection), and three PLMs (BERT, HateBERT, and XLM). By design, the $ADD^{inf + foc}$ objective helps improve the detection of hate in both two-way and three-way classifications. Our results also call into question the role of domain-specific models like HateBERT in NLP, as we observe that, once finetuned, HateBERT and BERT perform comparably.
A more granular examination of FiADD over the latent space for hate speech detection is performed via seed-wise performance measurement, latent space analysis of the embedding clusters, and error analysis of positive and negative use cases. Over multiple seeds and 36 experimental setups, we observe that the FiADD variants improve over ACE in 32 instances. Meanwhile, a closer look at the latent space further highlights the significant improvement FiADD brings to the implicit clusters by moving them near their implied meaning.
9. Limitations and future work
First, the current setup requires manual annotations of the implied meaning to be available for inferential clustering, which involves manual effort. Second, the proposed setup, being a novel approach in the direction of implicit hate detection, works with the de facto K-means algorithm and uses the same number of subclusters for all datasets.
In the future, we expect an infusion of generative models to pseudo-annotate the implied meaning, which can be paraphrased and rectified by human annotators on a need basis. Further, the proposed setup can be employed as an external loss to nudge LLMs to generate better-quality adversarial examples. Meanwhile, to avoid performing K-means on the entire training set after each epoch, one can consider representations only for the given batch, starting with stratified sampling so that the batch is representative of the overall dataset. Recent advancements in hashing and dictionary techniques can also improve computational efficiency. We aim to make the system more computationally efficient and extend its application to other tasks. It would also be fascinating to review how focal infusion impacts classification tasks in computer vision in comparison to the ADD setup.
Ethical concerns
This work focuses on textual features and does not incorporate personally identifiable or user-specific signals. For annotations, the annotators were sensitized to the task at hand and given sufficient compensation for their expert involvement. The annotators worked on $\approx$ 250 samples per day over four days to avoid fatigue. Further, the annotators had access to the Web; while annotating, they referred to multiple news sources to understand the context. The dataset of inferential statements for AbuseEval and ImpGab will be available to researchers on request.
Acknowledgments
Sarah Masud acknowledges the support of the Prime Minister Doctoral Fellowship in association with Wipro AI and Google India PhD Fellowship. Tanmoy Chakraborty acknowledges the financial support of Anusandhan National Research Foundation (CRG/2023/001351) and Rajiv Khemani Young Faculty Chair Professorship in Artificial Intelligence.