1 Introduction
Supervised topic classification requires labeled data for training. This often becomes a bottleneck, as high-quality labeled data are expensive to acquire. One way to overcome data scarcity is cross-domain topic classification (Osnabrügge, Ash, and Morelli 2021), where researchers train a model on a source domain with large labeled datasets and make inferences on a target domain where labeling is limited. This method relies on two conditions: rich labeled data in the source domain and high similarity between the source and target domains. To evaluate the accuracy of the cross-domain classifier, researchers only need to annotate a small dataset in the target domain.
With the advent of pretrained language models (Devlin et al. 2019), however, researchers no longer have to train models from scratch as is done in Osnabrügge et al. (2021). Rather, they can take advantage of existing pretrained language models and fine-tune the already well-trained parameters on specific downstream tasks. Given that language models are known to require relatively few training samples to yield good performance (Longpre, Wang, and DuBois 2020), the small annotated dataset in the target domain, which is required for validating cross-domain classifiers, might be sufficient to train an accurate in-domain classifier directly.
In this letter, we present topic classification with pretrained language models as an alternative solution to the data scarcity problem. We show that language models fine-tuned with a portion (70%) of the target-domain dataset, originally annotated for cross-domain verification, substantially outperform cross-domain topic classifiers, and that 300 training samples alone suffice for language models to match or surpass the performance of the cross-domain classifiers in Osnabrügge et al. (2021). We further show that fine-tuning these language models fits well within researchers' time budgets.
2 Methodology
Pretrained language models are state-of-the-art models in various natural language processing (NLP) tasks (Devlin et al. 2019; Lan et al. 2020; Liu et al. 2019). The heavy lifting is done during the pretraining stage, where large amounts of unlabeled text, for example, the English Wikipedia, are used to train multilayer transformer models (Vaswani et al. 2017) for masked language modeling and replaced token detection, among other tasks (Clark et al. 2020). Fine-tuning these pretrained language models has achieved state-of-the-art results in various NLP tasks, including classification and question answering.
Compared with other NLP models that must be trained from randomly initialized parameters, one advantage of pretrained language models is that large amounts of knowledge are packed into their parameters during the pretraining stage; they therefore require only a small labeled dataset for fine-tuning to achieve strong performance (Longpre et al. 2020). This fits well with topic classification for political texts, where labeling is expensive, and offers an alternative to cross-domain topic classification, which trains parameters from scratch using a labeled dataset from a different but similar domain.
For this letter, we fine-tune a RoBERTa-base model (Liu et al. 2019) for topic classification using the target dataset from Osnabrügge et al. (2021). RoBERTa-base has 12 layers of transformers and 125 million parameters in total. On top of the 12 transformer layers, we add a classification layer for 44-topic and 8-topic classification, respectively. We use cross-entropy as the loss function. We fine-tune the RoBERTa-base model on an A100 GPU with a learning rate of 2e-5, a batch size of 16, and the maximum input sequence length of 512. We use accuracy on the validation set to select the best epoch and the optimal checkpoint, which we then use to make inferences on the test set with a batch size of 64. For easy comparison, we use the same evaluation metrics as Osnabrügge et al. (2021).
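To make the setup concrete, the following is a minimal fine-tuning sketch using the Hugging Face transformers and datasets libraries. It mirrors the hyperparameters described above, but the file names, column names ("text" and "label"), and output paths are illustrative assumptions rather than the exact replication code.

```python
# Minimal fine-tuning sketch (assumed setup; not the exact replication code).
import numpy as np
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

NUM_TOPICS = 44  # or 8 for the coarse-grained task

# Hypothetical CSV files with "text" and "label" columns (placeholders, not the original filenames).
ds = load_dataset("csv", data_files={"train": "train.csv", "validation": "val.csv"})

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=NUM_TOPICS)  # adds a classification head; cross-entropy loss by default

def tokenize(batch):
    # Truncate/pad speeches to the model's maximum length of 512 tokens.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=512)

ds = ds.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="roberta-topic",          # placeholder output directory
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    num_train_epochs=20,
    evaluation_strategy="epoch",         # evaluate after every epoch
    save_strategy="epoch",
    load_best_model_at_end=True,         # keep the checkpoint with the best validation accuracy
    metric_for_best_model="accuracy",
    seed=42,                             # illustrative seed
)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": (preds == labels).mean()}

trainer = Trainer(model=model, args=args,
                  train_dataset=ds["train"], eval_dataset=ds["validation"],
                  tokenizer=tokenizer, compute_metrics=compute_metrics)
trainer.train()
```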
For constructing the train, validation, and test sets, we use the 4,165 New Zealand parliamentary speeches in the target domain of Osnabrügge et al. (2021). These speeches were originally labeled to verify the effectiveness of cross-domain topic classification. In this letter, we show that these labeled speeches alone are sufficient to train a competitive topic classifier by fine-tuning a pretrained language model. In our main experiment, we randomly sample 70% of the dataset as the training set, 15% as the validation set, and the remaining 15% as the test set. In total, 2,915 samples are used for training, 625 for validation, and 625 for testing. For reproducibility, we set a random seed for nondeterministic operations in the experiment (Zhang et al. 2021) and report the averaged results of five random runs.
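A 70/15/15 split with seeded randomness, as described above, can be produced along the following lines; the variable names and the specific seed values are placeholders.

```python
# Illustrative 70/15/15 split with a fixed seed (variable names are placeholders).
import random
import numpy as np
import torch
from sklearn.model_selection import train_test_split

def set_seed(seed: int) -> None:
    # Seed the common sources of nondeterminism (Python, NumPy, PyTorch).
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

def split_70_15_15(texts, labels, seed):
    set_seed(seed)
    # First carve out the 70% training portion...
    train_x, rest_x, train_y, rest_y = train_test_split(
        texts, labels, test_size=0.30, random_state=seed)
    # ...then split the remaining 30% evenly into validation and test sets.
    val_x, test_x, val_y, test_y = train_test_split(
        rest_x, rest_y, test_size=0.50, random_state=seed)
    return (train_x, train_y), (val_x, val_y), (test_x, test_y)

# One run per seed; reported results are averaged over five runs.
for seed in [0, 1, 2, 3, 4]:  # illustrative seeds
    splits = split_70_15_15(texts, labels, seed)  # `texts`, `labels` assumed loaded elsewhere
```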
3 Results
3.1 Main Experiment
In the main experiment, we use 2,915 (70%) samples for training, 625 (15%) for validation, and 625 (15%) for testing, and we repeat the experiment five times with five random seeds. We report the results in Table 1 with the mean and standard deviation of each metric. Across all metrics and both the 44-topic and the 8-topic classification tasks, fine-tuning the RoBERTa model on a subset of the labeled New Zealand parliamentary speeches substantially outperforms the cross-domain topic classifier of Osnabrügge et al. (2021), which is trained on 115,420 annotated policy statements.
Specifically, for 44-topic classification, our top-1 accuracy stands at 52.7%, which is 27.3% higher (in relative terms) than the cross-domain classifier's 41.4%. For 8-topic classification, our top-1 accuracy stands at 63.1%, which is 22.5% higher than the cross-domain classifier's 51.5%. We see large gains on other metrics as well: more than 10% in top-3 accuracy, more than 5% in top-5 accuracy, more than 16% in balanced accuracy, and more than 12% in F1 macro. These results suggest that it is feasible to train a competitive in-domain classifier with a portion of the target corpus.
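For reference, the metrics reported here can be computed from the model's predicted class probabilities, for example as in the following sketch; the function and variable names are illustrative, and we assume scikit-learn's standard metric functions.

```python
# Illustrative computation of the reported metrics from predicted probabilities.
import numpy as np
from sklearn.metrics import top_k_accuracy_score, balanced_accuracy_score, f1_score

def evaluate_predictions(probs: np.ndarray, y_true: np.ndarray) -> dict:
    """probs: (n_samples, n_topics) predicted probabilities; y_true: integer topic labels."""
    y_pred = probs.argmax(axis=1)
    all_labels = np.arange(probs.shape[1])
    return {
        "top1_accuracy": float((y_pred == y_true).mean()),
        "top3_accuracy": top_k_accuracy_score(y_true, probs, k=3, labels=all_labels),
        "top5_accuracy": top_k_accuracy_score(y_true, probs, k=5, labels=all_labels),
        "balanced_accuracy": balanced_accuracy_score(y_true, y_pred),
        "f1_macro": f1_score(y_true, y_pred, average="macro"),
    }
```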
3.2 Performance by Topic
In Table 2, we compare the performance of the fine-tuned RoBERTa models with that of the 44-topic cross-domain classifier (top) and the 8-topic cross-domain classifier (bottom) in terms of per-topic accuracy on the test set, using one of the five random runs from the main experiment. One immediate observation is that for 44-topic classification, the fine-tuned RoBERTa model performs better on larger topics. For this particular run, among topics with more than 10 samples in the test set, the fine-tuned RoBERTa model does better or equally well on all topics except "education" and "equality."
Our second observation is that the fine-tuned RoBERTa model's advantage over the cross-domain classifier disappears on rare topics, such as "nationalization" and "underprivileged minority groups." Because these topics are rare, the RoBERTa model did not see enough such samples during fine-tuning. By contrast, the cross-domain classifier saw considerably more such samples during its training on party manifestos and thus has an advantage in predicting samples on rare topics correctly.
Our third observation is that for 8-topic classification, the fine-tuned RoBERTa model outperforms the cross-domain classifier on seven of the eight topics. This is not surprising: with fewer topics, each topic contains more samples, which in turn ensures that the RoBERTa model sees enough training examples per topic during fine-tuning. In the next subsection, we explore this question from a slightly different angle: what is the minimum number of samples that the fine-tuned language model needs to outperform the cross-domain classifiers?
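Per-topic accuracies such as those in Table 2 can be obtained by grouping test-set predictions by their true label; a minimal sketch, assuming integer-coded topic labels:

```python
# Minimal per-topic accuracy sketch (assumes integer-coded topic labels).
import numpy as np

def per_topic_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    accs = {}
    for topic in np.unique(y_true):
        mask = y_true == topic
        accs[int(topic)] = float((y_pred[mask] == topic).mean())
    return accs
```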
3.3 Number of Training Samples
In this experiment, we study how many training samples the fine-tuned language model requires to match the performance reported in Osnabrügge et al. (2021). The experiment is motivated by the observation that researchers often may not have access to an annotated target set with as many as the 2,915 training samples we used in the main experiment. Will the fine-tuned language model remain competitive with a much smaller training set? We fine-tune the language model for 20 epochs with 200, 300, and 400 training samples, respectively, and split the remaining samples evenly into a validation set and a test set. We run each setting five times and report, in Figure 1, the mean top-1 accuracy plus and minus one standard deviation. For easy comparison, we also include the corresponding performance of the cross-domain classifier as reported in Osnabrügge et al. (2021).
We observe that with 300 training samples, the fine-tuned language model is able to outperform the cross-domain classifier on the 44-topic classification task (left) and the 8-topic classification task (right). This suggests that depending on task difficulty, researchers with a few hundred training samples may consider a fine-tuned language model as an effective option.
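The sample-size experiment can be organized as a loop over training-set sizes and seeds, sketched below; fine_tune_and_evaluate is a hypothetical helper wrapping the fine-tuning setup shown earlier, and texts and labels are assumed to hold the annotated speeches.

```python
# Illustrative loop over training-set sizes (200/300/400) and five seeds.
import numpy as np
from sklearn.model_selection import train_test_split

results = {n: [] for n in (200, 300, 400)}
for n_train in (200, 300, 400):
    for seed in range(5):  # five random runs per setting
        train_x, rest_x, train_y, rest_y = train_test_split(
            texts, labels, train_size=n_train, random_state=seed)
        # Split what remains evenly into validation and test sets.
        val_x, test_x, val_y, test_y = train_test_split(
            rest_x, rest_y, test_size=0.5, random_state=seed)
        acc = fine_tune_and_evaluate(  # hypothetical helper, not part of the replication code
            train_x, train_y, val_x, val_y, test_x, test_y, num_epochs=20)
        results[n_train].append(acc)

for n_train, accs in results.items():
    print(n_train, np.mean(accs), np.std(accs))  # mean and spread of top-1 accuracy
```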
3.4 Training and Inference Time
While language models are known to be slow given their large sizes, we note that their training and inference times fit well within the time budget of most researchers. In terms of training, on a single A100 GPU with 40 GB of memory, it takes 27 minutes to fine-tune the model for 20 epochs over 2,915 samples. Training time decreases further, roughly linearly, with fewer training samples. To put that into perspective, training the cross-domain classifier with cross-validation in Osnabrügge et al. (2021) takes 27 minutes on an iMac with 16 CPUs, and generating a single OLS regression table on large datasets can take more than 20 minutes (Stone, Wang, and Yu 2022). It is certainly not fair to compare GPU time with other models' CPU time, but from the researchers' point of view, the amount of time needed to fine-tune a language model is broadly comparable to that of other research methods.
Compared with training, inference is significantly faster. With a batch size of 64, our model makes around 145 inferences per second on a single A100 GPU, which translates to 10,000 inferences in a little over a minute. With such a quick turnaround in training and inference, our method should fit into the time budget of most researchers.
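A batched inference loop of this kind might look as follows; the checkpoint path is a placeholder for wherever the best fine-tuned model was saved.

```python
# Illustrative batched inference with the fine-tuned checkpoint (paths/names are placeholders).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

device = "cuda" if torch.cuda.is_available() else "cpu"
checkpoint = "roberta-topic/best-checkpoint"  # hypothetical path to the saved model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint).to(device)
model.eval()

def predict(texts, batch_size=64):
    preds = []
    for i in range(0, len(texts), batch_size):
        batch = tokenizer(texts[i:i + batch_size], truncation=True, padding=True,
                          max_length=512, return_tensors="pt").to(device)
        with torch.no_grad():
            logits = model(**batch).logits
        preds.extend(logits.argmax(dim=-1).cpu().tolist())
    return preds
```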
4 Conclusion
Osnabrügge et al. (2021) recently proposed cross-domain supervised training to take advantage of existing labeled data and to reduce data collection costs in classifying political texts. In this letter, we have proposed an alternative that builds on pretrained language models. We have shown that fine-tuning a pretrained language model requires only a small annotated dataset: with just a small portion (10%) of the annotated dataset that was originally used to evaluate the cross-domain classifier, a fine-tuned RoBERTa-base model can outperform the cross-domain classifier. We have also noted that on topics with few to no in-domain training samples, the advantage of the fine-tuned language model over cross-domain classifiers largely disappears. Lastly, we have shown that the fine-tuned models are competitive in terms of training and inference time. Future research could explore the broader application of pretrained language models, alongside cross-domain classifiers, to other research questions, such as populism prediction (Cocco and Monechi 2022), sentiment and stance analysis (Bestvater and Monroe 2022), and party position analysis (Herrmann and Döring 2021), as well as the optimization of pretrained language models in training and inference.
Acknowledgments
I thank the reviewers and the editor for their excellent comments and guidance, which substantially improved the paper and inspired new ideas for future work. It is no overstatement to say that they contributed to the paper as co-authors.
Data Availability Statement
The replication materials (Wang 2023) are available at https://doi.org/10.7910/DVN/FMT8KR.
Supplementary Material
For supplementary material accompanying this paper, please visit https://doi.org/10.1017/pan.2023.3.