1. Introduction
Recently, a large number of models have made breakthroughs on various natural language processing (NLP) datasets (Kenton and Toutanova 2019; Liu et al. 2019). Meanwhile, an increasing number and variety of NLP datasets have been proposed for model training and evaluation (Malmasi et al. 2022; Yin et al. 2017; Srivastava et al. 2022).
However, despite datasets significantly impacting model development and assessment (Bommasani et al. 2021), their quality is seldom systematically verified. Recent literature has indicated various quality issues within NLP datasets, for example, label mistakes (Wang et al. 2019). Datasets with quality issues frequently give rise to model shortcuts (Gururangan et al. 2022; Poliak et al. 2018) or induce incorrect conclusions (Goyal et al. 2022; Rashkin et al. 2023).
In this paper, we aim to answer two primary questions: (1) How to evaluate dataset quality in a model-agnostic manner? A comprehensive dataset quality evaluation is crucial for selecting adequate training resources. Furthermore, when there are discrepancies in model performance across different datasets, an unbiased evaluation of dataset quality can serve as a reliable arbitrator. (2) How do the statistical scores on dataset properties affect the model performance? The insights gained from this will guide improvements in dataset quality, which is crucial for developing effective and unbiased models.
To this end, we introduce a dataset evaluation framework (Figure 1) and take the named entity recognition (NER) datasets as a case study. Inspired by Classical Test Theory (CTT) (Novick 1966) in psychometrics, our dataset evaluation framework includes three key dimensions: reliability, difficulty, and validity. Reliability reflects how credible the dataset is, difficulty represents how challenging the dataset is and how well it differentiates models, and validity means how well the dataset fits the motivation and goal of the task. Following this framework, we introduce nine metrics under the three dimensions for the statistical properties of NER datasets and assess the quality of ten widely used NER datasets.
Extensive experimental results validate that our evaluation metrics derived from the dataset properties are highly correlated with the performance of NER models and human evaluation results. The evaluation results enhance our comprehension of the datasets and bring some novel insights. For example, one of the most widely used English NER datasets, CoNLL03 (Sang and De Meulder 2003), is far less challenging ($0.43$, $0.30$, and $2.63$ points lower on the Unseen Entity Ratio, Entity Ambiguity Degree, and Model Differentiation metrics, respectively) than WNUT16 (Strauss et al. 2016), which has received less attention previously. In addition, by controlled dataset adjustment (Sec. 6.4), we find that dataset quality on the statistical metrics, including Unseen Entity Ratio, Entity Ambiguity Degree, and Entity-Null Rate, affects NER model performance significantly.
We believe that statistical dataset evaluation provides a direct and comprehensive reflection of the dataset quality. We recommend evaluating dataset quality before training or testing models, for NER and for other NLP tasks alike, to gain a better understanding of the task and the data.
2. Related work
2.1 Issues in NLP datasets
Recent works have shown that NLP datasets have a number of quality problems, for example, label mistakes (Wang et al. 2019), entity missingFootnote a (Tejaswin et al. 2021), and unwanted biases resulting from the annotation process (Kaushik and Lipton 2018; Nadeem et al. 2021). For instance, Wang et al. (2019) identified a notable 5.38 percent rate of label mistakes in the CoNLL03 NER dataset, a concerning figure for a widely used benchmark in NLP research. Tejaswin et al. (2021) manually checked 600 randomly selected instances from three sources: CNN/DailyMail (Hermann et al. 2015; Nallapati et al. 2016), Gigaword (Rush et al. 2015), and XSum (Narayan et al. 2018), which are datasets commonly used for text summarization tasks. Their analysis revealed a significant proportion of instances with issues of Entity Missing and Evidence MissingFootnote b in these datasets. This indicates that the target summaries often contained entities or concepts absent from the source texts, raising questions about the accuracy of these datasets.
Furthermore, studies by Sugawara et al. (2020) and Gururangan et al. (2022) suggest that performance metrics on certain machine reading comprehension and natural language inference datasets might be artificially inflated. This is attributed to models exploiting spurious correlations rather than truly understanding the underlying language structures, resulting in poor generalization when applied to real-world scenarios.
An equally significant concern in NLP dataset construction is data leakage, particularly test-train overlap, which poses substantial risks to model evaluation. Studies like Lewis et al. (2021) reveal that a considerable portion of test data may mirror the training set, so that models risk overfitting to the data rather than generalizing, which inflates performance scores. Larson et al. (2023) echo this sentiment, highlighting similar concerns in document classification. These studies collectively call for improved dataset division methods and robust validation techniques to mitigate data leakage and truly measure a model’s generalization capabilities on unseen data.
However, most works focus on a specific issue of the datasets, and most issues are highly related to the model training process. Inspired by CTT, we build our dataset quality evaluation framework on the reliability, difficulty, and validity dimensions. We develop metrics for assessing the quality of datasets under these three dimensions in conjunction with NER task characteristics, and we experimentally validate the effectiveness of our metrics in dataset evaluation.
2.2 Data-centric AI
In the contemporary landscape of NLP and machine learning, the pivotal role of datasets has increasingly been acknowledged. The seminal work by Ng et al. (2021) has galvanized the shift toward a data-centric AI paradigm, underscoring the potential of enhancing data quality to achieve superior model performance over merely refining algorithms. This approach dovetails with initiatives like the NHS’s Data Quality Maturity Index Methodology,Footnote c which provides a structured framework to assess and improve the quality of data in healthcare, a sector that greatly benefits from NLP technologies.
Simultaneously, the introduction of DataCLUE by Xu et al. (2021) marks a significant stride in this domain, offering the first benchmark specifically tailored for evaluating data-centric approaches in NLP. This benchmark aligns with tools such as the Data Quality for AI Tool (Jariwala et al. 2022) provided by IBM, which facilitates exploratory data analysis through its API, thereby enabling a more rigorous and systematic enhancement of datasets.
Moreover, a comprehensive review (Zha et al. 2023) offers a detailed exploration of the need for data-centric AI, addressing the methodological pivot from a model-centric to a data-centric perspective in AI research. This survey highlights the indispensable need for high-quality data to train robust machine learning models, especially in domains where data are prone to noise, sparsity, and bias.
In light of these developments, our proposed evaluation metrics aim to contribute to the ongoing efforts of dataset quality improvement. These metrics are designed to facilitate both automatic and semi-automatic enhancements of datasets, ensuring that the data used to train NLP models are of the highest fidelity and thus capable of driving the performance of these models to new heights. The systematic application of such metrics can significantly streamline the process of data quality assurance, making it more tractable for researchers and practitioners to achieve data excellence in AI systems.
The cumulative effect of these methodologies and tools signifies a transformative movement in AI research, where data are no longer a passive element but a dynamic and critical component of the AI development lifecycle. As this data-centric ethos permeates the field, it is anticipated that future advancements in NLP will be increasingly driven by innovations in data quality management, thereby catalyzing a new era of AI systems that are both powerful and reliable.
3. Classical Test Theory
Human tests or exams usually follow strict testing theories, such as CTT (Novick 1966), a statistical framework to measure the quality of the exams. According to CTT, a thorough and systematic evaluation should consider three dimensions: reliability, difficulty, and validity.
In this paper, we introduce CTT for Dataset Evaluation. Adapting traditional CTT to dataset evaluation, we specified the definitions of reliability, difficulty, and validity as follows:
• Reliability measures the trustworthiness of the evaluation dataset. For instance, datasets with a high number of labeling errors lack sufficient confidence to evaluate the performance of different models.
• Difficulty measures how challenging the dataset is and how well it differentiates among various models and between human and machine performance.
• Validity evaluates how well the dataset measures the capability of models on the intended task.
4. Dataset quality evaluation framework
Following CTT for Dataset Evaluation, we build our statistical dataset evaluation frameworkFootnote d and apply it to NER datasets. It includes nine fundamental metrics of the statistical properties of NER datasets. In this section, we introduce the definitions and the mathematical formulations of the proposed metrics.
For a datasetFootnote e $D$ with $n$ instances, let $(x^{(i)}, y^{(i)})$ represent the $i$-th instance ($i=1,2,\ldots,n$). The input sequence $x^{(i)}$ consists of $m^{(i)}$ tokens, and the output sequence $y^{(i)}$ consists of $m^{(i)}$ entity values. Let $\mathscr{C}$ represent the set of entity types in $D$ (including “Not an entity”), with each entity type $c_j \in \mathscr{C}$, $j \in \{1,2,\ldots,v\}$, where $v$ is the total number of entity types. We use $Te$, $Tr$, and $De$ to represent the test set, the training set, and the development set, respectively. The function $e(\boldsymbol{y}_D)$ returns the set of entity values in $\boldsymbol{y}_D$, the collection of all $y^{(i)}$ in $D$; we sometimes omit the subscript $D$ for simplicity.
4.1 Metrics under reliability
The metrics under reliability, namely Redundancy, Accuracy, and Leakage Ratio, evaluate how accurate and trustworthy a dataset is. Redundancy uncovers duplicate information within the dataset, which matters for consistent results and an unbiased representation of the data. Accuracy, verified manually, assesses how well the annotations reflect real-world information, a fundamental basis for reliable outcomes. Leakage Ratio detects the spillover of test data into the training data, which would otherwise distort the measurement of a model’s true performance. Together, these metrics form the cornerstone of dataset reliability.
Redundancy measures the proportion of duplicate instances in a dataset $D$. A lower Redundancy value is better as it indicates fewer duplicates and, therefore, a higher diversity in the data. It is calculated by dividing the number of instances appearing more than once by the dataset’s total number of instances.
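The following is a minimal sketch of the Redundancy computation as described in words above. It assumes that an instance is a (tokens, labels) pair and that every copy of a repeated instance counts as redundant; both choices, and the function name `redundancy`, are illustrative assumptions rather than the exact formulation used in the paper.

```python
from collections import Counter


def redundancy(instances):
    """Share of instances whose (tokens, labels) pair occurs more than once.

    Assumption: each instance is a (tokens, labels) pair of sequences, and
    all copies of a repeated instance are counted as redundant.
    """
    keys = [(tuple(tokens), tuple(labels)) for tokens, labels in instances]
    if not keys:
        return 0.0
    counts = Counter(keys)
    repeated = sum(1 for key in keys if counts[key] > 1)
    return repeated / len(keys)


toy = [
    (["Paris", "is", "nice"], ["B-LOC", "O", "O"]),
    (["Paris", "is", "nice"], ["B-LOC", "O", "O"]),  # exact duplicate
    (["Hello"], ["O"]),
    (["Bob", "left"], ["B-PER", "O"]),
]
print(redundancy(toy))  # 2 of the 4 instances are repeated -> 0.5
```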
Accuracy evaluates the annotation correctness of the dataset, measured as the proportion of correctly annotated instances among the instances checked. A higher value is preferred because it suggests a more reliable dataset.
We recommend selecting 100 instances from each dataset split and inviting at least three professional linguists to annotate the Accuracy. To evaluate inter-rater reliability, we compute Cohen’s Kappa coefficient (Cohen 1960) pairwise among the three annotators and then average these values. A mean Kappa exceeding $0.75$ indicates substantial rater agreement, ensuring annotation reliability.
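The averaged pairwise agreement can be sketched as follows. The code implements the standard Cohen's Kappa, $\kappa = (p_o - p_e)/(1 - p_e)$, for two raters and averages it over all rater pairs. The data layout (one list of per-instance judgments per annotator) and the function names are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's Kappa for two equally long lists of categorical judgments."""
    n = len(a)
    # Observed agreement: fraction of items on which both raters agree.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[label] * count_b[label] for label in set(a) | set(b)) / (n * n)
    if p_e == 1:  # both raters used a single identical label throughout
        return 1.0
    return (p_o - p_e) / (1 - p_e)


def mean_pairwise_kappa(ratings):
    """Average Cohen's Kappa over all pairs of annotators."""
    pairs = list(combinations(ratings, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


# Hypothetical judgments: 1 = annotation judged correct, 0 = judged incorrect.
annotators = [[1, 1, 0, 0], [1, 1, 0, 0], [1, 0, 1, 0]]
print(mean_pairwise_kappa(annotators))
```

In this toy example the first two annotators agree perfectly (Kappa of 1) while the third agrees with them only at chance level (Kappa of 0), so the mean falls well below the $0.75$ threshold.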
Leakage Ratio is a critical metric used to assess the extent of data leakage between different dataset partitions, specifically how many instances in the test set ($Te$) have incorrectly appeared in the training set ($Tr$) or development set ($De$). A lower Leakage Ratio is indicative of better dataset partitioning as it suggests that there is minimal to no overlap between the sets, which is essential for preventing models from merely memorizing specific instances instead of learning to generalize. The Leakage Ratio is defined as the proportion of test instances that also appear in the training or development set.
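A minimal sketch of the Leakage Ratio follows. It assumes that an instance is a (tokens, labels) pair and that two instances match only when both tokens and labels are identical; matching on the input text alone would be an equally plausible reading. The function name is hypothetical.

```python
def leakage_ratio(test, train, dev):
    """Fraction of test instances that also occur in the train or dev split.

    Assumption: instances are (tokens, labels) pairs and are compared by
    exact equality of both components.
    """
    if not test:
        return 0.0
    seen = {(tuple(x), tuple(y)) for x, y in list(train) + list(dev)}
    leaked = sum((tuple(x), tuple(y)) in seen for x, y in test)
    return leaked / len(test)


train = [(["Paris", "is", "nice"], ["B-LOC", "O", "O"])]
dev = [(["Bob", "left"], ["B-PER", "O"])]
test = [
    (["Paris", "is", "nice"], ["B-LOC", "O", "O"]),  # also in train
    (["Bob", "left"], ["B-PER", "O"]),               # also in dev
    (["Rome", "calls"], ["B-LOC", "O"]),
    (["Hi"], ["O"]),
]
print(leakage_ratio(test, train, dev))  # 2 of 4 test instances leak -> 0.5
```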
4.2 Metrics under difficulty
We propose four metrics under difficulty to assess how challenging the datasets are, including three intrinsic metrics (Unseen Entity Ratio, Entity Ambiguity Degree, and Text Complexity) and one extrinsic metric (Model Differentiation). These difficulty metrics assess a dataset’s challenge level for NLP models. Unseen Entity Ratio tests generalization by measuring novel entities, pushing models beyond their training. Entity Ambiguity Degree and Text Complexity challenge models with varied entity types and dense entity arrangements, requiring nuanced interpretation. Model Differentiation shows a dataset’s power to separate model performances, testing robustness. Together, they define the dataset’s challenge in terms of generalization, ambiguity, density, and differentiation, fitting the difficulty dimension.
The Unseen Entity Ratio quantifies the proportion of entities in the test set labels that are not present in the training set, and thus probes the model’s ability to generalize. A higher Unseen Entity Ratio indicates a greater challenge, because the model must recognize entities it has not encountered during training.
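Under the notation of Sec. 4, one natural reading of this metric is $|e(\boldsymbol{y}_{Te}) \setminus e(\boldsymbol{y}_{Tr})| / |e(\boldsymbol{y}_{Te})|$. The sketch below follows that reading and counts distinct entity surface forms; whether the paper counts distinct entities or individual mentions is an assumption here, as is the function name.

```python
def unseen_entity_ratio(test_entities, train_entities):
    """Share of distinct test entities that never occur in the training set.

    Assumption: entities are compared by surface form and counted once each.
    """
    test_set, train_set = set(test_entities), set(train_entities)
    if not test_set:
        return 0.0
    return len(test_set - train_set) / len(test_set)


train_entities = ["Paris", "Berlin"]
test_entities = ["Paris", "Rome", "Oslo", "Paris"]
# Distinct test entities: Paris, Rome, Oslo; Rome and Oslo are unseen.
print(unseen_entity_ratio(test_entities, train_entities))
```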
Entity Ambiguity Degree measures how many entities are labeled with more than one entity type. For example, if “apple” is labeled as “Fruit” in one instance and as “Company” in another instance, then there is a conflict in $D$. A higher Entity Ambiguity Degree represents a more challenging dataset because it indicates more instances where an entity is labeled with different types, thereby confusing NER models. We use $e^*(D)$ to denote the number of conflicting entities in dataset $D$ and derive the Entity Ambiguity Degree from it.
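As an illustration, the sketch below counts the conflicting entities $e^*(D)$ and normalizes by the number of distinct entities. The normalization is an assumption, since the exact denominator is not recoverable from the text above, and the input format (a list of (surface form, type) mentions) is likewise hypothetical.

```python
from collections import defaultdict


def entity_ambiguity_degree(mentions):
    """Share of distinct entities annotated with more than one entity type.

    Assumption: the count of conflicting entities is divided by the number
    of distinct entity surface forms in the dataset.
    """
    types_by_entity = defaultdict(set)
    for surface, entity_type in mentions:
        types_by_entity[surface].add(entity_type)
    if not types_by_entity:
        return 0.0
    conflicts = sum(1 for types in types_by_entity.values() if len(types) > 1)
    return conflicts / len(types_by_entity)


mentions = [
    ("apple", "Fruit"),
    ("apple", "Company"),   # same surface form, different type: a conflict
    ("Paris", "Location"),
    ("Paris", "Location"),
]
print(entity_ambiguity_degree(mentions))  # 1 of 2 distinct entities -> 0.5
```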
Text Complexity measures the average Entity Density of the sentences in the dataset. Higher Text Complexity signals a more difficult dataset because it implies that sentences are densely packed with entities, requiring more nuanced understanding and recognition by the model.
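A possible implementation is sketched below. It assumes that the Entity Density of a sentence is the fraction of its tokens that belong to an entity, with `"O"` marking non-entity tokens as in BIO tagging; the paper may instead count entities per sentence, so this definition should be read as an illustrative assumption.

```python
def text_complexity(tag_sequences, outside="O"):
    """Average per-sentence entity density.

    Assumption: density = entity tokens / all tokens in the sentence, where
    any tag other than `outside` marks an entity token.
    """
    densities = [
        sum(tag != outside for tag in tags) / len(tags)
        for tags in tag_sequences
        if tags  # skip empty sentences
    ]
    return sum(densities) / len(densities) if densities else 0.0


tags = [["B-PER", "I-PER", "O", "O"], ["O", "O"]]
print(text_complexity(tags))  # densities 0.5 and 0.0 -> mean 0.25
```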
Model Differentiation evaluates the dataset’s ability to distinguish the performance of different models. A higher Model Differentiation value is better as it indicates that the dataset can effectively reveal differences in model performances, making it a useful tool for benchmarking. It is determined using the standard deviation of the scores of $k$ different models.
We recommend using the top five model scores on the datasetFootnote f for ModDiff calculation.
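The computation can be sketched in a few lines. The population standard deviation (`statistics.pstdev`) is used here; whether the paper uses the population or the sample form is not stated in the text, so that choice is an assumption, and the scores below are made-up values for illustration only.

```python
import statistics


def model_differentiation(scores):
    """Standard deviation of the scores of k models on one dataset.

    Assumption: population standard deviation (divides by k, not k - 1).
    """
    return statistics.pstdev(scores)


# Hypothetical F1 scores of three models on the same dataset.
hypothetical_scores = [90.0, 92.0, 94.0]
print(model_differentiation(hypothetical_scores))
```

Swapping in `statistics.stdev` gives the sample standard deviation, which is slightly larger for small $k$.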
4.3 Metrics under validity
The metrics under validity, for example, Entity Imbalance Degree and Entity-Null Rate for NER datasets, evaluate how effectively the dataset assesses a model’s ability on the specific task. Entity Imbalance Degree checks whether entity types are evenly represented, so that models are evaluated without bias toward the frequent types. Entity-Null Rate measures how rich the dataset is in entity examples, which is vital for testing model learning depth. Both metrics directly contribute to assessing a dataset’s ability to provide a fair and thorough evaluation of model performance.
Entity Imbalance Degree mainly measures the unevenness of the distribution of different entities in $D$. A lower Entity Imbalance Degree is better as it indicates a more balanced distribution of entity types, which is desirable for ensuring that the model is equally exposed to all categories and does not develop a bias toward the more frequent ones. Specifically, we use standard deviation to quantify the degree of dispersion of the distribution of all the different types of entities $\mathscr{C}$ in the dataset.Footnote g
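The sketch below illustrates one way to compute such a dispersion. It takes the population standard deviation of the per-type shares of entity mentions; using shares rather than raw counts, and the population rather than the sample form, are assumptions that may differ from the exact formulation in the paper.

```python
import statistics
from collections import Counter


def entity_imbalance_degree(entity_types):
    """Standard deviation of the share of mentions held by each entity type.

    Assumption: `entity_types` lists one type label per entity mention, and
    dispersion is measured over normalized shares (which sum to 1).
    """
    counts = Counter(entity_types)
    if not counts:
        return 0.0
    total = sum(counts.values())
    shares = [count / total for count in counts.values()]
    return statistics.pstdev(shares)


print(entity_imbalance_degree(["PER", "LOC", "ORG"]))         # balanced -> 0.0
print(entity_imbalance_degree(["PER", "PER", "LOC", "ORG"]))  # skewed toward PER
```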
Entity-Null Rate evaluates the proportion of instances in the dataset that do not contain any entity. A lower Entity-Null Rate is preferred because it suggests that the dataset contains a richer set of examples for the model to learn from, with more instances that include entity information. It is defined as the number of entity-free instances divided by the total number of instances.
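A minimal sketch of this ratio is given below, assuming BIO-style tag sequences in which `"O"` marks tokens outside any entity; the tag convention and the function name are illustrative assumptions.

```python
def entity_null_rate(tag_sequences, outside="O"):
    """Fraction of instances whose tag sequence contains no entity tag."""
    if not tag_sequences:
        return 0.0
    entity_free = sum(all(tag == outside for tag in tags) for tags in tag_sequences)
    return entity_free / len(tag_sequences)


tags = [["O", "O"], ["B-PER", "O"], ["O"], ["B-LOC", "I-LOC"]]
print(entity_null_rate(tags))  # 2 of 4 instances have no entity -> 0.5
```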
5. Statistical dataset evaluation for NER
To validate our statistical dataset evaluation methods, we assess the quality of ten widely used NER datasets, including three English NER datasets and seven Chinese NER datasets. The evaluation results for ten NER datasets are shown in Table 1. Figure 2 presents the evaluation results of WNUT16, CoNLL03, Resume, and MSRA under different dimensions and metrics.
$^{-}$ indicates that the dataset and evaluation model scores have not been found on the Paperswithcode website, so the Model Differentiation of this dataset cannot be calculated. The upper rows are Chinese NER datasets, and the lower rows are English NER datasets. $^{\uparrow }$ indicates that the larger the value, the better the quality of the dataset on this metric. $^{\downarrow }$ indicates that the lower the value, the better the quality of the dataset on this metric.
5.1 Datasets
We provide the basic information about the datasets in Table 2.
Zh and En mean Chinese and English, respectively. It is important to note that OntoNotes 4 has four common tags in the Chinese dataset, although OntoNotes 4 has a total of eighteen tags (for the English dataset).
English NER datasets include the following: CoNLL03 NER (Sang and De Meulder 2003) is a classical NER evaluation dataset consisting of 1,393 English news articles. WNUT16 NER (Strauss et al. 2016) is provided by the second shared task at WNUT-2016 and consists of social media data from Twitter. OntoNotes5 (Weischedel et al. 2013) is a widely cited multi-genre English NER dataset collected from broadcast news, broadcast conversation, weblogs, and magazines.
Chinese NER datasets consist of the following: CLUENER (Xu et al. 2020), a well-defined NER dataset, includes finer-grained entity types beyond standard ones (person, organization, and location), such as Company, Game, and Book. OntoNotes4 (Weischedel et al. 2011), copyrighted by the Linguistic Data ConsortiumFootnote h (LDC), is a large manually annotated database containing various fields with structural information and shallow semantics. MSRA (Levow 2006) is a large NER dataset in the field of news, containing distinctive text structure characteristics. PeopleDaily NER Footnote i is a classic benchmark dataset for evaluating different NER models. Resume NER (Zhang and Yang 2018) features resumes of senior executives from Chinese stock market companies. It includes 1027 randomly selected resume summaries manually annotated for 8 entity types using the YEDDA system (Yang et al. 2018), with an inter-annotator agreement of 97.1 percent. Weibo NER (Peng and Dredze 2015; He and Sun 2017) is sourced from the Sina Weibo social media platform. WikiAnn (Pan et al. 2017) is the Chinese part of a multilingual NER dataset built from Wikipedia articles.
5.2 Settings
According to the metrics we proposed in Sec. 4, we calculate the statistical scores for each dataset. Specifically, we average the scores of the training, the development, and the test split of the datasets for Redundancy, Accuracy, Entity Ambiguity Degree, Entity Density, Entity Imbalance Degree, and Entity-Null Rate, respectively. For Leakage Ratio, Unseen Entity Ratio, and Model Differentiation, we only calculate the scores on the specific splits involved according to Sec. 4.1 and Sec. 4.2.
Red and Acc denote Redundancy and Accuracy, respectively. Zh and En mean Chinese and English, respectively.
5.3 Dataset reliability
5.3.1 Annotation Accuracy
Accuracy scores quantitatively inform us that we cannot take it for granted that all benchmark datasets are reliable.
We observe that CLUENER has the lowest Accuracy score. In particular, it has 0.17 (17 percent) errors in its development set (shown in Table 3). Conversely, the other datasets (e.g., Resume and WNUT16) have a relatively high Accuracy score for both Chinese and English NER datasets.
5.3.2 Leakage Ratio
The dataset’s shortcomings (under the reliability dimension) can be effectively revealed by the Leakage Ratio. Given the Leakage Ratio results, we are surprised to find that Weibo and WikiAnn have serious data leakage issues.
As shown in Table 1 and Fig. 3, 0.17 (17 percent) and 0.13 (13 percent) of the instances in the test set of Weibo and WikiAnn have appeared in their corresponding training or development sets, respectively.
5.3.3 Overall reliability
Combining several metrics under the reliability dimension in Table 1, we can conclude that Resume and MSRA maintain high reliability.
Specifically, there is no data redundancy in Resume and MSRA. That is to say, the instances of each part of the dataset are unique and non-repeating. Additionally, they achieve the highest Accuracy scores and hardly show data leakage problems, with a Leakage Ratio of 0.01 (1 percent) and 0.00 (0 percent), respectively.
5.4 Dataset difficulty
5.4.1 Unseen Entity Ratio
Results on Unseen Entity Ratio (UnSeenEnR) demonstrate the generalization ability of NER models on unseen entities.
The evaluation results show that Weibo and WNUT16 are more difficult in terms of UnSeenEnR because their test sets have a 0.56 (56 percent) and a 0.89 (89 percent) ratio of entities that have not appeared in training, respectively. Among the Chinese datasets, WikiAnn is second only to Weibo in how well it can evaluate the generalization ability of NER models. Conversely, PeopleDaily NER and OntoNotes5 are suboptimal for evaluating model generalization ability. Our experimental results in Sec. 6.5 reveal that models trained on them are more likely to perform better on seen entities than on those that have not appeared in the training set.
5.4.2 Entity Ambiguity Degree
Entity Ambiguity Degree (EnAmb) captures observable variation in the information complexity of datasets.
Given our findings, OntoNotes 4 and WNUT16 are the Chinese and English NER datasets with the highest Entity Ambiguity Degree, respectively, which means that they are more difficult for models to accurately predict entity types. Consistent with our conclusion (in Sec. 6.5), Bernier-Colborne and Langlais (2020) also argue that SOTA models cannot deal well with entities labeled differently in different contexts.
5.4.3 Model Differentiation
Extrinsic evaluation metrics, such as Model Differentiation, are also necessary for evaluating the difficulty of datasets.
Unlike the intrinsic evaluation metrics (e.g., Entity Ambiguity Degree), Model Differentiation (ModDiff) aims to assess the dispersion of model scores on a unified benchmark dataset. That is to say, a more difficult dataset should show a clear distinction between models with different abilities. As shown in Table 1, CLUENER and WNUT16 are the Chinese and English datasets that can better distinguish model performance, respectively.
5.4.4 Overall difficulty
WNUT16 is a more difficult benchmark for English NER as a whole.
Although WNUT16 has fewer citations than CoNLL03 and OntoNotes5, as demonstrated in Table 1, WNUT16 has a higher Entity Ambiguity Degree and Unseen Entity Ratio than the other two English NER datasets. Meanwhile, we find that the model performance gap on WNUT16 is large, indicating that it is more difficult and can effectively distinguish models with different performances.
5.5 Dataset validity
5.5.1 Entity Imbalance Degree
Datasets with uneven distribution of entity types may not effectively evaluate the ability of models on the long-tailed instances.
Intuitively, the model does not perform as well on long-tailed entity types as on other entity types. We observe that Weibo has the highest Entity Imbalance Degree (EnImBaD) by a large margin, indicating that its distribution of entity types is heavily uneven. Therefore, datasets with a severely uneven distribution of entity types can only evaluate the performance of models on the frequently occurring entity types.
5.5.2 Entity-Null Rate
Surprisingly, there are a large number of instances without any entities in many datasets such as OntoNotes4, MSRA, WNUT16, and OntoNotes5.
Although naturally distributed text will contain some sentences without named entities, a high number of entity-free samples in an NER dataset leaves too few entity-bearing instances for NER model validation.
5.5.3 Overall validity
In general, CoNLL03 is the English NER dataset with the highest validity. As shown in Table 1, CoNLL03 has the lowest Entity-Null Rate (EnNullR), indicating that it can intensively test the entity recognition capabilities of NER models.
6. How do dataset properties affect model performance?
To validate the metrics and results under our statistical evaluation frameworkFootnote j and to further investigate how the statistical metric scores on dataset properties affect the model performance, we conduct controlled dataset adjustment in this section.
6.1 Models
For experiments on Chinese NER datasets, we use three models: 1) Lattice-LSTM (Zhang and Yang 2018), based on LSTM networks (Chiu and Nichols 2016), which automatically identifies key words from the context; 2) Flat-Lattice (Li et al. 2020), which converts the lattice structure into a flat structure; and 3) Roberta (Liu et al. 2019), a transformer-based pretrained model which removes the next sentence prediction task in BERT.
For the English datasets, we also take three models, including: 1) LSTM CRF (Lample et al. 2016), a traditional model based on the bidirectional LSTM with conditional random fields (CRF); 2) LUKE (Yamada et al. 2020), which provides new pretrained contextualized representations of words and entities by predicting masked words and entities in an entity-annotated corpus based on the bidirectional transformer (Vaswani et al. 2017); and 3) W2NER (Li et al. 2020), which converts NER to word–word relationship classification and models the neighboring relations between entity words with Next-Neighboring-Word (NNW) and Tail-Head-Word (THW) relations.
6.2 Experiment settings
All the experiments are done on the NVIDIA RTX 2080 GPU and 3090 GPU and evaluated by seqeval.Footnote k Specifically, we utilize Micro F1 scores to measure the performance of the NER model. For the experiment with Train-Dev Dataset Adjustment (Sec. 6.4), we report the averaged results and variances over three random seeds.
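For readers unfamiliar with the measure, the following sketch shows how an entity-level Micro F1 is obtained by pooling true positives over all sentences before computing precision and recall. It operates on already extracted entity spans, given as hypothetical (start, end, type) tuples, whereas seqeval derives spans from the tag sequences itself; it is meant as an illustration of the metric, not as a replacement for that library.

```python
def micro_f1(gold_spans, pred_spans):
    """Entity-level Micro F1 over a corpus.

    gold_spans / pred_spans: one set of (start, end, type) tuples per
    sentence. A prediction is correct only if boundaries and type both match.
    """
    true_pos = sum(len(g & p) for g, p in zip(gold_spans, pred_spans))
    n_gold = sum(len(g) for g in gold_spans)
    n_pred = sum(len(p) for p in pred_spans)
    if true_pos == 0:
        return 0.0
    precision = true_pos / n_pred
    recall = true_pos / n_gold
    return 2 * precision * recall / (precision + recall)


gold = [{(0, 1, "PER"), (3, 3, "LOC")}, {(2, 4, "ORG")}]
pred = [{(0, 1, "PER")}, {(2, 4, "ORG"), (5, 5, "LOC")}]
# 2 correct spans out of 3 predicted and 3 gold: precision = recall = 2/3.
print(micro_f1(gold, pred))
```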
6.2.1 Hyperparameters
In our research, we concentrated on refining model parameters and embedding techniques to boost performance. We chose a non-BERT variant of the Flat-Lattice model, which we enhanced with a CRF layer on Roberta. We also utilized the most effective version of LSTM CRF, notable for its use of pretrained word embeddings, character-level word modeling, and an optimized dropout rate.
Consistent with observations by Lample et al. (2016), we found that models using pretrained word embeddings typically surpass those with randomly initialized embeddings. Thus, we experimented with various word embedding methods to cover a broad range of approaches, as elaborated in Sec. 6.2.2.
Our study utilized various optimization algorithms. For instance, the AdamW optimizer (Loshchilov and Hutter 2017) was used for models like W2NER, Roberta, and LUKE. In contrast, models such as Lattice-LSTM, LSTM CRF, and Flat-Lattice were fine-tuned using stochastic gradient descent (SGD). Notably, both LUKE and W2NER were further improved by combining AdamW with a learning rate warmup and linear decay strategy. LUKE also incorporated early stopping based on the development set performance. The specific hyperparameters for these models can be found in the Appendix.
6.2.2 Word embeddings
• Static Word Embeddings: In the realm of static word embeddings, Lattice-LSTM utilizes its unique word,Footnote l character, and character bigram embeddings.Footnote m However, since LSTM CRF’s own pretrained embedding was unavailable, we opted for common-crawl vectors from FastText.Footnote n Similarly, Flat-Lattice employed the same pretrained embeddings as Lattice-LSTM.
• Dynamic Word Embeddings: Dynamic word embeddings represent a significant advancement over static embeddings, as they are context-sensitive and capable of capturing varying meanings of words in different contexts. Our approach prominently featured BERT-based embeddings, known for their extensive integration of grammatical, lexical, and semantic information. LUKE, for instance, introduced new pretrained contextualized representations of words and entities using Roberta. W2NER used bert-large-cased for English datasets and bert-base-chinese for Chinese datasets, taking advantage of Roberta’s refined capabilities as an optimized version of BERT.
6.3 Model replication results
We replicated six NER models in accordance with the experimental setup. The replication results are presented in Tables 4 and 5.
repro. denotes our reproduction, and ori. denotes the results reported in the original paper. - denotes that the authors of the cited work did not experiment on that dataset.
6.4 Controlled dataset adjustment
To investigate how statistical properties affect model performance, we conducted two kinds of controlled dataset adjustment: (1) we modified the test set to create two new test sets of the same size with distinct values for a specific metric (Test Dataset Adjustment); (2) similarly, we adjusted the training and development sets to form new sets of the same size with clearly different metric values (Train-Dev Dataset Adjustment).
6.4.1 Test Dataset Adjustment
We adjusted the test set for three metrics: Leakage Ratio, Unseen Entity Ratio, and Entity Ambiguity Degree. For each metric, this produced two new test sets with distinct statistical values. For example, for the Unseen Entity Ratio, we constructed two new test sets, one with an Unseen Entity Ratio of 0.80 (80 percent) and the other with an Unseen Entity Ratio of 0.20 (20 percent), while ensuring that both contain the same number of instances.
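To make the quantity targeted in this adjustment concrete, the following sketch computes an Unseen Entity Ratio for a candidate test set. It assumes, for illustration only, that the ratio is the fraction of test entity mentions whose (surface form, type) pair never occurs in the training set; the function name and the toy data are hypothetical, and the formal definition given earlier in the paper takes precedence.

```python
def unseen_entity_ratio(test_instances, train_instances):
    """Fraction of test entity mentions not observed in the training set.

    Each instance is a list of (surface_form, entity_type) tuples.
    Assumption: an entity counts as "seen" only if the same
    (surface_form, entity_type) pair occurs in some training instance.
    """
    train_entities = {ent for inst in train_instances for ent in inst}
    test_mentions = [ent for inst in test_instances for ent in inst]
    if not test_mentions:
        return 0.0
    unseen = sum(1 for ent in test_mentions if ent not in train_entities)
    return unseen / len(test_mentions)


# Toy data, hypothetical.
train = [[("Paris", "LOC")], [("IBM", "ORG")]]
test = [
    [("Paris", "LOC"), ("Berlin", "LOC")],
    [("Google", "ORG")],
    [],
    [("IBM", "ORG"), ("Rome", "LOC")],
]
ratio = unseen_entity_ratio(test, train)
```

A candidate subset of the test set can be accepted or rejected by checking how close this value is to the target (0.80 or 0.20) at the required subset size.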
6.4.2 Train-Dev Dataset Adjustment
We first chose datasets with a high Entity-Null Rate (WNUT16 and OntoNotes5 for English; Weibo and OntoNotes4 for Chinese). We then filtered the training and development sets to obtain subsets with an Entity-Null Rate of 0.20 (20 percent) and 0.80 (80 percent), ensuring that the subsets contain equal numbers of instances. Finally, we trained various models on these subsets and compared their results on the same test set.
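The filtering step can be sketched as follows. This is a minimal illustration rather than the exact procedure we used: it assumes that the Entity-Null Rate is the fraction of instances containing no entity, and that a subset is drawn by random sampling without replacement. The function names, the dictionary layout of an instance, and the seed are hypothetical.

```python
import random


def entity_null_rate(instances):
    """Fraction of instances that contain no entity (assumed definition)."""
    if not instances:
        return 0.0
    return sum(1 for inst in instances if not inst["entities"]) / len(instances)


def resample_to_null_rate(instances, target_rate, size, seed=0):
    """Draw `size` instances whose Entity-Null Rate is close to `target_rate`."""
    rng = random.Random(seed)
    null = [inst for inst in instances if not inst["entities"]]
    non_null = [inst for inst in instances if inst["entities"]]
    n_null = round(target_rate * size)
    n_non_null = size - n_null
    if n_null > len(null) or n_non_null > len(non_null):
        raise ValueError("not enough instances to reach the target rate at this size")
    # Sample each pool separately, then shuffle so the order carries no signal.
    subset = rng.sample(null, n_null) + rng.sample(non_null, n_non_null)
    rng.shuffle(subset)
    return subset


# Toy pool, hypothetical: 100 instances without entities and 100 with one entity.
pool = [{"tokens": ["x"], "entities": []} for _ in range(100)]
pool += [{"tokens": ["y"], "entities": [("y", "PER")]} for _ in range(100)]
low = resample_to_null_rate(pool, target_rate=0.20, size=50)
high = resample_to_null_rate(pool, target_rate=0.80, size=50)
```

Using the same `size` for both calls keeps the two subsets equal in size, so that differences in downstream F1 are not confounded by the amount of training data.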
6.5 Experiment results and analysis
• Datasets with high Unseen Entity Ratio are more difficult for NER models: Intuitively, those entities that were seen during training are less challenging for NER models compared to those that did not appear in the training set. Figure 4 supports our intuition. Models perform better on datasets with a lower proportion of unseen entities than on datasets with a relatively high proportion of unseen entities.
• Entities with a high Entity Ambiguity Degree are indeed more likely to confuse the model: We can infer from Figure 5 that datasets with a high Entity Ambiguity Degree are more challenging for the models. For models tested on Chinese datasets, the average F1 score is 6.42 points higher on datasets with a low Entity Ambiguity Degree than on datasets with a high one. English NER models are more likely to be confused by entities with a high Entity Ambiguity Degree and to make wrong decisions.
• The models perform better as test set leakage increases, highlighting the need for better generalization in NER models: As shown in Table 6, three models (i.e., Lattice-LSTM, Flat-Lattice, and Roberta) consistently achieve better performance when the Leakage Ratio of the test set is 0.80 (80 percent) than when it is 0.20 (20 percent). In particular, the performance of Flat-Lattice on the Weibo test set with a Leakage Ratio of 0.80 (80 percent) exceeded its performance at 0.20 (20 percent) by a large margin of 25.69 percent. We speculate that a model performs better on a test set with a relatively high Leakage Ratio because it has already seen the leaked test data during training. Viewed from another perspective, these results suggest that researchers need to pay more attention to improving the generalization ability of NER models.
• Entity-Null Rate makes a small difference: As shown in Tables 7 and 8, models trained on training and development sets with an EnNullR of 0.20 (20 percent) achieve better F1 scores than those trained with an EnNullR of 0.80 (80 percent). We therefore conclude that, during training, instances without entities contribute less to the model than instances with entities. However, are instances without any entities completely useless for model training? To check, we deleted all such instances and report the results in Tables 7 and 8. The performance of models trained on these datasets decreases, which indicates that instances without entities are necessary, as they keep the distributions of the training set and the test set relatively consistent.
LSTM represents Lattice-LSTM.
7. Discussion
Our statistical evaluation framework can be used to analyze the factors that affect the dataset’s quality and, furthermore, to build a higher-quality dataset in a targeted manner or augment the data with statistical improvement guidance. In this section, we take an initial step to analyze how the dataset construction process affects the statistical properties of datasets.
As shown in Table 9, based on an overview of the literature that presented the ten NER datasets, we provide a summary of how they were built. We can see that all datasets were created manually, with the exception of CLUENER and WikiAnn. As for CLUENER, Xu et al. (Reference Xu, Dong, Liao, Yu, Tian, Liu and Zhang2020) prelabel their dataset using the distant-supervised approach with a vocabulary and then manually check and modify some labels. WikiAnn is constructed using a cross-lingual name tagging framework based on a series of new Knowledge Base (KB) mining methods (Pan et al. Reference Pan, Zhang, May, Nothman, Knight and Ji2017).
We observe from Table 1 that only two of the ten NER datasets, CLUENER and WikiAnn, had Acc scores below 0.90 (90 percent). This indicates that NER datasets that were not created entirely by hand tend to contain a significant number of annotation errors (shown in Figure 6).
8. Conclusion and future work
In this paper, we investigate various statistical properties of NER datasets and propose a comprehensive dataset evaluation framework with nine statistical metrics. We carry out a fine-grained evaluation of ten widely used NER datasets and provide a fair comparison of the existing datasets along three dimensions: reliability, difficulty, and validity. We further explore how the statistical properties of the training dataset influence model performance and how dataset construction methods affect dataset quality. In the future, we hope that more work will examine dataset quality evaluation from a broader and more general perspective.
Competing interests
The authors declare no other competing interest.
Appendix A. Validation details of the metrics under our statistical evaluation framework
We justify and clarify those metrics under our evaluation framework that we have not discussed further in the main text.
• Redundancy: International data standards demand that data be unique. An NLP dataset is a particular type of data and must adhere to the same criteria as other data types.
• Accuracy: Numerous studies have demonstrated that flaws in datasets negatively impact model performance (Zhu et al. Reference Zhu, Wu and Chen2003; Tejaswin et al. Reference Tejaswin, Naik and Liu2021; Gupta and Gupta Reference Gupta and Gupta2019). Model performance increases to some extent after these mistakes are fixed (Zeng et al. Reference Zeng, Yu, Yu, Jiang and Jiang2021).
• Text Complexity: Several experiments by Fu et al. (Reference Fu, Liu and Neubig2020) on English NER datasets support our use of Entity Density as a valid metric of dataset difficulty. Their experiments showed that the performance of NER models is negatively correlated with Entity Density.
• Model Differentiation: This extrinsic metric aims to assess the dispersion of model scores on a unified benchmark dataset. As long as enough models are evaluated on the dataset, we can measure the differentiation of a dataset by calculating the dispersion of the scores of different models.
• Entity Imbalance Degree: Many NLP tasks exhibit category imbalances that can seriously affect model performance on long-tail instances (Blevins and Zettlemoyer Reference Blevins and Zettlemoyer2020; Zhang et al. Reference Zhang, Kang, Hooi, Yan and Feng2023; Wang et al. Reference Wang, Wang, Cheng, Gan, Jia, Li and Liu2020). Measuring the Entity Imbalance Degree of an NER dataset is therefore both necessary and practical.
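As an illustration of the Model Differentiation item in the list above, the dispersion of model scores on one benchmark dataset can be computed as in the sketch below. The population standard deviation is used here only as one plausible dispersion measure, which is an assumption of this sketch; the function name and the scores are hypothetical.

```python
from statistics import pstdev


def model_differentiation(scores):
    """Dispersion of model scores on one dataset.

    Assumption: dispersion is measured as the population standard
    deviation of the per-model scores (e.g., F1).
    """
    if len(scores) < 2:
        raise ValueError("at least two model scores are needed")
    return pstdev(scores)


# Hypothetical F1 scores of several models on two datasets.
tight = model_differentiation([90.0, 90.0, 90.0])   # models are indistinguishable
spread = model_differentiation([80.0, 90.0])        # models are clearly separated
```

A dataset on which all models obtain nearly the same score yields a value close to zero and therefore separates models poorly.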
Appendix B. Details of manually checking data accuracy
We recommend sampling 100 instances from each dataset split and inviting at least three volunteer professional linguists to annotate their accuracy. Before the formal annotation began, we conducted face-to-face training, which included introducing the standards of data proofreading.
Appendix C. Specific hyperparameters for our selected evaluation models