1. Introduction
The extraction of entities, or named entity recognition (NER), from unstructured texts is a critical task in natural language processing (NLP). It aims to automatically identify mentions of specific things such as the names of persons (PER), organizations (ORG), and locations (LOC) (Levow 2006). In recent years, historical entity recognition, which involves additional types (e.g., the official title "OFI," a key concept in the Chinese political system), has played an important role in historical studies and projects, especially in the construction of Digital Humanities databases. Several automatic tools for Chinese historical entity extraction have been developed to replace traditional manual annotation. "MARKUS" (De-Weerdt 2020) is a well-known text annotation system. It is designed for string matching of named entities (i.e., personal names, place names, official titles, and time) against dictionary databases (e.g., CBDB and CHGIS) and supports users' custom annotation in an interactive environment. Besides, "CKIP Tagger" (available at https://github.com/ckiplab/ckiptagger), an open-source NLP tool based on deep neural network technology, achieves promising performance on NER tasks over Chinese (modern or ancient) and English corpora (Li, Fu, and Ma 2020).
However, existing research and applications still face the following issues. To begin with, most dictionary-based methods are domain-specific and fail to parse other historical corpora without further intervention (Bingenheimer 2015). Secondly, although the logograms of Chinese characters carry rich semantic knowledge (Meng et al. 2019), there are few deep learning (DL)-based studies on how the morphological features (i.e., subword information) of Chinese characters affect entity extraction from Chinese historical texts. This deserves far more attention. Finally, while several subword-based NER models (Luong, Socher, and Manning 2013; Botha and Blunsom 2014; Cao and Lu 2017) have been shown to be effective on English and modern Chinese corpora, their applicability to Chinese historical domains remains unknown, which makes it challenging to build a flexible integration method based on various essential subword features (see Figure 1).
To address the problems raised above, we present a hybrid DL-based model for Chinese historical entity extraction that incorporates subword information and an attention mechanism. On the one hand, we use gated operations to control information flow across inner neural units in a cascaded structure that blends subword-based and pretrained character-based representations. This improves the ability of a NER model to capture fine-granularity character semantic information in a comprehensive manner. On the other hand, an attention fusion design based on the self-attention mechanism is applied to the global contextual representation of each sentence, which further gathers relational information among different characters from a variety of perspectives. To demonstrate the effectiveness of our approach, relevant experiments are conducted on a monumental Chinese historical work, "資治通鑑" (Comprehensive Mirror for Aid in Government, CMAG; see https://en.wikipedia.org/wiki/Zizhi_Tongjian).
2. Related work
2.1. Deep neural network for Chinese entity extraction
Despite advances in NER based on traditional machine learning methods, such as maximum entropy (Tsai et al. 2004; Leong et al. 2008), support vector machines (Li et al. 2006), and conditional random fields (CRF) (Chen et al. 2006; Zhang, Xu, and Zhang 2008), recent studies have focused on sequential labeling under deep neural network architectures (Ma and Hovy 2016; E and Xiang 2017; Gui et al. 2019; Wu et al. 2019; Jia and Ma 2019; Zhu and Wang 2019), which have yielded advanced results on a variety of Chinese corpora. Ma and Hovy (2016), for instance, proposed a general end-to-end architecture, "CNN-LSTM-CRF," which has achieved good performance in NER tasks. Shijia et al. employed the "BiLSTM" network to learn the sequential representation of Chinese sentences, combining character embedding and word embedding, and their proposed model "CWNE" produced a roughly 9% F1-score improvement over earlier methods (E and Xiang 2017). Noticing that Chinese word conflicts in higher-level convolutional layers were thought to have a detrimental influence on predictive accuracy, Gui et al. (2019) used a rethinking process to handle them and incorporated lexicon knowledge into a CNN-based model ("LR-CNN"). To reduce the difficulty of out-of-vocabulary ("OOV") words and segmentation errors, Zhu and Wang (2019) offered an optimized solution based on CNN-LSTM-CRF called "CAN-NER," which merged local and global attention features into CNN-based and GRU-based representations, respectively. This model achieved a promising F1-score of 92.27% on the MSRA dataset.
However, many of the mentioned word-based DL models, such as CWNE (E and Xiang 2017) and LR-CNN (Gui et al. 2019), cannot be directly used for historical entity extraction because of the lack of suitable word segmentation for Chinese historical texts. In addition, some models are overly complex, even redundant, according to Ockham's Razor. For example, in CAN-NER, the enhancement from local attention features is quite limited, while requiring a significant number of additional training parameters. For this reason, our proposed model incorporates only the global attention mechanism, which needs fewer parameters.
2.2. Entity extraction in Chinese historical domains
Owing to their simplicity and speed, dictionary-based methods are commonly used to extract entities from Chinese historical texts. Similar to the aforementioned MARKUS (De-Weerdt 2020), "LoGaRT" (Peng, Cheng, and Chen 2018) provides multiple text processing functionalities, such as entity annotation, version manipulation, and task assignment for the massive corpora of "地方志" (CLG, Chinese Local Gazetteers), as well as modular support of regular expressions for professional users.
With the advancement of statistical language models, CRF-based methods have been applied to NER tasks in the Chinese historical domain (Xiong et al. 2014; Liu et al. 2015; Long et al. 2016). For example, Xiong et al. (2014) developed a CRF-based method for automatically identifying and extracting honorifics in pre-Qin literature, Tang Dynasty poems, and modern Chinese news reports, but its F1-scores were not very high due to substantial corpus noise. Liu et al. (2015) proposed an N-gram-based approach (Constrained N-Grams, CNGRAM) for preliminary automatic annotation using the China Biographical Database (CBDB) and then exploited multiple textual features (e.g., the original characters and relative positions of named entities in CBDB) to optimize the traditional CRF model, which showed good results on CLG.
As a typical DL-based technology, bidirectional long short-term memory (BiLSTM)-based solutions were presented to achieve better performance than CRF-based ones. Wu et al. (2018) adopted the BiLSTM-CRF architecture to recognize entities with a simple labeling strategy, "BIO," which reached an F1-score of 88.5%. Li et al. explored the "XOR" limitation of a stack of BiLSTM units and designed a more powerful model, "Cross-BiLSTM-CNN," using a cross-context representation of LSTM hidden units (Li, Fu, and Ma 2020). Their experiments showed competitive results compared to attention-based BiLSTM models. Besides, pretrained models such as BERT (Ji et al. 2020; Yu and Wang 2021) have exhibited impressive technological advancement in recent efforts on the entity extraction of Chinese historical texts. Nevertheless, the results of these tentative studies need to be further improved.
2.3. Subword-based approach for entity extraction
Many recent studies have examined the contribution of subword information in Chinese characters to the optimization of NER models (Yang et al. 2018; Cao et al. 2018; Xu et al. 2019; Yan and Wang 2020; Wu, Song, and Feng 2021). Yang et al. (2018) designed a Five-Stroke model based on the Wubi input method, which utilized the stroke information of Chinese characters to improve the original character-based representation. Cao et al. (2018) directly converted stroke sequential information into a hybrid embedding representation, which helped general DL-based NER models perform better. Xu et al. (2019) proposed a radical-based model combined with local representations at multiple granularities, yielding a higher-performing NER model. MECT is a recent advanced NER model using cross-transformers, in which the character-level representation is enhanced with knowledge of Chinese structural components (Wu, Song, and Feng 2021). Some studies looked into more effective neural architectures based on multi-feature ensemble approaches (Chaudhary et al. 2018; Zhang et al. 2019; Gong et al. 2020). For example, Gong et al. (2020) developed a tree-structured representation based on a hierarchical long short-term memory (HiLSTM) framework by extracting the features of characters, subwords, and context-aware predicted words. Their experiments showed that the proposed model achieved significant improvement over several benchmark methods.
Notwithstanding the promising outcomes of the above subword-based approaches on simplified Chinese texts, almost no research has applied them to Chinese historical entity extraction (i.e., on traditional Chinese corpora). The preliminary work of Yan and Wang (2020) paid some attention to the feature integration of character radicals and structures but ignored the deep semantic relationship between characters and their subwords, as well as the positive effects of implicit general knowledge from pretrained models. Besides, most subword-based ensemble solutions are complicated and time-consuming, and they lack detailed ablation studies on their novel components, which may reduce the credibility of their experimental results.
3. Methodology
3.1. The framework of model
To balance efficiency and complexity, we propose an efficient subword-based ensemble network (SEN) built on "CNN-LSTM-CRF" (Ma and Hovy 2016), which mainly consists of five layers, namely the input layer, the subword-based integrative layer (SIL), the sequential representation layer, the attention fusion layer (AFL), and the output layer. Compared to "CNN-LSTM-CRF," our model introduces two additional modules: SIL integrates meaningful subword information of characters, via multiple CNN blocks in a cascaded structure, into the character-based representation derived from a pretrained model, and AFL captures diverse relational knowledge among characters through the global mapping of different semantic spaces (see Figure 2). We elaborate on each layer of SEN in the following subsections.
3.2. Input layer
In the text preprocessing phase, we unify all character forms by converting simplified Chinese characters and variant characters to their regular traditional forms. Also, a heuristic reduction algorithm for extracting flat entities called "Longest Entity Match" (LEM) is developed, which tackles the problem of nested entities (Byrne 2007) in HTML-formatted texts, as illustrated in Algorithm 1. The method picks out the longest character string for a given entity, which preserves the wholeness of the entity semantics while remaining fast to implement (e.g., with the Python HTML parser BeautifulSoup4). It should be noted that the parameter maxDepth regulates the depth of traversal, which preserves efficiency and prevents stack overflows during iterations.
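Algorithm 1 gives the exact procedure; the snippet below is only a minimal Python sketch of the same longest-match idea, assuming for illustration that entities are marked with nested <entity> tags in the HTML source (the tag name and the ancestor-depth check stand in for the maxDepth logic of Algorithm 1).

```python
from bs4 import BeautifulSoup  # BeautifulSoup4 HTML parser

def longest_entity_match(html, entity_tag="entity", max_depth=5):
    """Keep only the outermost (longest) span when entity annotations are nested.

    Sketch of the LEM idea: an entity tag is kept only if no ancestor within
    max_depth levels is itself an entity tag, so the longest character string
    of each nested group survives.
    """
    soup = BeautifulSoup(html, "html.parser")
    entities = []
    for tag in soup.find_all(entity_tag):
        depth, parent, nested = 0, tag.parent, False
        while parent is not None and depth < max_depth:
            if parent.name == entity_tag:
                nested = True
                break
            parent, depth = parent.parent, depth + 1
        if not nested:
            entities.append(tag.get_text())
    return entities

# Hypothetical nested markup: a person name annotated inside an official title.
print(longest_entity_match("<p><entity>尚書左僕射<entity>房玄齡</entity></entity></p>"))
# -> ['尚書左僕射房玄齡']
```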
Additionally, for the one-hot representation of input sentences, we replace low-frequency and erroneous characters with the label "[UNK]" and pad each sentence with the label "[PAD]" to a maximum length $n$. Let $x$ denote an input sequence composed of characters $x_i$, which can be formulated as $x = (x_1, x_2, x_3, \ldots, x_i, \ldots, x_n)$.
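As a brief illustration of this step (the vocabulary and sentence below are hypothetical), a character sequence can be mapped to a padded id sequence as follows:

```python
def encode_sentence(chars, vocab, max_len):
    """Map characters to ids, replacing unseen characters with [UNK]
    and padding with [PAD] up to max_len."""
    ids = [vocab.get(ch, vocab["[UNK]"]) for ch in chars[:max_len]]
    ids += [vocab["[PAD]"]] * (max_len - len(ids))
    return ids

vocab = {"[PAD]": 0, "[UNK]": 1, "漢": 2, "王": 3}           # toy vocabulary
print(encode_sentence(list("漢王怒"), vocab, max_len=6))      # [2, 3, 1, 0, 0, 0]
```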
3.3. Subword-based integrative layer
The SIL is proposed to formulate a hybrid embedding representation with a stronger ability to capture token-level local representations enriched with character-based morphological features. It consists of three core components: a pretrained embedding network for character representation, a subword-based neural network that involves radical-based neural blocks (RNB) and character structure-based neural blocks (CSNB), and a hybrid cascaded network that offers a flexible network topology for integrating these subword-based blocks.
Pretrained embedding network. As the most typical fine-tuning-based model, BERT (Devlin et al. 2019) has been shown to be robustly effective on many downstream NLP tasks, especially NER. Built on a stack of powerful attention-based networks known as Transformers, the techniques of masked language modeling and next sentence prediction significantly strengthen BERT's contextual understanding. Let $e$ denote the BERT-based embedding of $x$ with $e = (e_1, e_2, e_3, \ldots, e_i, \ldots, e_n)$, which, in the standard encoder notation assumed here, can be written as follows:
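$e = \mathrm{BERT}\left(x\right)$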
Subword-based neural network. The proposed RNB aims to detect the internal graphemic similarity of different Chinese characters, since characters with the same radical usually have higher grammatical and semantic similarity (Sun et al. 2014). Moreover, similar to the idea of MECT (Wu, Song, and Feng 2021), we consider character structural information crucial for the relative positional encoding of the components of each character in our newly designed CSNB, which improves the semantic representation of a character's overall morphology. However, unlike MECT, which uses all the deconstructed structural components as the radical-level representation, our solution adopts only the radical component (i.e., 部首, Bushou) as the main ideographic unit (RNB) and the structural classification of Chinese characters as a complementary feature (CSNB). Compared to MECT, they help to provide more accurate semantic information for NER without erroneously overpredicting word analogies between visually similar characters (形近字, denoted as SNC). For example, according to the traditional Chinese Wu-Xing (Five-Element, 五行) Theory, the meaning of the character "葉" (leaf), which is related to person names, mainly depends on its radical component "艹" (grass) rather than its other structural components. Watson's investigation found that Chinese personal names have a unique quality: if one of the five elements is missing or not properly balanced for someone, his/her name may include a character with the radical for that element (Watson 1986). Likewise, "" (branch), with the same radical, often occurs in Chinese person names (e.g., a well-known diplomat of the Han Dynasty named ""—Wu Su, or a prefectural governor of the Tang Dynasty named ""—Shichang Su), but it is quite dissimilar to its SNC "" (butterfly), which is not usually used in ancient person entities. That is a merit of RNB which MECT does not possess. Considering another three similar characters, "湖" (Lake), "河" (River in Northern China), and "江" (River in Southern China), which are closely related to named entities (e.g., water areas), CSNB can further capture their fine-grained semantic differences and make the characters sharing the same "Left-Right" structure semantically closer to each other. Figure 3 illustrates this with an example.
Let $r$ and $s$ denote the radical-based embedding representation and the character structural embedding representation of the sequence $x$, respectively, where $r = (r_1, r_2, r_3, \ldots, r_i, \ldots, r_n)$ and $s = (s_1, s_2, s_3, \ldots, s_i, \ldots, s_n)$. Therefore, the outputs of RNB ($c_{RNB}$) and CSNB ($c_{CSNB}$) can be calculated based on the convolutional network ($Conv$), as follows:
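$c_{RNB} = Conv\left(r\right), \qquad c_{CSNB} = Conv\left(s\right)$

where $Conv(\cdot)$ denotes the convolution applied to the corresponding embedding sequence.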
Hybrid cascaded network. Unlike the superficial integration that simply concatenates different subword information (Xu et al. 2019; Zhang et al. 2019; Yan and Wang 2020) or the complicated semantic interaction between characters and a single subword with heavy computation (Wu, Song, and Feng 2021), we design a flexible gate-based cascaded architecture of appropriate complexity for subword-based integration. It is composed of multiple convolutional sub-networks, which incorporate different levels of subword-based blocks into the character embedding representation (see Figure 4). By concatenating the hybrid representations (i.e., the first, second, and third levels of combinational representation, denoted as $c_1$, $c_2$, and $c_3$), this network extracts different levels of combinational features $c$ layer by layer. We further strengthen the local nonlinear learning ability by using the Gated Linear Unit (GLU) (Dauphin et al. 2017), which allows crucial features to be selected adaptively and mitigates the vanishing gradient problem.
Let $e$ denote the character embedding of $x$ with $e \in \mathbb{R}^{V \times d}$, where $V$ is the vocabulary size and $d$ is the embedding dimension, and let $GLU_e^1$, $GLU_{RNB}^1$, and $GLU_{CSNB}^1$ denote the outputs of the different input features ($e$, $c_{RNB}$, and $c_{CSNB}$) under the $GLU$ activation at the first level. Hence, $c_1$ can be illustrated as follows:
where $\sigma $ is the sigmoid function and $Concat$ refers to the operation of concatenation. Similarly, assuming that $GLU_f^k$ denotes the $k$ -th level output of a feature $f$ with the GLU activation, a higher level of cascaded representation ${c_k}$ can be deduced based on the previous level ${c_{k - 1}}$ as:
where $F$ is the number of fusion features, which is derived from the number of input features (3 in our case). Note that the input features contain only one main feature (i.e., the character embedding), to which the other subword features (character radical and character structure) are concatenated.
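As a concrete illustration, the following Tensorflow sketch shows one plausible reading of this gate-based cascade; the layer sizes, kernel widths, and the exact placement of the GLU gates are assumptions for illustration rather than the reference implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

def glu_conv(x, filters, kernel_size=3):
    """Gated Linear Unit over a 1D convolution: Conv(x) * sigmoid(Conv_gate(x))."""
    linear = layers.Conv1D(filters, kernel_size, padding="same")(x)
    gate = layers.Conv1D(filters, kernel_size, padding="same", activation="sigmoid")(x)
    return linear * gate

def cascaded_subword_fusion(char_emb, radical_emb, struct_emb, filters=256, levels=3):
    """Sketch of the hybrid cascaded network: at each level, the previous fused
    representation and the subword features pass through GLU blocks and are
    concatenated, so higher levels mix progressively richer combinations."""
    fused = char_emb
    for _ in range(levels):
        fused = layers.Concatenate()([
            glu_conv(fused, filters),
            glu_conv(radical_emb, filters),
            glu_conv(struct_emb, filters),
        ])
    return fused

# Toy shapes: a batch of 2 sentences, 10 characters, 256-dimensional embeddings.
c = tf.random.normal([2, 10, 256])
r = tf.random.normal([2, 10, 256])
s = tf.random.normal([2, 10, 256])
print(cascaded_subword_fusion(c, r, s).shape)  # (2, 10, 768)
```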
3.4. Sequential representation layer
In the sequential representation layer, we adopt a bidirectional RNN (BiLSTM) (Hochreiter and Schmidhuber 1997), which combines forward and backward sequential information for encoding sentences (Zhou et al. 2019). LSTM utilizes a shared memory channel to retain historical information from previous time steps and establishes different gate mechanisms to avoid the gradient vanishing problem of general RNNs. Let $\overrightarrow{LSTM}$ and $\overleftarrow{LSTM}$ denote the forward and backward LSTM networks, respectively. The output of the BiLSTM-based sequential representation can be computed as the following formula:
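In the conventional bidirectional form (with the fused representation $c$ from the SIL assumed as input):

$l_i = \left[\overrightarrow{LSTM}\left(c_i\right);\ \overleftarrow{LSTM}\left(c_i\right)\right]$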
where $B$ is the output of BiLSTM with $B = (l_1, l_2, l_3, \ldots, l_n)$ and $l_i \in \mathbb{R}^{d_l}$, and $d_l$ is the output dimension of BiLSTM.
3.5. Attention fusion layer
Similar to many previous NER studies (Jia and Ma 2019; Zhu and Wang 2019; Jin et al. 2019), we first use multiple heads of self-attention networks in AFL, which enhances the cross-similarity representation of characters in a global sentence-level context. As a basic attention unit, SDPA (Scaled Dot-Product Attention) was proposed to quickly capture long-range dependencies under parallelized computation (Vaswani et al. 2017), which overcomes the shortcomings of RNN-based and CNN-based blocks. Let $Q_B$, $K_B$, and $V_B$ be the query, key, and value representations of $B$, respectively, obtained through different trainable linear mappings; the attention scores are ultimately produced by a softmax function ($softmax$). A single SDPA ($Att$) can be calculated as follows:
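Assuming the standard scaled dot-product form (Vaswani et al. 2017), with the scaling factor $\sqrt{d_l}$ used in this paper:

$Att\left(Q_B, K_B, V_B\right) = softmax\!\left(\dfrac{Q_B K_B^{T}}{\sqrt{d_l}}\right) V_B$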
Moreover, multi-head SDPA ($mul\_Att$) can be viewed as an extended version, which projects the relation representation into diverse dimensional spaces. Let $Att_j$ denote the $j$-th SDPA head, $h$ the number of heads, and $W_a \in \mathbb{R}^{hd_b \times d_b}$ a trainable parameter matrix; the heads are then concatenated to form $mul\_Att$:
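In the conventional concatenation-and-projection form:

$mul\_Att\left(B\right) = Concat\left(Att_1, Att_2, \ldots, Att_h\right) W_a$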
However, a dilemma known as the "Low-Rank Bottleneck" (Bhojanapalli 2020) exists, which is tied to the number of heads $h$: because the matrix $Q_B^{T}\cdot K_B$ is low-dimensional with only $2n \times (d_b/h)$ parameters, increasing $h$ limits the contextual representation of AFL ($2n \times (d_b/h) \ll n^2$), whereas reducing $h$ leads to the loss of its diverse expressive ability. To alleviate this, we exploit a fusion-weighted matrix $\tau$ ($\tau \in \mathbb{R}^{h \times h}$) to automatically adjust the distribution over different SDPAs, instead of using a direct concatenation operation. It improves the global representation by enforcing information interaction among the SDPAs. Letting the $j$-th matrix be $Z_j = Q_{B,(j)}^{T}\cdot K_{B,(j)}/\sqrt{d_l}$, the final output of AFL can be computed as:
where $W_z^T \in \mathbb{R}^{n \times n}$ and $b_z \in \mathbb{R}^{n}$ are the linear transformation parameters to be learned.
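To make the fusion idea concrete, the following Tensorflow sketch mixes the per-head score matrices $Z_j$ with a trainable $h \times h$ matrix $\tau$ before the softmax; the projection shapes are illustrative, and the final $W_z$/$b_z$ transformation described above is omitted, so this is an interpretation of the design rather than the reference implementation.

```python
import tensorflow as tf

def attention_fusion(B, h=8, d_b=256):
    """Sketch of AFL: compute per-head score matrices Z_j, mix them across
    heads with a trainable fusion matrix tau (h x h), then apply softmax and
    the value projections. The W_z / b_z transformation is omitted here."""
    n = B.shape[1]                       # assumes a static sentence length
    d_head = d_b // h
    # Trainable projections and the fusion-weighted matrix (illustrative init).
    Wq = tf.Variable(tf.random.normal([h, d_b, d_head], stddev=0.02))
    Wk = tf.Variable(tf.random.normal([h, d_b, d_head], stddev=0.02))
    Wv = tf.Variable(tf.random.normal([h, d_b, d_head], stddev=0.02))
    tau = tf.Variable(tf.eye(h))

    Q = tf.einsum("bnd,hdk->bhnk", B, Wq)              # (batch, h, n, d_head)
    K = tf.einsum("bnd,hdk->bhnk", B, Wk)
    V = tf.einsum("bnd,hdk->bhnk", B, Wv)
    Z = tf.einsum("bhik,bhjk->bhij", Q, K) / tf.sqrt(float(d_head))  # scores

    # Mix the h score matrices with tau instead of treating heads independently.
    Z_fused = tf.einsum("gh,bhij->bgij", tau, Z)
    out = tf.einsum("bhij,bhjk->bhik", tf.nn.softmax(Z_fused, axis=-1), V)
    # Merge the heads back into a (batch, n, d_b) representation.
    return tf.reshape(tf.transpose(out, [0, 2, 1, 3]), [-1, n, d_b])

B = tf.random.normal([2, 10, 256])       # toy batch: 2 sentences, 10 characters
print(attention_fusion(B).shape)         # (2, 10, 256)
```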
3.6. Output layer
In the decoding part, a CRF-based network (Lafferty, McCallum, and Pereira 2001) is adopted to generate the output probability of each label. Let $y$ denote the output labels with $y = (y_1, y_2, y_3, \ldots, y_i, \ldots, y_n)$. Given two probabilities based on the Markov assumption, namely the emission probability $Q_{i,y_i}$ assigned to the label $y_i$ and the transition probability $A_{y_{i-1},y_i}$ from $y_{i-1}$ to $y_i$, we can calculate the score of the CRF feature function using an addition operation:
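In the usual additive form assumed here:

$score\left(x, y\right) = \sum_{i=1}^{n}\left(Q_{i, y_i} + A_{y_{i-1}, y_i}\right)$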
Then the standardized score of a possible path ${P_{{y_i}|{x_i}}}$ can be further computed:
where $T_y$ is the number of entity labels. The objective function is based on maximum likelihood estimation, defined as Formula 16. During inference, we use the Viterbi algorithm (Forney 1973) to decode the optimal label sequence for prediction.
4. Experimental analysis
4.1. Dataset preparation
Data source. CMAG is the first comprehensive chronological history in China, spanning roughly 1400 years from 403 B.C. to 959 A.D. Following the historical timeline, the content of this masterpiece is divided into 16 classes of "Annals" (紀), each of which describes a range of important historical facts and events throughout a specific dynasty, as shown in Table 1.
To generate a standard historical corpus for NER, we collect the CMAG texts from a massive online dataset called Chinese Core Texts (CCT, available at http://www.xueheng.net/), which contains 200 Chinese classics (about 80 million characters), including the Twenty-Four Histories (二十四史), the Thirteen Confucian Classics (十三經), and so forth. These texts are then labeled and subsequently validated and revised by domain experts. A preliminary inter-rater agreement test on 2000 randomly selected sentences shows that the annotations are statistically consistent (Kappa = 0.875, P-value < 0.05). Moreover, our proposed method LEM for identifying flat historical entities reaches a conversion rate of 100%, demonstrating its soundness. The CMAG dataset is available online (https://github.com/MescoCoder/AncientChineseProject/).
Subword collection. We collect basic information on 25,974 common traditional Chinese characters from a popular public website named "Hancheng" (accessible at http://tool.httpcn.com/), including stroke order, radical components, character structure information, and Wubi coding (see https://en.wikipedia.org/wiki/Wubi_method), and thereby establish a subword-based lexicon database for Chinese characters (SLDCC). Figure 5 depicts the statistical distribution of radical components and character structures in SLDCC, with only the top 30 most frequent radical components displayed; the remaining ones are denoted as "other." "" (4.57%), "" (4.55%), and "" (4.55%) are the most commonly used radicals. The "left-right" structure is the most frequent morphological structure (68.53%), followed by the "up-bottom" structure (20.80%), while no other structure type exceeds 2%.
Data processing. We use the standard annotation scheme "BIOES" (Cho et al. 2013), in which "B", "I", "E", and "O" respectively signify the position of a character with respect to an entity (beginning, inside, ending, and outside), and "S" denotes a special type for single-character entities. Given Chinese historical knowledge, we consider three crucial entity categories (i.e., the names of persons as "PER," official titles as "OFI," and locations as "LOC").
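For instance, a hypothetical five-character sentence containing a three-character person name and a single-character location (purely illustrative, not drawn from CMAG) would be tagged as:

```python
# A hypothetical sentence: "諸葛亮至蜀" ("Zhuge Liang arrived at Shu").
chars  = ["諸", "葛", "亮", "至", "蜀"]
labels = ["B-PER", "I-PER", "E-PER", "O", "S-LOC"]   # BIOES tags with type suffixes
```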
Figure 6 shows the proportion of these three entity types as well as the total entity number (#Entity) in different annals, which reveals an imbalanced distribution of entities in CMAG. For example, owing to its largest number of volumes (#Volume), TD has the most entities, far outnumbering the other annals. The number of OFI entities in each annal is the lowest of all entity categories. As a result, we shuffle and partition all the sentences of the corpus into three experimental sets (the training set, the development set, and the testing set) using annal-based proportional sampling to avoid the imbalance bias of the data source. Detailed information about each set is shown in Table 2.
4.2. Experimental settings
Hyperparameters. For the embedding representation, the feature dimensions of the character representation, the radical-based representation, and the character structural representation are all set to 256. We construct four pretrained language models, namely Word2vec CBOW (Mikolov et al. 2013), GloVe (Pennington, Socher, and Manning 2014), BERTgoogle, and BERT (Devlin et al. 2019). The Word2vec CBOW and GloVe models are trained on the whole CCT collection, with a contextual window of 5 and an embedding dimension of 256. BERTgoogle refers to the official Google Chinese version, while BERT denotes a new version that is incrementally trained on BERTgoogle using the whole CCT collection. Because BERTgoogle is primarily oriented toward simplified Chinese texts, BERT, which captures more character semantics in both simplified and traditional Chinese, is a better fit for NER in Chinese historical texts. Additionally, our model uses CNNs with a kernel size of 3, a BiLSTM block with 100 hidden neural units, and an attention fusion network with 8 heads. The dropout rate is set to 0.2.
We repeat the training process of each model three times to ensure robust results; the total number of epochs, the mini-batch size, and the learning rate are set to 50, 64, and 0.001, respectively. During training, we choose Adam (Kingma and Ba 2014) as the network optimizer. To avoid overfitting, we use early stopping and L2-norm regularization, as in a previous study (Yan and Wang 2020). Finally, we report the average metrics over the repeated runs for evaluation. All the experiments are implemented in Tensorflow and run on Google Colab with a Tesla P100 GPU and 25G VRAM.
Baseline Models. To show the advantage of our model, we compare it to other prominent NER models in both the Chinese historical and general domains. These models can be categorized into four groups, namely the models with a CRF-based decoder (“CRF-D-M”), the models with an RNN-based decoder (“RNN-D-M”), the subword-based models (“S-M”), and the models based on BERT (“BERT-M”).
As for CRF-D-M, we take the traditional unigram-character CRF model ("CRF-Base") and two variant models as baselines: "CRF-Bigram" uses the bigram pattern as its feature template, while "CRF-Bigram-PS" further adds the part-of-speech features of characters to CRF-Bigram. Moreover, two DL-based encoders, "BiLSTM" and "CNN-BiLSTM" (Ma and Hovy 2016), are combined with a CRF-based neural network ("BiLSTM-CRF" and "CNN-BiLSTM-CRF"), where CRF serves as the sequential decoder of the whole architecture. As an alternative to CRF, RNN-D-M uses a block of RNNs as the decoder, which has strong predictive power for entity labels and converges faster during training. Here, we choose "Stack-CNN-BiLSTM" (Chiu and Nichols 2016) and "Cross-CNN-BiLSTM" (Li, Fu, and Ma 2020), since they have already shown promising results. As noted above, S-M captures numerous morphological features of Chinese characters that help models learn better. We implement the "Five-stroke" model with Chinese stroke structural information (Yang et al. 2018), the "ME-CNER" model with radical information (Cao et al. 2018), and the "Stroke-ngram" model with the n-gram sequential information of Chinese strokes (Xu et al. 2019). For BERT-M, we also consider simple variants such as BERTgoogle, BERT, and "BERT-BiLSTM-CRF" (Yu and Wang 2021), as well as subword-based BERT versions such as "Five-stroke+BERT," "ME-CNER+BERT," and "Stroke-ngram+BERT."
4.3. Comparative analysis
As a comprehensive evaluation measure, the F1-score ($F_1$) assesses the overall performance of a model by combining the precision score ($Precision$) and the recall score ($Recall$) (Li, Fu, and Ma 2020) (see Formula 15). According to the different calculation methods, F1-macro refers to the mean of $F_1$ over the different entity categories, while F1-micro is the $F_1$ value computed over all the entities as a whole:
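$F_1 = \dfrac{2 \times Precision \times Recall}{Precision + Recall}$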
Overall evaluation. We conduct several comparative experiments and evaluative analyses between our model SEN and the aforementioned models, as shown in Tables 3, 4, 5, and 6. SEN achieves the highest F1-scores, both in terms of F1-micro, denoted as "W-micro" (93.87%), and F1-macro, denoted as "W-macro" (89.70%), outperforming all other models. Specifically, SEN achieves better results (a 5.00% increase in F1-micro and a 7.19% increase in F1-macro) than the best CRF-D-M model (CNN-BiLSTM-CRF). As the best RNN-D-M model, Cross-CNN-BiLSTM utilizes two layers of BiLSTMs compared to only one BiLSTM block in SEN, yet its performance still lags behind. BERT does significantly improve the performance of common NER models and subword-based models. In particular, Stroke-ngram+BERT attains relatively high F1-micro (92.28%) and F1-macro (88.31%) scores, but it still remains inferior to SEN. Comparing the topological similarity of these models, we can infer that the performance gains are linked to our newly proposed blocks, namely SIL and AFL.
Entity evaluation. We explore the effectiveness of SEN across different entity categories and annals. First, we examine the F1 performance of the best models in CRF-D-M, RNN-D-M, S-M, and BERT-M, together with our model, on the three entity categories (PER, LOC, and OFI). Table 7 shows that the F1-score of our model remains ahead of the other models in every category, peaking at 94.04%, 94.13%, and 80.92% for LOC (W-LOC), PER (W-PER), and OFI (W-OFI), respectively. The strong performance on LOC is clearly visible, while SEN obtains its largest gain of 12.53% on OFI compared to CNN-BiLSTM-CRF. This suggests that SEN is less sensitive than other models to categories with fewer annotated labels.
Second, we analyze the details of SEN's performance in relation to entity numbers and annal types, as shown in Figure 7. The Pearson correlation between the F1-score (at both the micro and macro level) and the number of entities per annal is not statistically significant (P-value > 0.05), which implies that the total number of entities may not be an influential factor for the performance of our model. Similar trends are observed for the PER, OFI, and LOC types, but their correlation with the F1-score is significantly positive ($r_{pearson}$ = 0.8086, P-value < 0.05). In addition, a joint correlation analysis between the two sub-figures in Figure 7 shows that the W-micro score and the F1-score of OFI are significantly positively correlated ($r_{pearson}$ = 0.9876, P-value < 0.05); thus, the distinctive F1-score fluctuation of SEN across CMAG annals is most likely driven by the OFI category, which has the smallest entity number.
4.4. Ablation studies
To demonstrate the effectiveness of our proposed novel blocks, we carry out ablation experiments on the crucial neural blocks that previous NER studies on Chinese historical texts have overlooked.
Comparative analysis of crucial neural blocks. Table 8 presents the ablation results. When SIL is removed from SEN, the overall F1-score drops dramatically (5.25%↓ for F1-micro and 5.89%↓ for F1-macro), compared to a smaller decline when AFL is removed (0.62%↓ for F1-micro and 1.07%↓ for F1-macro). If SIL and AFL are both eliminated from SEN, the performance falls further, to 88.59% for F1-micro and 82.82% for F1-macro. This shows that incorporating the subword-based integrative network and the attention fusion mechanism in a reasonable manner can improve a DL-based NER model, and that the subword-based integrative information (i.e., SIL) plays the more important role in this improvement.
Impact of SIL. We conduct additional experiments on the effectiveness of the proposed subword-based network and the BERT-based mechanism. The results of "SEN–SILBERT" and "SEN–SILSubword" in Table 7 indicate that neglecting either BERT's character embedding or SEN's subword-based embedding undermines the model performance. However, when SEN is used in conjunction with other traditional pretrained approaches (i.e., GloVe and Word2vec CBOW), there is no discernible improvement in performance. This demonstrates the benefits of BERT and the effectiveness of our proposed subword-based network in Chinese historical NER tasks.
Additionally, we investigate the impact of the number of cascaded levels in the subword-based network on the model's F1-score. Figure 8 shows that as the cascade level increases, each F1-score generally rises, though the change between the two-level and three-level cascades is not significant. This indicates that the cascaded structural integration enhances the overall performance, but with diminishing marginal utility.
Impact of the AFL. To understand the contribution of our proposed attention fusion mechanism, we compare two methods: our proposal, denoted as "Attention Fusion," and the normal multi-head attention network, denoted as "Normal Attention." According to the evaluation results in Table 9, the contribution of Normal Attention is very small, since its performance is nearly equal to that of "SEN–AFL," which excludes AFL, in Table 8. This may be attributed to the reuse of multi-head attention in BERT. Attention Fusion, by contrast, can still improve the ultimate performance in this scenario, showing the significance of weighted integration among different heads of attention subnetworks.
Furthermore, we visualize the attention weights in AFL for a deeper probe, comparing the two forms of attention. Considering three example sentences of different lengths in Table 10, we obtain the visual presentation of their attention weights. The results in Figure 9 indicate that, regardless of sentence length, both methods capture long-range semantic relations within a sentence that the CNN-based SIL and the BiLSTM-based sequential representation layer cannot obtain. Compared to Normal Attention, Attention Fusion has a better capacity for accurate and diverse representation of the key information related to entities. For example, besides "彭" (Peng) as part of a LOC entity for "Pengcheng County," more meaningful characters are attended to only by Attention Fusion, such as "江" (Jiang) and "周" (Zhou) as parts of a LOC entity for "Jiangdu" and a PER entity for "Duke of Zhou," respectively. Also, Attention Fusion pays close attention to potential feature characters such as "" (served_for), "" (warned), and "姓" (has_a_family_name), which are overlooked by Normal Attention. In fact, these are functional characters that are more likely to appear in the vicinity of an entity, which is quite beneficial to the generalization ability of a NER model. Besides, Normal Attention produces a more uniform weight distribution in longer sentences, which may lead to the failure of the attention mechanism, whereas such weakness is not obvious in Attention Fusion.
4.5. Case studies
To clearly show the advantageous performance of our model, we conduct several case studies with two of the aforementioned advanced approaches, BERT and Stroke-ngram+BERT, and three versions of our model. From Table 11, we find that SEN produces the most accurate entity recognition across all categories and annals. The following example comparison and error analysis provide some explanation for this advantage.
On the one hand, SEN correctly identifies potential entities with lower frequency (e.g., the OFI entity "" in SuD and "" in TD) that BERT and Stroke-ngram+BERT erroneously ignore. For BERT, Stroke-ngram+BERT, and "SEN–SIL," which takes no account of subword-based integration, some labels are predicted incorrectly, probably due to the ambiguity of high-frequency non-entity tokens in particular sentence positions. For instance, according to the syntax of nominal coordinate structures, the character "" (meaning "always" as an adverb here) is easily misidentified as a PER entity, since it is followed by a conjunction "" (meaning "and") and a person name "" in HD. SEN successfully avoids this, possibly because the character has no obvious entity-related radical or structural information suggesting that it belongs to a person entity. This supports the effectiveness of the proposed SIL.
On the other hand, the models without the attention fusion mechanism, such as BERT, Stroke-ngram+BERT, and "SEN–AFL," suffer from missing entity labels. For example, they fail to recognize consecutive low-frequency OFI entities like "" and "", since high entity-label probabilities are assigned only to some high-frequency characters, which yields incomplete entities such as "" and "". This indicates that the proposed attention fusion of SEN (i.e., AFL) has a superior ability to discover longer-range semantic associations among different Chinese characters than the normal multi-head attention network.
5. Conclusion
In this study, we propose SEN, a new NER model adopting a hybrid end-to-end neural architecture that includes an input layer, a SIL, a BiLSTM-based sequential layer, an AFL, and a CRF-based output layer. Beyond the common layers shared with other popular models, SIL strengthens the character-level representation hierarchically by incorporating the knowledge of BERT as well as radical and character structural information, while AFL further captures diverse global relational semantics among different characters through the fusion of attention networks.
To show the effectiveness of SEN, we first utilize the MARKUS tool to construct a standard NER corpus, CMAG, based on the great Chinese historical work Comprehensive Mirror for Aid in Government. During the construction of CMAG, a heuristic reduction algorithm called LEM is developed to solve the problem of nested entities. We then carry out several experiments on CMAG, which indicate that SEN outperforms other popular models in terms of both F1-micro and F1-macro, including BERT-based and other subword-based models. Specifically, SEN performs best on all the given entity types, obtaining its largest performance improvement on OFI, which indicates that SEN is less sensitive than other DL-based models to the problem of imbalanced classes. Additionally, the correlation analysis shows that the performance of our model varies considerably across annals, which is mainly caused by the OFI category with the smallest entity number.
From the ablation studies, we learn that although both SIL and AFL contribute to the improvement of our model, SIL is the more critical one. The effectiveness of the BERT-based and subword-based blocks in SIL has also been demonstrated, with the cascaded structure of SIL being an important factor (i.e., the higher the cascade level, the better SEN performs). For AFL, we can clearly see the superiority of the proposed attention fusion in representing entity semantic information in a global way, compared to normal multi-head attention networks. The visualization of attention weights offers clear evidence and reliable interpretability for the accurate prediction ability of our attention fusion, which was not discussed in previous NER works on Chinese history.
To the best of our knowledge, this research is one of the earliest works on subword-enhanced entity extraction in the Chinese historical domain, effectively integrating the subword information of Chinese characters with the global optimization of the attention mechanism. It provides a promising NER solution with state-of-the-art performance, which can also be extended to other digital humanities domains.
Acknowledgments
This research is supported by China Postdoctoral Science Foundation (No. 2021M703564) and National Social Science Foundation of China (No. 18CTQ041).