
SEN: A subword-based ensemble network for Chinese historical entity extraction

Published online by Cambridge University Press:  22 December 2022

Chengxi Yan
Affiliation:
School of Information Resource Management, Renmin University of China, Beijing, China Research Center for Digital Humanities of RUC, Beijing, China
Ruojia Wang*
Affiliation:
School of Management, Beijing University of Chinese Medicine, Beijing, China
Xiaoke Fang
Affiliation:
College of Applied Arts and Science, Beijing Union University, Beijing, China
*
*Corresponding author. E-mail: [email protected]

Abstract

Understanding various historical entity information (e.g., persons, locations, and time) plays a very important role in reasoning about the development of historical events. With increasing interest in the fields of digital humanities and natural language processing, named entity recognition (NER) provides a feasible solution for automatically extracting these entities from historical texts, especially in Chinese historical research. However, previous approaches are domain-specific, ineffective with relatively low accuracy, and non-interpretable, which hinders the development of NER in Chinese history. In this paper, we propose a new hybrid deep learning model called the “subword-based ensemble network” (SEN), which incorporates subword information and a novel attention fusion mechanism. Experiments on a massive self-built Chinese historical corpus, CMAG, show that SEN achieves the best performance, with 93.87% for F1-micro and 89.70% for F1-macro, compared with other advanced models. Further investigation reveals that SEN has a strong generalization ability for NER on Chinese historical texts: it is not only relatively insensitive to categories with fewer annotation labels (e.g., OFI) but can also accurately capture diverse local and global semantic relations. Our research demonstrates the effectiveness of integrating subword information and attention fusion, which provides an inspiring solution for the practical use of entity extraction in the Chinese historical domain.

Type
Article
Copyright
© The Author(s), 2022. Published by Cambridge University Press

1. Introduction

The extraction of entities (or named entity recognition, NER) from unstructured texts is a very critical task in the field of natural language processing (NLP), which aims to automatically identify useful names or symbols of specific things such as the names of persons (PER), organizations (ORG), and locations (LOC) (Levow 2006). In recent years, historical entity recognition that involves more types (e.g., the official title “OFI,” a key concept in the Chinese political system) has played a quite important role in relevant historical studies and projects, especially in the construction of Digital Humanities databases. Some automatic tools for Chinese historical entity extraction were developed to replace traditional manual annotation. “MARKUS” (De Weerdt 2020) is a well-known text annotation system. It is designed for the string matching of named entities (i.e., personal names, place names, official titles, and time) through dictionary databases (i.e., CBDB and CHGIS) and supports users’ custom annotation in an interactive environment. Besides, “CKIP Tagger” (available at https://github.com/ckiplab/ckiptagger), an open-source NLP tool based on deep neural network technology, can achieve promising performance on NER tasks over Chinese (modern or ancient) or English corpora (Li, Fu, and Ma 2020).

However, the existing research and applications still encounter the following issues. To begin with, most dictionary-based methods are domain-specific and fail to parse other historical corpora without further intervention (Bingenheimer 2015). Secondly, although the logograms of Chinese characters carry rich semantic knowledge (Meng et al. 2019), there are few deep learning (DL)-based studies about the influence of the morphological features (i.e., subword information) of Chinese characters on the entity extraction of Chinese historical texts. This deserves far more attention. Finally, while several specific subword-based NER models (Luong, Socher, and Manning 2013; Botha and Blunsom 2014; Cao and Lu 2017) have been shown to be effective on English and modern Chinese corpora, their applicability to Chinese historical domains remains unknown, which makes building a flexible integration method based on various essential subword features challenging (see Figure 1).

Figure 1. Entity information in Sima Guang’s incomplete manuscript of CMAG (owned by the National Library of China). Note that the official entity “” (OFI) may have semantic relations with the subword information of each of its characters.

To address the problems raised above, we present a hybrid DL-based model for Chinese historical entity extraction that incorporates subword information and an attention mechanism. On the one hand, we use gated operations to control information flow across inner neural units in a cascaded structure that blends subword-based and pretrained character-based representations. This can improve the ability of a NER model to capture fine-granularity character semantic information in a comprehensive manner. On the other hand, an attention fusion design based on the self-attention mechanism is applied to the global contextual representation of each sentence, which further gathers relational information among different characters from a variety of perspectives. To demonstrate the effectiveness of our approach, relevant experiments are conducted on a monumental Chinese historical work, “” (Comprehensive Mirror for Aid in Government, CMAG, see https://en.wikipedia.org/wiki/Zizhi_Tongjian).

2. Related work

2.1. Deep neural network for Chinese entity extraction

Despite advances in NER based on traditional machine learning methods, such as maximum entropy (Tsai et al. 2004; Leong et al. 2008), support vector machines (Li et al. 2006), and conditional random fields (CRF) (Chen et al. 2006; Zhang, Xu, and Zhang 2008), recent studies have focused on sequential labeling under the architecture of deep neural networks (Ma and Hovy 2016; E and Xiang 2017; Gui et al. 2019; Wu et al. 2019; Jia and Ma 2019; Zhu and Wang 2019), which have yielded advanced results on a variety of Chinese corpora. Ma and Hovy, for instance, proposed a general end-to-end architecture, “CNN-LSTM-CRF,” that has achieved good performance in NER tasks (Ma and Hovy 2016). Shijia et al. employed the “BiLSTM” network to learn the sequential representation of Chinese sentences, combining character embedding and word embedding, and their proposed model “CWNE” produced a roughly 9% F1-score improvement over earlier methods (E and Xiang 2017). Noticing that Chinese word conflicts in higher-level convolutional layers have a detrimental influence on predictive accuracy, Gui et al. handled them with a rethinking process and incorporated lexicon knowledge into a CNN-based model (“LR-CNN”) (Gui et al. 2019). To reduce the difficulty of out-of-vocabulary (“OOV”) words and segmentation errors, Zhu and Wang (2019) offered an optimized solution based on CNN-LSTM-CRF called “CAN-NER,” which merged local and global attention features, respectively, into CNN-based and GRU-based representations. This model achieved a promising F1-score of 92.27% on the MSRA dataset.

However, many of the mentioned DL-based (word-based) models such as CWNE (E and Xiang 2017) and LR-CNN (Gui et al. 2019) cannot be directly used for historical entity extraction because of the lack of suitable word segmentation for Chinese historical texts. In addition, some models are too complex, even redundant, according to Ockham’s Razor principle. For example, in CAN-NER the enhancement from local attention features is quite limited, yet it requires a significant number of additional training parameters. For this reason, our proposed model incorporates only the global attention mechanism (with fewer parameters).

2.2. Entity extraction in Chinese historical domains

Owing to their simplicity and speed, dictionary-based methods have been used to extract entities from Chinese historical texts. Similar to the aforementioned MARKUS (De Weerdt 2020), “LoGaRT” (Peng, Cheng, and Chen 2018) provides multiple text processing functionalities, such as entity annotation, version manipulation, and task assignment for massive corpora of “” (CLG, Chinese Local Gazetteers), as well as modular support of regular expressions for professional users.

With the advancement of statistical language models, CRF-based methods have been applied to NER tasks in the Chinese historical domain (Xiong et al. 2014; Liu et al. 2015; Long et al. 2016). For example, Xiong et al. developed a CRF-based method for automatically identifying and extracting honorifics in pre-Qin literature, Tang Dynasty poems, and modern Chinese news reports (Xiong et al. 2014). However, their F1-scores are not very high due to substantial corpus noise. Liu et al. proposed an Ngram-based approach (Constrained N-Grams, CNGRAM) for preliminary automatic annotation using the China Biographical Database (CBDB) and then further exploited multiple textual features (i.e., original characters and relative positions of named entities in CBDB) to optimize the traditional CRF model, which showed good results on CLG (Liu et al. 2015).

As a typical DL-based technology, bidirectional long short-term memory (BiLSTM)-based solutions were presented to achieve better performance than CRF-based ones. Wu et al. adopted the architecture of BiLSTM-CRF to recognize entities with a simple labeling strategy, “BIO,” which reached an F1-score of 88.5% (Wu et al. 2018). Li et al. explored the “XOR” limitation of a stack of BiLSTM units and designed a more powerful model, “Cross-BiLSTM-CNN,” using a cross-context representation of LSTM hidden units (Li, Fu, and Ma 2020). Their experiment showed competitive results compared to attention-based BiLSTM models. Besides, pretrained models such as BERT (Ji et al. 2021; Yu and Wang 2021) have exhibited impressive technological advancement in recent efforts on the entity extraction of Chinese historical texts. Nevertheless, the results of these tentative studies need to be further improved.

2.3. Subword-based approach for entity extraction

Many recent studies have examined the contribution of subword information in Chinese characters to the optimization of NER models (Yang et al. 2018; Cao et al. 2018; Xu et al. 2019; Yan and Wang 2020; Wu, Song, and Feng 2021). Yang et al. designed a Five-Stroke model based on the Wubi input method, which utilized the stroke information of Chinese characters to improve the original character-based representation (Yang et al. 2018). Cao et al. directly converted stroke sequential information into a hybrid embedding representation, which helped general DL-based NER models perform better (Cao et al. 2018). Xu et al. proposed a radical-based model combined with a number of local representations at multiple granularities, yielding a NER model with higher performance (Xu et al. 2019). MECT is a recent advanced NER model using cross-transformers, in which the character-level representation is enhanced with the knowledge of Chinese structural components (Wu, Song, and Feng 2021). Some studies looked into more effective neural architectures based on multi-feature ensemble approaches (Chaudhary et al. 2018; Zhang et al. 2019; Gong et al. 2020). For example, Gong et al. developed a tree-structure representation based on a hierarchical long short-term memory (HiLSTM) framework by extracting the features of characters, subwords, and context-aware predicted words (Gong et al. 2020). Their experiments showed that the proposed model achieved significant improvement over several benchmark methods.

Notwithstanding the attractive and outstanding results of the above subword-based approaches on simplified Chinese texts, almost no research on Chinese historical entity extraction (i.e., on traditional Chinese corpora) using these approaches has been reported. Yan’s rudimentary work (Yan and Wang 2020) paid some attention to the feature integration of character radical and structure but unfortunately ignored the deep semantic relationship between characters and their subwords, as well as the positive effects of implicit general knowledge from pretrained models. Besides, most subword-based ensemble solutions are complicated and time-consuming, and they lack detailed ablation studies on their novel components, which may reduce the credibility of experimental results.

3. Methodology

3.1. The framework of model

To balance efficiency and complexity, we propose an efficient subword-based ensemble network (SEN) based on “CNN-LSTM-CRF” (Ma and Hovy 2016), which mainly consists of five layers, namely the input layer, the subword-based integrative layer (SIL), the sequential representation layer, the attention fusion layer (AFL), and the output layer. Compared to “CNN-LSTM-CRF,” our model introduces two additional modules: SIL integrates meaningful subword information of characters, via multiple CNN blocks in a cascaded structure, into the character-based representation from a pretrained model, and AFL captures diverse relational knowledge among characters through the global mapping of different semantic spaces (see Figure 2). We elaborate on each layer of SEN in the following subsections.

Algorithm 1. Longest Entity Match

Figure 2. The architecture of SEN.

3.2. Input layer

In the phase of text preprocessing, we unify all the character forms by converting simplified Chinese characters and variant characters to their regular traditional Chinese forms. Also, a heuristic reduction algorithm for extracting flat entities called “Longest Entity Match” (LEM) is developed, which can tackle the problem of nested entities (Byrne 2007) in html-formatted texts, as illustrated in Algorithm 1. The method picks out the longest character-based string for a given entity, which ensures the wholeness of the entity semantics and the speed of implementation (e.g., with a Python html parser named BeautifulSoup4). It should be noted that the parameter maxDepth is designed to regulate the depth of traversal, which preserves efficiency and prevents stack overflow during iterations.
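The idea of LEM can be sketched as follows; the entity tag names, the maxDepth default, and the BeautifulSoup-based traversal are illustrative assumptions rather than the exact implementation of Algorithm 1:

```python
from bs4 import BeautifulSoup

def longest_entity_match(html_text, entity_tags=("per", "loc", "ofi"), max_depth=10):
    """Keep only the outermost (longest) span when entity tags are nested."""
    soup = BeautifulSoup(html_text, "html.parser")
    entities = []
    for tag in soup.find_all(list(entity_tags)):
        # Walk up at most max_depth ancestors; drop this tag if an ancestor
        # is itself an entity tag (i.e., this span is nested inside a longer one).
        parent, depth, nested = tag.parent, 0, False
        while parent is not None and depth < max_depth:
            if parent.name in entity_tags:
                nested = True
                break
            parent, depth = parent.parent, depth + 1
        if not nested:
            entities.append((tag.name.upper(), tag.get_text()))
    return entities

# The inner PER span is discarded; the whole LOC span is kept.
print(longest_entity_match("<s><loc><per>A</per>BC</loc>D</s>"))  # [('LOC', 'ABC')]
```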

Additionally, for the one-hot representation of input sentences, we replace those low-frequency and erroneous characters with the label of “[UNK]” and pad each sentence with the label of “[PAD]” to a maximum length $n$ . Let $x$ denote an input sequence composed of characters ${x_i}$ , which then can be formulated as $x = \left( {{x_1},{x_2},{x_3}, \ldots ,{x_i},{\rm{\;}} \ldots ,{x_n}} \right)$ .
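A minimal sketch of this encoding step (with a toy vocabulary) could look as follows:

```python
def encode_sentence(chars, vocab, max_len):
    """Map characters to ids, substituting [UNK] for out-of-vocabulary
    characters and padding with [PAD] up to max_len."""
    ids = [vocab.get(c, vocab["[UNK]"]) for c in chars[:max_len]]
    ids += [vocab["[PAD]"]] * (max_len - len(ids))
    return ids

vocab = {"[PAD]": 0, "[UNK]": 1, "司": 2, "馬": 3, "光": 4}
print(encode_sentence(list("司馬光"), vocab, max_len=5))  # [2, 3, 4, 0, 0]
```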

3.3. Subword-based integrative layer

The SIL is proposed to formulate the hybrid embedding representation that has a more powerful ability to capture the token-level local representation with rich character-based morphological features. It can be divided into three core components, that is, a pretrained embedding network for character representation, a subword-based neural network that involves radical-based neural blocks (RNB) and character structure-based neural blocks (CSNB), and a hybrid cascaded network that offers a flexible network topology for the integration of these subword-based blocks.

Pretrained embedding network. As the most typical fine-tuning-based model, BERT (Devlin et al. 2019) has been shown to be robustly effective on many downstream NLP tasks, especially NER. The techniques of masked language modeling and next sentence prediction significantly strengthen BERT’s contextual understanding, given that BERT stacks many powerful attention-based networks known as “Transformers.” Let $e$ denote the BERT-based embedding of $x$ with $e = \left( {{e_1},{e_2},{e_3}, \ldots ,{e_i},{\rm{\;}} \ldots ,{e_n}} \right)$, as follows:

(1) \begin{align} e = BERT\!\left( x \right)\end{align}
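For illustration, character-level contextual embeddings in the sense of Equation (1) can be obtained as follows; the Hugging Face checkpoint “bert-base-chinese” is only a public stand-in for the paper’s BERT, which is further pretrained on CCT and whose hidden size may differ from 768:

```python
from transformers import BertTokenizerFast, TFBertModel

# e = BERT(x): one contextual vector per input character.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
bert = TFBertModel.from_pretrained("bert-base-chinese")

inputs = tokenizer("司馬光", return_tensors="tf")
e = bert(**inputs).last_hidden_state   # shape (1, seq_len, 768)
```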

Subword-based neural network. The proposed RNB aims to detect the internal graphemic similarity of different Chinese characters, since characters with the same radical usually have higher grammatical and semantic similarity (Sun et al. 2014). Moreover, similar to the idea of MECT (Wu, Song, and Feng 2021), we also regard character structural information as crucial in the relative positional encoding of the components of each character in our newly designed CSNB, which can improve the semantic representation of a character’s overall morphology. However, unlike MECT, which uses all the deconstructed structural components as the radical-level representation, our solution adopts only the radical component (i.e., “”—Bushou) as the main ideographic unit (RNB) and the structural classification of Chinese characters as a complementary feature (CSNB). Compared with MECT, these can provide more accurate semantic information for NER without erroneously overpredicting the word analogy of near-shaped characters (i.e., “”, denoted as SNC). For example, according to the conventional Chinese Wu-Xing Theory, or Five-Element Theory (i.e., “”), the meaning of the character “” (leaf), which is related to person names, mainly depends on its radical component “” (grass) rather than on the other structural components. As Watson’s investigation found, Chinese personal names have a unique quality: if one of the five elements is missing or not properly balanced for someone, his/her name may then include a character with the radical for that element (Watson 1986). Likewise, “” (branch), with the same radical, often occurs in Chinese personal names (e.g., a well-known diplomat of the Han Dynasty named “”—Wu Su, or a prefectural governor of the Tang Dynasty named “”—Shichang Su) but is quite dissimilar to its SNC “” (butterfly), which is not usually used in an ancient person entity. That is a merit of RNB which MECT does not possess. Considering another three similar characters, “” (Lake), “” (River in Northern China), and “” (River in Southern China), which are closely related to named entities (e.g., water areas), CSNB can further capture their fine-grained semantic differences and make the characters (“” and “”) with the same “Left-Right” structure semantically closer to each other. Figure 3 illustrates this with an example.

Figure 3. Subword-based neural block.

Let $r$ , $s$ denote the radical-based embedding representation and the character structural embedding representation of sequence $x$ , respectively, where $r = \left( {{r_1},{r_2},{r_3}, \ldots ,{r_i},{\rm{\;}} \ldots ,{r_n}} \right)$ and $s = \left( {{s_1},{s_2},{s_3}, \ldots ,{s_i},{\rm{\;}} \ldots ,{s_n}} \right)$ . Therefore, the output of RNB ( ${c_{RNB}}$ ) and CSNB ( ${c_{CSNB}}$ ) can be calculated based on the convolutional network ( $Conv$ ), as follows:

(2) \begin{align} {c_{RNB}} = Conv\!\left( r \right) \end{align}
(3) \begin{align} {c_{CSNB}} = Conv\!\left( s \right) \end{align}
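A hedged Keras sketch of one such block (RNB or CSNB) with the paper’s kernel size of 3 is shown below; the layer choices and the radical vocabulary size are illustrative assumptions:

```python
import tensorflow as tf

def subword_block(subword_ids, vocab_size, emb_dim=256, filters=256, kernel_size=3):
    """One subword-based neural block: embed the radical (or structure) id of
    each character and apply a 1-D convolution, as in Equations (2)-(3)."""
    emb = tf.keras.layers.Embedding(vocab_size, emb_dim)(subword_ids)            # (batch, n, emb_dim)
    return tf.keras.layers.Conv1D(filters, kernel_size, padding="same")(emb)     # (batch, n, filters)

radical_ids = tf.keras.Input(shape=(None,), dtype=tf.int32)   # one radical id per character
c_rnb = subword_block(radical_ids, vocab_size=300)            # assumed radical vocabulary size
```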

Hybrid cascaded network. Unlike the superficial integration using a simple concatenation of different subword information (Xu et al. 2019; Zhang et al. 2019; Yan and Wang 2020) or the computationally heavy semantic interaction between characters and a single subword (Wu, Song, and Feng 2021), we design a flexible gate-based cascaded architecture with appropriate complexity for subword-based integration. It is composed of multiple convolutional sub-networks, which incorporate different levels of subword-based blocks into the character embedding representation (see Figure 4). By concatenating the hybrid representations (i.e., the first, second, and third levels of combinational representation, denoted as ${c_1}$, ${c_2}$, and ${c_3}$), this network extracts different levels of combinational features $c$ layer by layer. We further strengthen the local nonlinear learning ability by using the Gated Linear Unit (GLU) (Dauphin et al. 2017), which allows for the intelligent selection of crucial features and mitigates the vanishing gradient problem.

Figure 4. Hybrid cascaded network.

Let $e$ denote the character embedding of $x$ with $e \in R^{V\times d}$ (where $V$ is the size of the vocabulary and $d$ is the embedding dimension), and let $GLU_e^1$, $GLU_{RNB}^1$, and $GLU_{CSNB}^1$ denote the outputs of the different input features ($e$, ${c_{RNB}}$, and ${c_{CSNB}}$) with the $GLU$ activation at the first level. Hence, ${c_1}$ can be computed as follows:

(4) \begin{align} GLU_{RNB}^{1} = c_{RNB} \cdot (1-\sigma(c_{RNB})) + \sigma(c_{RNB}) \cdot c_{RNB} \end{align}
(5) \begin{align} GLU_{CSNB}^{1} = c_{CSNB} \cdot (1-\sigma(c_{CSNB})) + \sigma(c_{CSNB}) \cdot c_{CSNB} \end{align}
(6) \begin{align} GLU_{e}^{1} = e \cdot (1-\sigma(e)) + \sigma(e) \cdot e \end{align}
(7) \begin{align} {c_1} = Concat\!\left( {GLU_{RNB}^1,GLU_{CSNB}^1,\;GLU_e^1} \right) \end{align}

where $\sigma $ is the sigmoid function and $Concat$ refers to the operation of concatenation. Similarly, assuming that $GLU_f^k$ denotes the $k$ -th level output of a feature $f$ with the GLU activation, a higher level of cascaded representation ${c_k}$ can be deduced based on the previous level ${c_{k - 1}}$ as:

(8) \begin{align} GLU_{f}^{k} = c_{k-1} \cdot (1-\sigma(c_{k-1})) + \sigma(c_{k-1}) \cdot c_{k-1} \end{align}
(9) \begin{align} c_{k} = Concat\!\left( GLU_{1}^{k}, GLU_{2}^{k}, \ldots, GLU_{f}^{k}, \ldots, GLU_{F}^{k} \right) \end{align}

where $F$ is the number of fused features (here, 3). Note that the input features contain only one main feature (i.e., the character embedding), which is the target with which the other subword features (character radical and character structure) are concatenated.
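A rough TensorFlow sketch of the gate-based cascade is shown below. It uses the standard GLU of Dauphin et al. (2017), in which one linear projection gates another through a sigmoid, and a simplified cascading rule; the exact gating and concatenation order inside SEN may differ:

```python
import tensorflow as tf

class GLU(tf.keras.layers.Layer):
    """Gated Linear Unit (Dauphin et al. 2017): value(x) * sigmoid(gate(x))."""
    def __init__(self, units):
        super().__init__()
        self.value = tf.keras.layers.Dense(units)
        self.gate = tf.keras.layers.Dense(units)

    def call(self, x):
        return self.value(x) * tf.sigmoid(self.gate(x))

def cascaded_fusion(e, c_rnb, c_csnb, units=256, levels=3):
    """Build the hybrid representation level by level: GLU-activate the
    character embedding and the two subword features, concatenate them, and
    keep gating/re-concatenating at higher levels (a simplified sketch of
    Equations (4)-(9))."""
    c = tf.keras.layers.Concatenate()([GLU(units)(e), GLU(units)(c_rnb), GLU(units)(c_csnb)])
    for _ in range(levels - 1):
        c = tf.keras.layers.Concatenate()([GLU(units)(c), GLU(units)(e)])
    return c
```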

3.4. Sequential representation layer

In the sequential representation layer, we adopt a powerful bidirectional RNN (BiLSTM) (Hochreiter and Schmidhuber 1997), which combines forward and backward sequential information for encoding sentences (Zhou et al. 2019). LSTM utilizes a shared memory channel to retain historical information from previous time steps and establishes different gate mechanisms to avoid the gradient vanishing problem of general RNNs. Let $\overrightarrow {LSTM} $ and $\overleftarrow {LSTM} $ denote the forward and backward LSTM networks, respectively. The output of the BiLSTM-based sequential representation can be computed as follows:

(10) \begin{align} B = Concat\!\left( {\overrightarrow {LSTM} \!\left( c \right),\overleftarrow {LSTM} \!\left( c \right)} \right) \end{align}

where $B$ is the output of BiLSTM with $B = \left( {{l_1},{l_2},{l_3}, \ldots ,{l_n}} \right)$ and $l_{i} \in R^{d_{l}}$, where ${d_l}$ is the output dimension of BiLSTM.
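A one-line TensorFlow sketch of this layer with the paper’s setting of 100 hidden units per direction (the dummy input shape is illustrative):

```python
import tensorflow as tf

# Forward and backward outputs are concatenated, so the output dimension d_l is 200.
c = tf.random.normal((2, 50, 256))   # dummy SIL output: (batch, n, feature_dim)
bilstm = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(100, return_sequences=True))
B = bilstm(c)                        # (2, 50, 200)
```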

3.5. Attention fusion layer

Similar to many previous NER studies (Jia and Ma 2019; Zhu and Wang 2019; Jin et al. 2019), we first use multiple heads of self-attention networks in AFL, which can enhance the cross-similarity representation of characters in a global sentence-level context. As a basic attention unit, SDPA (Scaled Dot-Product Attention) was proposed to quickly capture long-range dependencies under parallelized computation (Vaswani et al. 2017), which overcomes the shortcomings of RNN-based and CNN-based blocks. Let ${Q_B}$, ${K_B}$, and ${V_B}$ be the query, key, and value representations of $B$, respectively, obtained through different trainable linear mappings; the attention scores are then produced by a softmax function ($softmax$). A single SDPA ($Att$) can be calculated as follows:

(11) \begin{align} Att = softmax\!\left(\frac{Q_{B}^{T}\cdot K_{B}}{\sqrt{d_{l}}} \right) \cdot V_{B} \end{align}
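A minimal TensorFlow sketch of a single SDPA head, following Equation (11) (without the padding mask a complete implementation would also need), is:

```python
import tensorflow as tf

def sdpa(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_l)) V."""
    d = tf.cast(tf.shape(K)[-1], tf.float32)
    scores = tf.matmul(Q, K, transpose_b=True) / tf.sqrt(d)   # (batch, n, n)
    return tf.matmul(tf.nn.softmax(scores, axis=-1), V)       # (batch, n, d)
```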

Moreover, multi-head SDPA ($mul\_Att$) can be viewed as an extended version, which projects the relation representation into diverse dimensional spaces. Let $At{t_j}$ denote the $j$-th SDPA and $h$ the number of heads; with a trainable parameter matrix $W_{a} \in R^{hd_{b} \times d_{b}}$, we can concatenate them into $mul\_Att$:

(12) \begin{align} mul\_Att = W_{a}^{T} \cdot Concat(Att_{1},Att_{2},\ldots,Att_{j},\ldots,Att_{h})\end{align}

However, a dilemma known as the “Low-Rank Bottleneck” (Bhojanapalli et al. 2020) exists, which is tied to the number of heads $h$: because the low-dimensional matrix $Q_{B}^{T}\cdot K_{B}$ has only $2n \times \left( {{d_b}/h} \right)$ parameters, increasing $h$ limits the contextual representation of AFL ($2n \times \left( {{d_b}/h} \right) \ll {n^2}$), whereas reducing $h$ leads to the loss of its diverse expressive ability. To alleviate this, we exploit a fusion-weighted matrix $\tau$ ($\tau \in R^{h \times h}$) to automatically adjust the distribution of different SDPAs, instead of using a direct concatenation operation. It can improve the global representation by enforcing information interaction among SDPAs. Letting the $j$-th matrix be $Z_{j} = Q_{B,(j)}^{T}\cdot K_{B,(j)}/\sqrt{d_{l}}$, the final output of AFL can be computed as:

(13) \begin{align} Att\_fusion = W_{a}^{T} \cdot Concat\!\left(\left(W_{z}^{T} \cdot softmax\!\left(\tau \cdot \left[Z_{1},Z_{2},\ldots,Z_{j},\ldots,Z_{h}\right]\right)+b_{z}\right) \cdot V_{B,(j)}\right) \end{align}

where $W_z^T$ ( ${R^{n \times n}}$ ) and ${b_z}$ ( ${R^n}$ ) are the linear transformation parameters to be learned.
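The following TensorFlow sketch conveys the idea of the fusion-weighted attention under our own simplifying assumptions: the per-head score matrices $Z_j$ are mixed by a trainable $h \times h$ matrix $\tau$ before the softmax, while the extra $W_z$/$b_z$ transformation of Equation (13) is folded into ordinary dense layers. It is not the exact AFL implementation:

```python
import tensorflow as tf

def attention_fusion(B, num_heads=8, d_l=200):
    """Mix the per-head score matrices with a learnable h x h matrix tau before
    the softmax, instead of using each head independently as in normal
    multi-head attention (a hedged sketch of the fusion idea)."""
    d_h = d_l // num_heads
    Wq = [tf.keras.layers.Dense(d_h) for _ in range(num_heads)]
    Wk = [tf.keras.layers.Dense(d_h) for _ in range(num_heads)]
    Wv = [tf.keras.layers.Dense(d_h) for _ in range(num_heads)]
    tau = tf.Variable(tf.eye(num_heads))                          # fusion-weighted matrix
    Z = tf.stack([tf.matmul(Wq[j](B), Wk[j](B), transpose_b=True) / tf.sqrt(float(d_h))
                  for j in range(num_heads)], axis=1)             # (batch, h, n, n)
    Z_mixed = tf.einsum("ij,bjnm->binm", tau, Z)                  # information interaction among heads
    probs = tf.nn.softmax(Z_mixed, axis=-1)
    heads = [tf.matmul(probs[:, j], Wv[j](B)) for j in range(num_heads)]
    return tf.keras.layers.Dense(d_l)(tf.concat(heads, axis=-1))  # final linear mapping W_a

out = attention_fusion(tf.random.normal((2, 50, 200)))            # (2, 50, 200)
```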

3.6. Output layer

In the decoding part, a CRF-based network (Lafferty, McCallum, and Pereira 2001) is adopted to generate the output probability of each label. Let $y$ denote the output labels with $y = \!\left( {{y_1},{y_2},{y_3}, \ldots ,{y_i}, \ldots ,{y_n}} \right)$. Given two probabilities based on the Markov assumption, namely the emission probability ${Q_{i,{y_i}}}$ assigned to the label ${y_i}$ and the transition probability ${A_{{y_{i - 1}},{y_i}}}$ from ${y_{i - 1}}$ to ${y_i}$, we can calculate the score of a CRF feature function using the addition operation:

(14) \begin{align} s\!\left( {{x_i},{y_i}} \right) = \mathop \sum \nolimits_{i = 1}^n {Q_{i,{y_i}}} + \mathop \sum \nolimits_{i = 1}^{n - 1} {A_{{y_{i - 1}},{y_i}}} \end{align}

Then the standardized score of a possible path ${P_{{y_i}|{x_i}}}$ can be further computed:

(15) \begin{align} {P_{{y_i}|{x_i}}} = \;\frac{{{e^{s\left( {{x_i},{y_i}} \right)}}}}{{\mathop \sum \nolimits_{j = 1}^{{T_y}} {e^{s\left( {{x_i},{y_j}} \right)}}}} \end{align}

where ${T_y}$ is the number of entity labels. The objective function is based on maximum likelihood estimation, defined as Formula (16). During inference, we use the Viterbi algorithm (Forney 1973) to decode the label sequence and make label predictions.

(16) \begin{align} \mathop{argmax}_{y_{i} \in y,\, x_{i} \in x} P_{y_i|x_{i}} \end{align}
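For illustration, a minimal NumPy sketch of Viterbi decoding over the scores of Formula (14) is given below; the emission and transition matrices in the toy check are hypothetical values, not the trained parameters of SEN:

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Viterbi decoding: `emissions` is an (n, T) matrix of emission scores Q,
    `transitions` a (T, T) matrix of transition scores A; returns the
    highest-scoring label index path."""
    n, T = emissions.shape
    score = emissions[0].copy()
    backptr = np.zeros((n, T), dtype=int)
    for i in range(1, n):
        total = score[:, None] + transitions + emissions[i][None, :]   # (T, T)
        backptr[i] = total.argmax(axis=0)
        score = total.max(axis=0)
    path = [int(score.argmax())]
    for i in range(n - 1, 0, -1):
        path.append(int(backptr[i][path[-1]]))
    return path[::-1]

# Toy check with 2 labels and 3 positions (zero transition scores).
print(viterbi_decode(np.array([[1., 0.], [0., 2.], [1., 0.]]), np.zeros((2, 2))))  # [0, 1, 0]
```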

4. Experimental analysis

4.1. Dataset preparation

Data source. CMAG is the first chronological history in Chinese historiography, spanning roughly 1400 years from 403 B.C. to 959 A.D. Following the historical timeline, the content of this masterpiece can be divided into 16 classes of “Annals” (), each of which describes a range of important historical facts and events throughout a specific dynasty, as shown in Table 1.

Table 1. The information of annals in CMAG

To generate a standard historical corpus for NER, we collect CMAG texts from a massive online dataset called Chinese Core Texts (CCT, available at http://www.xueheng.net/), which contains 200 Chinese classics (about 80 million characters), including the Twenty-four Histories (), the Thirteen Confucian Classics (), and so forth. These texts are then labeled, and the annotations are subsequently validated and revised by domain experts. A preliminary test of inter-rater agreement on 2000 randomly selected sentences shows that the annotations are statistically consistent (Kappa = 0.875, P-value < 0.05). Moreover, our proposed method “LEM” for identifying flat historical entities reaches a conversion rate of 100%, demonstrating its soundness. The CMAG dataset can be found online (available at https://github.com/MescoCoder/AncientChineseProject/).

Subword collection. After collecting basic character information for 25,974 common traditional Chinese characters from a popular public website named “Hancheng” (accessible at http://tool.httpcn.com/), which includes stroke order, radical components, character structure information, and Wubi coding (see https://en.wikipedia.org/wiki/Wubi_method), we establish a subword-based lexicon database for Chinese characters (SLDCC). Figure 5 depicts the statistical results of radical components and character structures in SLDCC, with only the top 30 most frequently occurring radical components displayed; the remaining ones are denoted as “other.” Obviously, “” (4.57%), “” (4.55%), and “” (4.55%) are the most commonly used radicals. The “left-right” structure is the most frequent morphological structure (68.53%), followed by the “up-bottom” structure (20.80%), and no other type of character structure exceeds 2%.

Figure 5. The statistic results of SLDCC.

Data processing. In addition, we use a standard annotation scheme, “BIOES” (Cho et al. 2013), where “B”, “I”, “E”, and “O” respectively mark the positional role of a character within an entity (beginning, inside, ending, and outside), and “S” denotes a single-character entity. Given Chinese historical knowledge, we consider three crucial entity categories (i.e., the names of persons as “PER,” the names of official titles as “OFI,” and the names of locations as “LOC”).
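For clarity, the span-to-BIOES conversion can be sketched as follows (the entity spans in the example are hypothetical):

```python
def to_bioes(length, entities):
    """Convert (start, end, type) entity spans (end exclusive) into BIOES tags
    over a sentence of `length` characters."""
    tags = ["O"] * length
    for start, end, etype in entities:
        if end - start == 1:
            tags[start] = f"S-{etype}"
        else:
            tags[start] = f"B-{etype}"
            tags[end - 1] = f"E-{etype}"
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{etype}"
    return tags

print(to_bioes(6, [(0, 3, "PER"), (4, 6, "LOC")]))
# ['B-PER', 'I-PER', 'E-PER', 'O', 'B-LOC', 'E-LOC']
```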

Figure 6 shows the proportion of these three entity types as well as the total entity number (#Entity) in different annals, which implies an imbalanced distribution of entities in CMAG. For example, owing to its largest number of volumes (#Volume), TD has the most entities, far outnumbering the other annals. The number of OFI in each annal is the lowest of all entity categories. As a result, we shuffle and partition all the sentences of the corpus into three experimental sets (the training set, the development set, and the testing set) using annal-based proportional sampling to avoid the imbalance bias of the data source. The detailed information about each set is shown in Table 2.

Figure 6. The statistic results of entities by annals.

4.2. Experimental settings

Hyper Parameters. For the embedding representation, the feature dimensions of the character representation, radical-based representation, and character structural representation are all set to 256. We construct four pretrained language models, namely Word2vec CBOW (Mikolov et al. 2013), GloVe (Pennington, Socher, and Manning 2014), BERTgoogle, and BERT (Devlin et al. 2019). The Word2vec CBOW and GloVe models are trained on the whole CCT collection, with a contextual window of 5 and an embedding dimension of 256. BERTgoogle refers to the official Google Chinese version, while BERT denotes a new version that is incrementally trained on BERTgoogle using the whole CCT collection. Because BERTgoogle is primarily geared toward simplified Chinese texts, BERT, which captures more character semantics in both simplified and traditional Chinese, is a better fit for NER on Chinese historical texts. Additionally, our model uses CNNs with a kernel size of 3, a BiLSTM block with 100 hidden neural units, and an attention fusion network with 8 heads. The dropout rate is set to 0.2.

We repeat the training process of each model three times to ensure robust results, in which the total number of epochs, the mini-batch size, and the learning rate are set to 50, 64, and 0.001, respectively. During training, we choose Adam (Kingma and Ba 2014) as the network optimizer. To avoid overfitting, we use early stopping and L2-norm regularization as in previous work (Yan and Wang 2020). Finally, we compute the average metrics of the models’ performance after training for evaluation purposes. All the experiments are implemented in TensorFlow and run on Google Colab with a Tesla P100 GPU and 25 GB of memory.

Table 2. The detailed information about three sets

Baseline Models. To show the advantage of our model, we compare it to other prominent NER models in both the Chinese historical and general domains. These models can be categorized into four groups, namely the models with a CRF-based decoder (“CRF-D-M”), the models with an RNN-based decoder (“RNN-D-M”), the subword-based models (“S-M”), and the models based on BERT (“BERT-M”).

As for CRF-D-M, we take the traditional unigram-character CRF model (“CRF-Base”) and two variant models as baselines. For example, “CRF-Bigram” uses the bigram pattern as the feature template, while the part-of-speech features of characters are added to “CRF-Bigram-PS” on top of CRF-Bigram. Moreover, two more DL-based encoders, “BiLSTM” and “CNN-BiLSTM” (Ma and Hovy 2016), are implanted into the CRF-based neural network (“BiLSTM-CRF” and “CNN-BiLSTM-CRF”), where CRF is viewed as the sequential decoder of the whole architecture. As an alternative to CRF, RNN-D-M uses a block of RNNs as the decoder, which has strong predictive power for entity labels and provides faster convergence during training. Here, we choose “Stack-CNN-BiLSTM” (Chiu and Nichols 2016) and “Cross-CNN-BiLSTM” (Li, Fu, and Ma 2020), since they have already shown promising results. As mentioned above, S-M captures numerous morphological features of Chinese characters, which helps models learn better. We implement the “Five-stroke” model with Chinese stroke structural information (Yang et al. 2018), the “ME-CNER” model with radical information (Cao et al. 2018), and the “Stroke-ngram” model with the n-gram sequential information of Chinese strokes (Xu et al. 2019). For BERT-M, we also consider simple variants such as BERTgoogle, BERT, and “BERT-BiLSTM-CRF” (Yu and Wang 2021), as well as the subword-based BERT versions “Five-stroke+BERT,” “ME-CNER+BERT,” and “Stroke-ngram+BERT.”

4.3. Comparative analysis

As a comprehensive evaluation measurement, the F1-score (${F_1}$) assesses the overall performance of a model by combining the precision score ($Precision$) and the recall score ($Recall$) (Li, Fu, and Ma 2020) (see Formula (17)). According to the two calculation methods, F1-micro is the ${F_1}$ value computed over all the entities as a whole, while F1-macro is the mean of the per-category ${F_1}$ scores:

(17) \begin{align} F_{1} = 2 \times \frac{Precision \cdot Recall}{Precision + Recall} \end{align}
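As an illustration of the two averaging schemes, a toy label-level computation is sketched below (using scikit-learn, which is an assumption on our part; the paper’s entity-level evaluation would first convert BIOES tags back into entity spans):

```python
from sklearn.metrics import f1_score

# F1-micro pools all predictions together; F1-macro averages per-category F1 scores.
y_true = ["PER", "PER", "LOC", "OFI", "LOC", "PER"]
y_pred = ["PER", "LOC", "LOC", "OFI", "LOC", "PER"]
print(f1_score(y_true, y_pred, average="micro"))   # single pooled score
print(f1_score(y_true, y_pred, average="macro"))   # mean of per-category scores
```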

Overall evaluation. We conduct several comparative experiments and evaluative analyses between our model SEN and the aforementioned models, shown in Tables 3, 4, 5, and 6. SEN achieves the highest F1-scores, both in terms of F1-micro, denoted as “W-micro” (93.87%), and F1-macro, denoted as “W-macro” (89.70%), outperforming all other models. Specifically, SEN achieves better results (a 5.00% increase for F1-micro and a 7.19% increase for F1-macro) than the best CRF-D-M model (CNN-BiLSTM-CRF). As the best RNN-D-M model, Cross-CNN-BiLSTM utilizes double layers of BiLSTMs compared to only one BiLSTM block in SEN, but its performance still lags behind. BERT does significantly improve the performance of common NER models and subword-based models. In particular, Stroke-ngram+BERT attains relatively higher F1-micro (92.28%) and F1-macro (88.31%) scores, but it still remains inferior to SEN. Comparing the topological similarity of these models, we can infer that the performance gains are linked to our newly proposed blocks, namely SIL and AFL.

Table 3. Comparison with CRF-D-M models

Table 4. Comparison with RNN-D-M models

Table 5. Comparison with S-M models

Table 6. Comparison with BERT-M models

Entity evaluation. We explore the effectiveness of SEN across different entity categories and annals. First, we examine the F1 performance of the best models in CRF-D-M, RNN-D-M, S-M, and BERT-M, and of our model, on three entity categories (PER, LOC, and OFI). Table 7 shows that the F1-score of our model remains ahead of the other models in every category. The performance peaks at 94.04%, 94.13%, and 80.92%, respectively, for LOC (W-LOC), PER (W-PER), and OFI (W-OFI). The high score on LOC is clearly visible, while SEN shows the largest gain of 12.53% on OFI compared to CNN-BiLSTM-CRF. This suggests that SEN is less sensitive than the others to categories with fewer annotation labels.

Table 7. The F1-score on different entity categories

Figure 7. The F1-score of SEN on CMAG annals.

Second, we continue to analyze the details of SEN’s performance, combining the entity numbers and annal types, as shown in Figure 7. The Pearson correlation between the F1-score (at both the micro and macro level) and the number of entities per annal is not statistically significant (P-value > 0.05), which implies that the total number of entities may not be an influential factor for the performance of our model. Similar trends are also observed for PER, OFI, and LOC, although their F1-score correlation is significantly positive ( ${r_{pearson}}$ = 0.8086, P-value < 0.05). In addition, a joint correlation analysis between the two sub-figures in Figure 7 illustrates that the W-micro score and the F1-score of OFI are significantly positively correlated ( ${r_{pearson}}$ = 0.9876, P-value < 0.05); thus, the distinctive F1-score fluctuation of SEN across CMAG annals is most likely driven by the OFI category, which has the smallest entity number.

4.4. Ablation studies

To demonstrate the effectiveness of our proposed novel blocks, we carry out some ablation experiments on the crucial neural blocks that previous NER works in Chinese history have ignored.

Comparative analysis on crucial neural blocks. Table 8 presents the ablation results: when SIL is removed from SEN, the whole F1-score drops dramatically (5.25%↓ for F1-micro and 5.89%↓ for F1-macro), compared to a smaller decline when AFL is removed (0.62%↓ for F1-micro and 1.07%↓ for F1-macro). If SIL and AFL are both eliminated from SEN, the performance drops further, to 88.59% for F1-micro and 82.82% for F1-macro. This proves that incorporating the subword-based integrative network and the attention fusion mechanism in a reasonable manner can improve a DL-based NER model. Moreover, the subword-based integrative information (i.e., SIL) plays the more important role in this improvement.

Table 8. The results of ablation analysis

Figure 8. The influence of cascaded level.

Impact of SIL. We conduct extra experimental analysis on the effectiveness of the proposed subword-based network and the BERT-based mechanism. The results of “SEN–SILBERT” and “SEN–SILSubword” in Table 7 indicate that neglecting either BERT’s character embedding or SEN’s subword-based embedding undermines the model performance. However, when SEN is used in conjunction with other traditional pretrained approaches (i.e., GloVe and Word2vec CBOW), there is no discernible improvement in performance. This demonstrates the benefits of BERT and the effectiveness of our proposed subword-based network in Chinese historical NER tasks.

Additionally, we investigate the impact of the cascaded levels in the subword-based network on our model’s F1-score. Figure 8 shows that as the cascaded level increases, each F1-score generally rises, though the change between the two-level and three-level cascades is not significant. This proves that the cascaded structural integration enhances the overall performance, but with diminishing marginal utility.

Table 9. The results of ablation analysis

Table 10. The description of random sentence examples

Figure 9. Visualization of attention weights. Note that the darker color means higher weights.

Table 11. The examples of NER on CMAG (green characters are correctly identified by a model, while red ones are erroneous.)

Impact of the AFL. To understand the contribution of our proposed attention fusion mechanism, we give a comparative analysis of two methods: one is our proposal, denoted as “Attention Fusion,” while the other is the normal multi-head attention network, denoted as “Normal Attention.” According to the evaluative results in Table 9, we find the contribution of Normal Attention is very small, since its performance is nearly equal to that of “SEN–AFL,” which excludes AFL, in Table 8. This may be attributed to the reuse of multi-head attention in BERT. Attention Fusion, by contrast, can still improve the ultimate performance in this scenario, showing the significance of weighted integration among different heads of attention subnetworks.

Furthermore, we visualize the attention weights in AFL for a deeper probe, comparing these two forms of attention. Considering three example sentences of different lengths in Table 10, we obtain a visual presentation of their attention weights. The results in Figure 9 indicate that, regardless of the sentence length, both methods capture long-range semantic relations within a sentence that the CNN-based SIL and the BiLSTM-based sequential representation layer cannot obtain. Compared to Normal Attention, Attention Fusion has a better capacity for the accurate and diverse representation of key information related to entities. For example, apart from “” (Peng) as part of a LOC entity for “Pengcheng County,” more meaningful characters are attended to only by Attention Fusion, such as “” (Jiang) and “” (Zhou) as part of a LOC entity for “Jiangdu” and a PER entity for “Duke of Zhou,” respectively. Also, Attention Fusion pays close attention to potential feature characters like “” (served_for), “” (warned), and “” (has_a_family_name), which are overlooked by Normal Attention. In fact, these are functional characters that are more likely to appear in the vicinity of an entity, which is quite beneficial to the generalization ability of a NER model. Besides, Normal Attention produces a more uniform weight distribution over longer sentences, which may lead to the failure of the attention mechanism; such weakness is not obvious in Attention Fusion.

4.5. Case studies

To show the advantageous performance of our model clearly, we conduct several case studies with two of the aforementioned advanced approaches, BERT and Stroke-ngram+BERT, and three versions of our model. From Table 11, we find that SEN has the most accurate entity recognition across all categories and annals. The following example comparison and error analysis provide some necessary explanations for this advantage.

On the one hand, SEN correctly identifies potential entities with lower frequency (e.g., the OFI entity “” in SuD and “” in TD) that BERT and Stroke-ngram+BERT erroneously ignore. For BERT, Stroke-ngram+BERT, and “SEN–SIL,” which does not incorporate subword-based integration, some labels are predicted incorrectly, probably due to the ambiguity of high-frequency non-entity tokens in particular sentence positions. For instance, according to the syntax of nominal coordinate structures, the character “” (meaning “always” as an adverb here) is easily misidentified as a PER entity, since it is followed by a conjunction “” (meaning “and”) and a person name “” in HD. But SEN successfully avoids this, possibly because “” has no obvious entity-related radical or structural information suggesting that it belongs to a person entity. This provides proof of the effectiveness of the SIL that we propose.

On the other hand, models without the attention fusion mechanism, such as BERT, Stroke-ngram+BERT, and “SEN–AFL,” have issues with missing entity labels. For example, they fail to recognize continuous low-frequency OFI entities like “” and “”, since high entity-label probabilities are assigned only to some high-frequency characters, yielding incomplete entities such as “” and “”. This indicates that the suggested attention fusion of SEN (i.e., AFL) has a superior ability to discover longer-range semantic associations among different Chinese characters than the normal multi-head attention network.

5. Conclusion

In this study, we propose SEN, a new NER model adopting a hybrid end-to-end neural architecture including an input layer, a SIL, a BiLSTM-based sequential layer, an AFL, and a CRF-based output layer. Except for a few common layers that other popular models have, SIL strengthens the character-level representation hierarchically by incorporating the knowledge of BERT, as well as the radical and character structural information, while AFL further captures the diverse global relation semantics among different characters through the fusion of attention networks.

To show the effectiveness of SEN, we first utilize the MARKUS tool to construct a standard NER corpus, CMAG, based on the great Chinese historical work Comprehensive Mirror for Aid in Government. During the generation of CMAG, a heuristic reduction algorithm called LEM is developed to solve the problem of nested entities. Moreover, we carry out several experiments on CMAG, which indicate that SEN outperforms other popular models in terms of F1-micro and F1-macro, including BERT-based and other subword-based models. Specifically, SEN performs best on all the given entity types, obtaining the largest performance improvement on OFI. This indicates that SEN is less sensitive than other DL-based models to the problem of imbalanced classes. Additionally, the correlation analysis shows that the performance of our model varies dramatically across different annals, which is mainly caused by the OFI category with the smallest entity number.

From the ablation studies, we learn that although both SIL and AFL contribute to the improvement of our model, SIL is more critical. Also, the effectiveness of the BERT-based and subword-based blocks in SIL has been proved, in which the cascaded structure of SIL is an important factor (i.e., the larger the cascade level, the better SEN performs). For AFL, we can clearly see the superiority of the proposed attention fusion in representing entity semantic information in a global way, compared to normal multi-head attention networks. The visualization of attention weights offers clear evidence and reliable interpretability for the accurate prediction ability of our attention fusion, which was not discussed in previous NER works on Chinese history.

To the best of our knowledge, this research is one of the earliest pioneering works about subword-enhancing entity extraction in the Chinese historical domain, which effectively integrates the subword information of Chinese characters and global optimization of the attention mechanism. It provides a promising NER solution with state-of-the-art performance, which can also be extended to other digital humanities domains.

Acknowledgments

This research is supported by China Postdoctoral Science Foundation (No. 2021M703564) and National Social Science Foundation of China (No. 18CTQ041).

References

Bingenheimer, M. (2015). The digital archive of Buddhist temple gazetteers and named entity recognition (NER) in classical Chinese. Lingua Sinica 1, 8.CrossRefGoogle Scholar
Bhojanapalli, S., Yun, C., Rawat, A.S., Reddi, S. and Kumar, S. (2020). Low-rank bottleneck in multi-head attention models. In Proceedings of the 37th International Conference on Machine Learning, (ICML), pp. 864873.Google Scholar
Botha, J. and Blunsom, P. (2014). Compositional morphology for word representations and language modelling. In: Proceedings of the 31st International Conference on International Conference on Machine Learning, (ICML), pp. 18991907.Google Scholar
Byrne, K. (2007). Nested named entity recognition in historical archive text. In International Conference on Semantic Computing. Washington D.C.: IEEE Computer Society, pp. 589596.CrossRefGoogle Scholar
Cao, S. and Lu, W. (2017). Improving word embeddings with convolutional feature learning and subword information. In Proceedings of the 31st AAAI Conference on Artificial Intelligence. Palo Alto, California: Association for the Advancement of Artificial Intelligence, pp. 31443151.CrossRefGoogle Scholar
Cao, S., Lu, W., Zhou, J. and Li, X. (2018). Cw2vec: Learning Chinese word embeddings with stroke n-gram information. In Proceedings of 32nd AAAI Conference on Artificial Intelligence. Palo Alto, California: Association for the Advancement of Artificial Intelligence, pp. 50535061.CrossRefGoogle Scholar
Chaudhary, A., Zhou, C., Levin, L., Neubig, G., Mortensen, D.R. and Carbonell, J.G. (2018). Adapting word embeddings to new languages with morphological and phonological subword representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA, USA: Association for Computational Linguistics, pp. 32853295.CrossRefGoogle Scholar
Chen, A., Peng, F., Shan, R. and Sun, G. (2006). Chinese named entity recognition with conditional probabilistic models. In Proceedings of the 5th SIGHAN Workshop on Chinese Language Processing. Stroudsburg, PA, USA: Association for Computational Linguistics, pp. 173176.Google Scholar
Chiu, J.-P. and Nichols, E. (2016). Named entity recognition with bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics 4, 357370.CrossRefGoogle Scholar
Cho, H.-C., Okazaki, N., Miwa, M. and Tsujii, J.I. (2013). Named entity recognition with multiple segment representations. Information Processing and Management 49, 954965.CrossRefGoogle Scholar
Dauphin, Y.N., Fan, A., Auli, M. and Grangier, D. (2017). Language modeling with gated convolutional networks. In Proceedings of the 34th International Conference on Machine Learning, (ICML), pp. 933941.Google Scholar
De Weerdt, H. (2020). Creating, linking, and analyzing Chinese and Korean datasets: Digital text annotation in MARKUS and COMPARATIVUS. Journal of Chinese History 4, 519527.CrossRefGoogle Scholar
Devlin, J., Chang, M.W., Lee, K. and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg, PA, USA: Association for Computational Linguistics, pp. 41714186.Google Scholar
E, S. and Xiang, Y. (2017). Chinese named entity recognition with character-word mixed embedding. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. New York: Association for Computing Machinery, pp. 20552058.CrossRefGoogle Scholar
Forney, G.D. (1973). The viterbi algorithm. Proceedings of the IEEE 61, 268278.CrossRefGoogle Scholar
Gong, C., Li, Z., Xia, Q., Chen, W. and Zhang, M. (2020). Hierarchical LSTM with char-subword-word tree-structure representation for Chinese named entity recognition. Science China Information Sciences 63, 115.CrossRefGoogle Scholar
Gui, T., Ma, R., Zhang, Q., Zhao, L., Jiang, Y.G. and Huang, X. (2019). CNN-based Chinese NER with lexicon rethinking. In Proceedings of the 28th International Joint Conference on Artificial Intelligence. Palo Alto, California: Association for the Advancement of Artificial Intelligence, pp. 49824988.CrossRefGoogle Scholar
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation 9, 17351780.CrossRefGoogle ScholarPubMed
Jia, Y. and Ma, X. (2019). Attention in character-based BiLSTM-CRF for Chinese named entity recognition. In Proceedings of the 4th International Conference on Mathematics and Artificial Intelligence. New York: Association for Computing Machinery, pp. 14.CrossRefGoogle Scholar
Ji, Z., Shen, Y., Sun, Y., Yu, T. and Wang, X. (2021). C-CLUE: A benchmark of classical Chinese based on a crowdsourcing system for knowledge graph construction. In China Conference on Knowledge Graph and Semantic Computing. Singapore: Springer, pp. 295301.CrossRefGoogle Scholar
Jin, Y., Xie, J., Guo, W., Luo, C., Wu, D. and Wang, R. (2019). LSTM-CRF neural network with gated self attention for Chinese NER. IEEE Access 7, 136694136703.CrossRefGoogle Scholar
Kingma, D.P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:14126980.Google Scholar
Lafferty, J.D., McCallum, A. and Pereira, F.C. (2001). Conditional Random Fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of 18th International Conference on Machine Learning, (ICML), pp. 282289.Google Scholar
Leong, K.S., Wong, F., Li, Y. and Dong, M.C. (2008). Chinese tagging based on maximum entropy model. In Proceedings of the 6th SIGHAN Workshop on Chinese Language Processing, pp. 138142.Google Scholar
Levow, G.A. (2006). The third international Chinese language processing bakeoff: Word segmentation and named entity recognition. In Proceedings of the 5th SIGHAN Workshop on Chinese Language Processing. Stroudsburg, PA, USA: Association for Computational Linguistics, pp. 108117.Google Scholar
Li, J., Sun, A., Han, J. and Li, C. (2020). A survey on deep learning for named entity recognition. IEEE Transactions on Knowledge and Data Engineering 345070.CrossRefGoogle Scholar
Li, L., Mao, T., Huang, D. and Yang, Y. (2006). Hybrid models for Chinese named entity recognition. In Proceedings of the 5th SIGHAN Workshop on Chinese Language Processing. Stroudsburg, PA, USA: Association for Computational Linguistics, pp. 72–78.
Li, P.-H., Fu, T.-J. and Ma, W.-Y. (2020). Why attention? Analyze BiLSTM deficiency and its remedies in the case of NER. In Proceedings of the 34th AAAI Conference on Artificial Intelligence. Palo Alto, California: Association for the Advancement of Artificial Intelligence, pp. 8236–8244.
Liu, C.-L., Huang, C.-K., Wang, H. and Bol, P.K. (2015). Mining local gazetteers of literary Chinese with CRF and pattern based methods for biographical information in Chinese history. In Proceedings of the 2015 IEEE International Conference on Big Data. Washington D.C.: IEEE Computer Society, pp. 1629–1638.
Long, Y., Xiong, D., Lu, Q., Li, M. and Huang, C.R. (2016). Named entity recognition for Chinese novels in the Ming-Qing dynasties. In Workshop on Chinese Lexical Semantics. Cham, Switzerland: Springer, pp. 362–375.
Luong, M.-T., Socher, R. and Manning, C.D. (2013). Better word representations with recursive neural networks for morphology. In Proceedings of the 17th Conference on Computational Natural Language Learning. Stroudsburg, PA, USA: Association for Computational Linguistics, pp. 104–113.
Ma, X. and Hovy, E. (2016). End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA, USA: Association for Computational Linguistics, pp. 1064–1074.
Meng, Y., Wu, W., Wang, F., Li, X., Nie, P., Yin, F., Li, M., Han, Q., Sun, X. and Li, J. (2019). Glyce: Glyph-vectors for Chinese character representations. In Proceedings of the 33rd Conference on Neural Information Processing Systems. NeurIPS, pp. 2746–2757.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S. and Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems. NeurIPS, pp. 3111–3119.
Peng, W., Cheng, H. and Chen, S.-P. (2018). From text to data: Extracting posting data from Chinese local gazetteers. In Proceedings of the 9th International Conference of Digital Archives and Digital Humanities. DADH, pp. 79–125.
Pennington, J., Socher, R. and Manning, C.D. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA, USA: Association for Computational Linguistics, pp. 1532–1543.
Sun, Y., Lin, L., Yang, N., Ji, Z. and Wang, X. (2014). Radical-enhanced Chinese character embedding. In Proceedings of the 21st International Conference on Neural Information Processing. Cham, Switzerland: Springer, pp. 279–286.
Tsai, R.T.-H., Wu, S.-H., Lee, C.-W., Shih, C.W. and Hsu, W.L. (2004). Mencius: A Chinese named entity recognizer using the maximum entropy-based hybrid model. International Journal of Computational Linguistics and Chinese Language Processing 9, 65–82.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł. and Polosukhin, I. (2017). Attention is all you need. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems. NeurIPS, pp. 5998–6008.
Watson, R.S. (1986). The named and the nameless: Gender and person in Chinese society. American Ethnologist 13, 619–631.
Wu, S., Song, X. and Feng, Z. (2021). MECT: Multi-metadata embedding based cross-transformer for Chinese named entity recognition. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing. Stroudsburg, PA, USA: Association for Computational Linguistics, pp. 1529–1539.
Wu, X., Zhao, H. and Che, C. (2018). Term translation extraction from historical classics using modern Chinese explanation. In International Symposium on Natural Language Processing Based on Naturally Annotated Big Data. Cham, Switzerland: Springer, pp. 88–98.
Xiong, D., Xu, J., Lu, Q. and Lo, F. (2014). Recognition and extraction of honorifics in Chinese diachronic corpora. In Proceedings of the 15th Workshop on Chinese Lexical Semantics. Cham, Switzerland: Springer, pp. 305–316.
Xu, C., Wang, F., Han, J. and Li, C. (2019). Exploiting multiple embeddings for Chinese named entity recognition. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. New York: Association for Computing Machinery, pp. 2269–2272.
Yan, C. and Wang, J. (2020). Exploiting hybrid subword information for Chinese historical named entity recognition. In Proceedings of the 2020 IEEE International Conference on Big Data. Washington D.C.: IEEE Computer Society, pp. 4795–4801.
Yang, F., Zhang, J., Liu, G., Zhou, J., Zhou, C. and Sun, H. (2018). Five-stroke based CNN-BiRNN-CRF network for Chinese named entity recognition. In Proceedings of the 7th CCF International Conference on Natural Language Processing and Chinese Computing. Cham, Switzerland: Springer, pp. 184–195.
Yu, P. and Wang, X. (2021). BERT-based named entity recognition in Chinese Twenty-Four Histories. In International Conference on Web Information Systems and Applications. Cham, Switzerland: Springer, pp. 289–301.
Zhang, Y., Liu, Y., Zhu, J., Zhen, Z., Liu, X., Wang, W., Chen, Z. and Zhai, S. (2019). Learning Chinese word embeddings from stroke, structure and Pinyin of characters. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. New York: Association for Computing Machinery, pp. 1011–1020.
Zhang, Y., Xu, Z. and Zhang, T. (2008). Fusion of multiple features for Chinese named entity recognition based on CRF model. In Asia Information Retrieval Symposium. Cham, Switzerland: Springer, pp. 95–106.
Zhou, Y., Huang, L., Guo, T., Hu, S. and Han, J. (2019). An attention-based model for joint extraction of entities and relations with implicit entity features. In Proceedings of the 2019 World Wide Web Conference. New York: Association for Computing Machinery, pp. 729–737.
Zhu, Y. and Wang, G. (2019). CAN-NER: Convolutional attention network for Chinese named entity recognition. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics. Stroudsburg, PA, USA: Association for Computational Linguistics, pp. 3384–3393.
Figure 1. Entity information in Sima Guang's incomplete manuscript of CMAG (owned by the National Library of China). Note that the official entity of "" (OFI) may have a semantic relation with the subword information in each character.
Algorithm 1. Longest Entity Match
Figure 2. The architecture of SEN.
Figure 3. Subword-based neural block.
Figure 4. Hybrid cascaded network.
Table 1. The information of annals in CMAG
Figure 5. The statistical results of SLDCC.
Figure 6. The statistical results of entities by annals.
Table 2. The detailed information about the three sets
Table 3. Comparison with CRF-D-M models
Table 4. Comparison with RNN-D-M models
Table 5. Comparison with S-M models
Table 6. Comparison with BERT-M models
Table 7. The F1-score on different entity categories
Figure 7. The F1-score of SEN on CMAG annals.
Table 8. The results of ablation analysis
Figure 8. The influence of the cascaded level.
Table 9. The results of ablation analysis
Table 10. The description of random sentence examples
Figure 9. Visualization of attention weights. Note that a darker color indicates a higher weight.
Table 11. The examples of NER on CMAG (green characters are correctly identified by a model, while red ones are erroneous).