1. Introduction
Chinese sentences consist of consecutive characters without delimiters between words. Therefore, Chinese word segmentation (CWS) is generally considered a fundamental and essential preliminary step for downstream Chinese natural language processing (NLP) tasks, such as information extraction, machine translation, and text classification.
Recently, neural network approaches, such as long short-term memory (LSTM) (Chen et al. Reference Chen, Qiu, Zhu, Liu and Huang2015) and Transformer (Tian et al. Reference Tian, Song, Xia, Zhang and Wang2020b; Huang et al. Reference Huang, Yu, Liu, Liu, Cao and Huang2021), have proven effective in CWS. Various methods have been proposed to integrate contextual information and external resources into neural networks, for example, Margatina et al. (Reference Margatina, Baziotis and Potamianos2019), Liu et al. (Reference Liu, Wu, Wu, Huang, Xie, Zhang, Ng, Zhao, Li and Zan2018), and Tian et al. (Reference Tian, Song, Ao, Xia, Quan, Zhang and Wang2020a) incorporated lexicon and syntax knowledge for CWS using attention mechanisms and a multi-task framework.
As is well known, dependency trees provide word-to-word dependency information, offering a way to alleviate segmentation ambiguity. For instance, consider the input sentence depicted in Fig. 1, which can be segmented as “ ” or “ ”. The dependency tree captures the syntactic relations in this sentence, revealing that “” depends on “”. This dependency relation helps mitigate ambiguity in the highlighted portion of the sentence. As noted by Huang and Xue (Reference Huang and Xue2012), besides segmentation ambiguity, the out-of-vocabulary (OOV) problem is another issue of concern in CWS. Many studies have used lexicons and n-gram lexicons as features to alleviate the OOV problem. While some studies have attempted to leverage both dependency tree information and lexicons (Tian et al. Reference Tian, Song, Ao, Xia, Quan, Zhang and Wang2020a), they often overlook the heterogeneous nature of these features and the distinctive structures inherent in dependency trees. This oversight may be attributed to the absence of direct and successful methods for fully integrating heterogeneous linguistic features (e.g., lexicon, n-gram, syntax) and their structural information into neural networks.
Therefore, in this paper, we aim to explore appropriate methods for integrating heterogeneous external knowledge and its structure into CWS. Taking inspiration from Nie et al. (Reference Nie, Zhang, Peng and Yang2022) and Sui et al. (Reference Sui, Chen, Liu, Zhao and Liu2019), who employed graph neural networks (GNNs) to encode heterogeneous features for Chinese named entity recognition (NER), we introduce a heterogeneous feature learning framework for CWS called HGNSeg. Specifically, we organize heterogeneous linguistic features (e.g., lexicon, n-gram lexicon, and dependency tree) and their structural information into a heterogeneous graph, in which nodes integrate different types of features by absorbing information from neighboring nodes through different types of edges. We represent all characters in the input sequence, the candidate words from the lexicon, and n-grams from the n-gram lexicon as nodes and then add different types of edges between nodes. Characters are connected to each other according to dependency relations, and characters are linked to words and n-grams according to their relative positions within these linguistic units.
To summarize, our contributions are as follows:
• We introduce a novel framework for CWS that utilizes graph convolutional networks (GCNs) to encode heterogeneous features.
• We employ GCNs to encode a variety of heterogeneous linguistic features, including the lexicon, the n-gram lexicon, and the dependency tree. Our design incorporates various types of nodes and edges to seamlessly integrate these features and their respective structural information.
• We evaluate HGNSeg on six benchmark CWS datasets and six cross-domain CWS datasets. Experimental results on the benchmark datasets show that HGNSeg improves CWS performance by integrating heterogeneous linguistic features. Experimental results on the cross-domain corpora also confirm that HGNSeg effectively alleviates the OOV issue in the cross-domain scenario.
2. Related work
2.1 Chinese word segmentation
CWS is a typical task in Chinese language processing. Various modeling approaches have been applied to it, beginning with early lexicon-based matching and statistical methods. For example, Huang (Reference Huang1995) and Sproat and Shih (Reference Sproat and Shih1990) used mutual information to measure the binding strength within a string. Subsequently, Xue (Reference Xue2003) first modeled CWS as a sequence labeling task, assigning each character in the input sentence one of four tags (LL, RR, MM, or LR) based on its position within a word. Huang et al. (Reference Huang, Šimon, Hsieh and Prévot2007) proposed to model CWS as a unary word boundary decision task, which simplifies the complexity of CWS and offers a promising solution for addressing issues related to domain adaptation and robustness. These methods perform well and have advantages in handling ambiguous segmentation and OOV problems, but they all require large-scale labeled datasets for support. Therefore, building on Huang’s work, Li et al. (Reference Li, Zhou and Huang2012) further proposed to use active learning to reduce the need for labeled data: by selecting the most informative boundaries for manual labeling and labeling less informative boundaries automatically, this method reduces the cost of manual annotation.
Recently, numerous neural network approaches have demonstrated remarkable success in CWS, eliminating the need for intricate feature engineering. For instance, Liu et al. (Reference Liu, Wu, Wu, Huang, Xie, Zhang, Ng, Zhao, Li and Zan2018) employed a convolutional neural network (CNN) to capture character features. LSTM networks, known for their capability to learn long-distance dependencies in sentences, were utilized by Chen et al. (Reference Chen, Qiu, Zhu, Liu and Huang2015), Zhang et al. (Reference Zhang, Zhang and Fu2016), Ma et al. (Reference Ma, Ganchev and Weiss2018), and Margatina et al. (Reference Margatina, Baziotis and Potamianos2019) to extract contextual features for CWS. The Transformer, built on the self-attention mechanism, surpasses LSTM in capturing long-distance dependencies and in parallel computing. Qiu et al. (Reference Qiu, Pei, Yan and Huang2020) employed the Transformer as an encoder to extract context-aware information for multi-criteria CWS. Notably, pretrained language models (PLMs) have recently shown exceptional performance in CWS, with Tian et al. (Reference Tian, Song, Xia, Zhang and Wang2020b) and Huang et al. (Reference Huang, Yu, Liu, Liu, Cao and Huang2021) leveraging BERT (Devlin et al. Reference Devlin, Chang, Lee and Toutanova2019), ZEN (Diao et al. Reference Diao, Bai, Song, Zhang and Wang2019), and RoBERTa (Liu et al. Reference Liu, Ott, Goyal, Du, Joshi, Chen, Levy, Lewis, Zettlemoyer and Stoyanov2019) to learn text representations for CWS without extensive feature engineering.
Most studies further enhance neural models by integrating external knowledge (e.g., lexicon, syntax) and learning contextual information (e.g., n-grams). CWS aims to find word boundaries in sentences, so the lexicon is an effective type of external resource for CWS. Margatina et al. (Reference Margatina, Baziotis and Potamianos2019), for example, introduced an attention mechanism on LSTM to integrate word-level information for CWS. For cross-domain CWS, challenges such as the scarcity of annotated data and the presence of OOV words are formidable. Ding et al. (Reference Ding, Long, Xu, Zhu, Xie, Wang and Zheng2020) addressed these issues by leveraging automatic CWS toolkits, mutual information methods, term frequency–inverse document frequency, and other techniques to automatically construct a lexicon for the target domain. This lexicon, along with the maximum matching algorithm, was used to segment target domain data and generate labeled data; adversarial training was then employed to minimize the dissimilarity between the source and target domain representations. In a different approach, Zhang et al. (Reference Zhang, Liu and Fu2018) integrated lexicon information by constructing feature vectors for each character with several predefined templates. Meanwhile, Liu et al. (Reference Liu, Fu, Zhang and Xiao2021) used external lexicon knowledge to enhance BERT through a lexicon adapter layer, demonstrating robust performance across CWS, part-of-speech (POS) tagging, and NER tasks.
N-grams represent a rich contextual resource for CWS (Kurita, Kawahara, and Kurohashi, Reference Kurita, Kawahara and Kurohashi2017; Shao et al. Reference Shao, Hardmeier, Tiedemann and Nivre2017; Tian et al. Reference Tian, Song, Ao, Xia, Quan, Zhang and Wang2020a). In a notable example, Tian et al. (Reference Tian, Song, Xia, Zhang and Wang2020b) integrated wordhood information into a neural segmenter using key-value networks, achieving state-of-the-art (SOTA) performance at the time. Furthermore, the outcomes of CWS are frequently leveraged to enhance syntax parsing (Shen et al. Reference Shen, Tan, Sordoni, Li, Zhou and Courville2022). Intuitively, incorporating syntax knowledge can also improve CWS models. Tian et al. (Reference Tian, Song, Ao, Xia, Quan, Zhang and Wang2020a), for instance, utilized syntax knowledge generated by existing NLP toolkits with an attention mechanism to enhance the joint CWS and POS tagging task.
While these methods yield satisfactory performance, they often overlook the heterogeneity of knowledge. Combining multiple types of knowledge requires distinct methodologies for each type. Furthermore, these approaches tend to neglect the structural information inherent in knowledge, such as the dependencies present in the structure of dependency trees.
2.2 Graph convolutional networks
GNNs have received significant attention as a novel class of neural networks (Defferrard, Bresson, and Vandergheynst, Reference Defferrard, Bresson and Vandergheynst2017; Kipf and Welling, Reference Kipf and Welling2017). These networks perform computations on graphs and can retain the global structural information of a graph in its node embeddings, making them particularly effective for graphs with rich relational structures (Yao, Mao, and Luo, Reference Yao, Mao and Luo2018).
GCNs, a subset of GNNs, operate similarly to CNNs, with the distinction that GCNs apply convolution operations on graphs (Kipf and Welling, Reference Kipf and Welling2017). In GCNs, each node in the graph updates its embedding by incorporating relevant information from its neighboring nodes. This mechanism has found success in various NLP tasks for encoding external knowledge with intricate structures. For example, Hu et al. (Reference Hu, Yang, Zhang, Zhong, Tang, Shi, Duan and Zhou2021) employed GCNs to integrate knowledge base information, enhancing fake news detection. In text classification tasks, Michael et al. (Reference Michael, Xavier and Pierre2016) demonstrated that GCNs outperform traditional CNN models. Additionally, GCNs have been utilized to encode syntax graphs, improving semantic role labeling models (Marcheggiani and Titov, Reference Marcheggiani and Titov2017), and to integrate syntax information into neural machine translation (NMT) models, enriching word-level representations (Bastings et al. Reference Bastings, Titov, Aziz, Marcheggiani and Simaan2017).
A heterogeneous graph encompasses nodes and edges of various types. For instance, Hu et al. (Reference Hu, Yang, Shi, Ji and Li2019) transformed relation structures and additional features (e.g., topics, entities) between short texts into heterogeneous graphs. They employed a graph attention network (GAT) to extract heterogeneous information, strengthening their text classification model. Similarly, Nie et al. (Reference Nie, Zhang, Peng and Yang2022) constructed multi-granularity lexicons as heterogeneous graphs, utilizing GCNs to encode them for NER. Intuitively, incorporating heterogeneous features into CWS using heterogeneous graphs seems promising. These works prove that GCNs are an effective method for integrating linguistic features and their structure information, as long as they can be represented as graphs (Bastings et al. Reference Bastings, Titov, Aziz, Marcheggiani and Simaan2017).
GCNs have been applied in the CWS task. Zhao et al. (Reference Zhao, Zhang, Liu and Fei2020) utilized GCNs to encode multi-granularity structural information, enhancing the performance of the joint CWS and POS tagging task. In the domain of electronic medical record text segmentation, Du et al. (Reference Du, Mi and Du2020) employed a GNN to learn local structural features based on domain lexicons. Additionally, Huang et al. (Reference Huang, Yu, Liu, Liu, Cao and Huang2021) incorporated lexicons into a PLM using GCNs, improving CWS performance and enhancing the robustness of cross-domain CWS. Notably, Yu et al. (Reference Yu, Huang, Wang and Huang2022) proposed a lexicon-augmented GCN for cross-domain CWS, encoding candidate words’ boundary information to enhance the CWS model.
However, existing CWS studies have not fully capitalized on the fact that GCNs can encode structural information. They have also overlooked structural features related to syntax and other linguistic elements. To harness the full potential of heterogeneous features for CWS, we propose converting linguistic features into heterogeneous graphs and utilizing GCNs for encoding.
3. Methodology
In this section, we introduce the workflow of HGNSeg, including the process of constructing the heterogeneous graph and how word segmentation is implemented with GCNs. Specifically, we first use the dataset and multiple features to build a heterogeneous graph for each sample and then initialize all nodes with different encoders. Next, we divide the heterogeneous graph into several sub-graphs according to the different edge types. The graph convolution operation is performed on each sub-graph separately, and the information from the sub-graphs is then aggregated; we delve into the details of this process in Section 3.2.2. Finally, we concatenate the outputs of the GCNs and the outputs of the pretrained model and feed them into the decoder to obtain the final label sequence. Fig. 1 shows the overall framework of HGNSeg.
CWS is generally regarded as a character-based sequence labeling task. For a given Chinese sentence $ X\ = \{x_1,\ x_2,\ldots, x_T\}$ , each character in the sequence is assigned a label from the label set $\ \mathscr{L}\ =\{B, M, E, S\}$ , where “B” indicates that the character is the beginning of a word, “M” the middle of a word, “E” the end of a word, and “S” a single-character word. The purpose of CWS is to find the optimal label sequence $Y^* = \{y_1^*, y_2^*, \ldots, y_{T}^*\}$ :
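A plausible form of this objective (reconstructed here, consistent with the reference to Eq. (1) in Section 3.3) is

$Y^{*} = \mathop{\arg\max}_{Y \in \mathscr{L}^{T}} P(Y \mid X), \quad (1)$

where $P(Y \mid X)$ is the conditional probability of the label sequence $Y$ given the input sentence $X$.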
3.1 Encoding layer
According to Ma et al. (Reference Ma, Ganchev and Weiss2018), the main factor responsible for performance disparities in CWS is insufficient training rather than weak model architectures. Therefore, we use BERT (Devlin et al. Reference Devlin, Chang, Lee and Toutanova2019) and RoBERTa (Liu et al. Reference Liu, Ott, Goyal, Du, Joshi, Chen, Levy, Lewis, Zettlemoyer and Stoyanov2019) as the encoders, which are pretrained on a huge amount of unlabeled Chinese data:
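A minimal sketch of the encoding step (the exact notation here is an assumption) is

$\textbf{e}_1, \textbf{e}_2, \ldots, \textbf{e}_T = \mathrm{Encoder}(x_1, x_2, \ldots, x_T),$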
where $\textbf{e}_t$ is the representation for $x_t$ from the encoder.
3.2 Heterogeneous graph convolutional network
3.2.1 Graph construction
Encouraged by previous studies using GCNs for CWS (Zhao et al. Reference Zhao, Zhang, Liu and Fei2020; Du et al. Reference Du, Mi and Du2020), we construct the graph by representing characters from the corpus, words from a lexicon $\mathscr{L}$ , and n-grams from an n-gram lexicon $\mathscr{N}$ as graph nodes. The method for constructing the lexicon and the n-gram lexicon is introduced in Section 3.4. Two types of edges are then defined according to different relations.
Syntax Edges. Marcheggiani and Titov (Reference Marcheggiani and Titov2017) introduced GCNs designed to operate on graphs with directional edges and labels, enabling the encoding of linguistic structures like dependency trees. Their approach incorporated edge gates, allowing the model to dynamically adjust the weight of each edge.
Inspired by their methodology, when encoding dependency trees we use the characters that form each word as nodes rather than the words themselves, so that dependency information enhances the character representations directly, instead of first forming word representations and then propagating them back to the characters. Word-level information is instead augmented by the subsequent character-word/n-gram graph.
Therefore, edges connecting characters represent syntax relations between each character pair. Following the principles outlined by Marcheggiani and Titov (Reference Marcheggiani and Titov2017), we maintain the head-dependent relationship in a similar manner to a dependency tree. Specifically, we preserve the directional information: the outgoing edge signifies the connection from head to dependent, while the incoming edge represents the connection from dependent to head.
We use an example to illustrate this idea for the input sentence in Fig. 1. “” is a head that points to the dependent “”; therefore, “” and “” have two outgoing edges to “” and “”, respectively, and “” and “” have two incoming edges from “” and “”, respectively. We only convert relations parsed by the toolkit; for example, there is no direct syntactic relation between “” and “”, so there is no edge between them in the graph.
Character-Word/N-gram Edges. The primary objective of CWS is to determine the optimal segmentation positions for the input sentence. However, each character $x_i$ may assume distinct roles within different words. For instance, $x_i$ might serve as the beginning, middle, or end of a word, or it may represent a single-character word. These diverse positions convey different contextual information. To effectively capture the local context surrounding $x_i$ and identify candidate words, the edges connecting $x_i$ to words/n-grams in the sub-graph signify whether $x_i$ marks the beginning or the end of words/n-grams. The word/n-gram refers to a span within the sentence that can be matched with entries in the lexicon $\mathscr{L}$ or the n-gram lexicon $\mathscr{N}$ . In this type of sub-graph, we establish connections between two adjacent characters to preserve the sequential order within the sequence.
As shown in Fig. 1, since “” serves as the end character of the word “” and the beginning character of the word “”, it is connected to both “” and “”. Both “” and “” are split from the input sentence and matched in the lexicon $\mathscr{L}$ . Additionally, since “” serves as the beginning character of the 4-gram “”, it is also connected to “” belonging to the n-gram lexicon $\mathscr{N}$ . Importantly, these two types of edges are unidirectional. When a span exists in both the lexicon and the n-gram lexicon, only one of them is retained in the graph.
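As an illustrative sketch (not the authors' implementation), the two sub-graphs could be assembled into adjacency matrices as follows; the node indexing scheme and the edge directions for the character-word/n-gram sub-graph are simplifying assumptions, and this sketch does not distinguish incoming from outgoing dependency edges:

```python
import torch

def build_subgraph_adjacency(num_chars, num_nodes, dep_arcs, spans):
    """Build adjacency matrices for the syntax and character-word/n-gram sub-graphs
    (illustrative sketch). Character nodes occupy ids 0..num_chars-1; word and
    n-gram nodes occupy the remaining ids."""
    a_dep = torch.zeros(num_nodes, num_nodes)   # syntax edges between characters
    a_lex = torch.zeros(num_nodes, num_nodes)   # character-word/n-gram edges
    for head, dep in dep_arcs:                  # character-level dependency arcs
        a_dep[head, dep] = 1.0                  # head -> dependent (outgoing)
        a_dep[dep, head] = 1.0                  # dependent -> head (incoming)
    for start, end, node in spans:              # matched word/n-gram spans
        a_lex[start, node] = 1.0                # beginning character -> word/n-gram node
        a_lex[end, node] = 1.0                  # end character -> word/n-gram node
    for i in range(num_chars - 1):
        a_lex[i, i + 1] = 1.0                   # keep the sequential order of characters
    return a_dep, a_lex
```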
3.2.2 Graph convolutional network
After building the heterogeneous graph, we convert the character nodes, the word nodes, and the n-gram nodes into embeddings. These embeddings are trainable and subject to refinement during the training process. Subsequently, we apply the graph convolutional operation on the two sub-graphs, followed by the aggregation of information to update the higher-order embedding of each node.
Given a graph $\mathscr{G} = (\mathscr{V},\mathscr{E}, A)$ , $\mathscr{V}$ and $\mathscr{E}$ denote the sets of nodes and edges in the heterogeneous graph, respectively, and $A$ is the adjacency matrix, where $A_{ij}=1$ if there is an edge between the $i^{th}$ node and the $j^{th}$ node. Let $H^0\in \mathbb{R}^{|\mathscr{V}| \times d}$ be the feature matrix of node embeddings, where each row $h_v \in \mathbb{R}^d$ is the embedding of node $v$ . We introduce self-connections and a degree matrix $D$ for the adjacency matrix $A$ as follows:
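A plausible reconstruction of this definition, following the standard GCN formulation, is

$A' = A + I_{|\mathscr{V}|},$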
where $D_{ii} = \sum _{j}A_{ij}'$ . Then the calculation rule for each layer is as follows:
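A plausible form of this rule, consistent with the symbols explained below, is

$H^{(l+1)} = \sigma\!\left(\tilde{A}\, H^{(l)} W^{(l)}\right), \qquad \tilde{A} = D^{-\frac{1}{2}} A' D^{-\frac{1}{2}},$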
where $\sigma (\cdot )$ represents an activation operation such as the rectified linear unit (ReLU) and $\tilde{A}$ denotes the symmetric normalized adjacency matrix. $H^{(l)} \in \mathbb{R}^{|\mathscr{V}| \times d}$ contains the node hidden states of the $l^{th}$ layer, and $W^{(l)}$ is a trainable parameter of the $l^{th}$ layer of the graph.
In our heterogeneous graph, there are different types of edges $\mathscr{T} = \{\tau _{dep.}, \tau _{lex.}\}$ , where $\tau _{dep.}$ denotes the syntax edges and $\tau _{lex.}$ represents the character-word/n-gram edges. We update $H^{(l+1)}$ , the node representations in the $(l+1)^{th}$ layer, by summing over both edge types $\tau$ the embeddings of each node's neighborhood, $H_{\tau }^{(l)}$ :
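A plausible form of this update, consistent with the symbols explained below, is

$H^{(l+1)} = \sigma\!\left(\sum_{\tau \in \mathscr{T}} \tilde{A}_{\tau}\, H_{\tau}^{(l)} W_{\tau}^{(l)}\right),$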
where $\sigma (\cdot )$ denotes a nonlinear activation function. $\tilde{A}_{\tau }$ represents a symmetric normalized adjacency sub-matrix that exclusively contains one type of edge, denoted as $\tau$ . $H_{\tau }^{(l)}$ denotes an embedding matrix of the adjacent nodes for each node connected by edge type $\tau$ . $W_{\tau }^{(l)}$ is a trainable weight matrix. Initially, $H_{\tau }^{(0)}$ corresponds to the node features obtained from the pretrained model or random initialization.
3.2.3 Edge-wise gating
As our method depends on automatic toolkits for parsing dependency trees, their performance may not be perfect. Consequently, absorbing equal information from all adjacent nodes may not be suitable, as downstream applications relying on potentially incorrect syntax edges can lead to error propagation. Moreover, each character receives information from multiple candidate words, yet not every candidate word represents a reasonable segmentation for the input sentence. Therefore, it becomes necessary to down-weight the corresponding nodes in the graph.
To address these challenges, drawing inspiration from recent endeavors (van den Oord et al. Reference Van den Oord, Kalchbrenner, Vinyals, Espeholt, Graves and Kavukcuoglu2016; Marcheggiani and Titov, Reference Marcheggiani and Titov2017; Dauphin et al. Reference Dauphin, Fan, Auli and Grangier2017), we introduce a scalar gate for each node pair, defined as follows:
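A plausible form of the gate for an edge of type $\tau$ from node $u$ to node $v$ , computed from the representation $h_{u}^{(l)}$ of the sending node, is

$g_{\tau }^{(l)}(u,v) = \theta\!\left(h_{u}^{(l)} W_{\tau }^{(l),g} + b_{\tau }^{(l),g}\right),$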
where $\theta$ represents the logistic sigmoid function, and $W_{\tau }^{(l),g} \in \mathbb{R}^{d}$ and $ b_{\tau }^{(l),g} \in \mathbb{R}$ are the weight and the bias of the gate, respectively. Therefore, the final calculation of the heterogeneous GCN is formulated as follows:
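A plausible form of this gated update, combining the gate with the aggregation above, is

$H^{(l+1)} = \sigma\!\left(\sum_{\tau \in \mathscr{T}} \tilde{A}_{\tau}\left(g_{\tau}^{(l)} \odot H_{\tau}^{(l)} W_{\tau}^{(l)}\right)\right),$

where $g_{\tau}^{(l)}$ stacks the gates of edge type $\tau$ and $\odot$ denotes element-wise multiplication.

As a minimal PyTorch-style sketch (not the authors' implementation), the gated, type-wise aggregation could look as follows; the class name, the per-source-node scalar gate, and the shapes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class GatedHeteroGCNLayer(nn.Module):
    """One heterogeneous GCN layer with edge-wise gating (illustrative sketch)."""
    def __init__(self, dim, edge_types=("dep", "lex")):
        super().__init__()
        self.edge_types = edge_types
        self.weight = nn.ModuleDict({t: nn.Linear(dim, dim, bias=False) for t in edge_types})
        # the scalar gate per node pair is approximated by a scalar gate per source node
        self.gate = nn.ModuleDict({t: nn.Linear(dim, 1) for t in edge_types})

    def forward(self, h, adj):
        # h: (num_nodes, dim) node embeddings; adj[t]: (num_nodes, num_nodes) normalized
        # adjacency matrix of edge type t (syntax or character-word/n-gram edges)
        out = torch.zeros_like(h)
        for t in self.edge_types:
            msg = self.weight[t](h)                 # W_tau^(l) applied to neighbor embeddings
            g = torch.sigmoid(self.gate[t](h))      # gate computed from the sending node
            out = out + adj[t] @ (g * msg)          # aggregate gated messages over neighbors
        return torch.relu(out)
```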
3.3 Decoding layer
Various decoding algorithms can be implemented, including the softmax layer and conditional random fields (CRF) (Lafferty, McCallum, and Pereira, Reference Lafferty, McCallum and Pereira2001). As reported by Tian et al. (Reference Tian, Song, Xia, Zhang and Wang2020b), CRF demonstrates superior performance compared to softmax in CWS tasks. Consequently, in our framework, we choose CRF as the decoder.
In the CRF decoding layer, $P(Y \mid X )$ in Eq. (1) could be represented as:
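A plausible form of this probability, treating $\emptyset(Y \mid X)$ as an unnormalized score, is

$P(Y \mid X) = \frac{\emptyset(Y \mid X)}{\sum_{Y' \in \mathscr{L}^{T}} \emptyset(Y' \mid X)},$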
where $\ \emptyset (Y \mid X)$ is the potential function. The CRF models the relation between two consecutive labels, $y_{i-1}$ and $y_{i}$ :
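A common formulation of this potential (the exact form here is an assumption) is

$\emptyset (Y \mid X) = \prod_{i=1}^{T}\exp\!\left(s(X,i)_{y_i} + b_{y_{i-1} y_{i}}\right),$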
where $b_{y_{i-1} y_{i}} \in \mathbb{\textbf{R}}$ is the trainable bias term corresponding to the label pair $(y_{i-1},\ y_{i})$ . Finally, the score for each label of the $i_{th}$ character is calculated by the score function $s(X,i)\in \mathbb{R}^{ \mid \mathscr{L}\mid }$ :
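A plausible form of the score function, consistent with the parameters defined below, is

$s(X, i) = W_s^{\top} a_i + b_s,$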
where $a_i = e_i \bigoplus h_i$ , which is generated by concatenating the outputs of the encoder $\textit{E}$ and the outputs of the graph network $H^{(l+1)}$ . $W_s\in \mathbb{R}^{d_a\times L}$ and $b_s\in \mathbb{R}^{ \mid \mathscr{L}\mid }$ are trainable parameters.
3.4 Lexicon and N-gram lexicon construction
To construct the character-word/n-gram sub-graph, we first need to build the word lexicon $\mathscr{L}$ and the n-gram lexicon $\mathscr{N}$ , since the edges between characters and words/n-grams are built on them. In our work, $\mathscr{L}$ consists of words from the training set. For $\mathscr{N}$ , we use accessor variety (AV) to extract high-frequency n-grams from each corpus. Accessor variety is an unsupervised method proposed by Feng et al. (Reference Feng, Chen, Deng and Zheng2004) for extracting words from a Chinese text corpus. In their approach, both the characters directly preceding a string (leading characters) and those immediately following it (trailing characters) are treated as crucial factors in determining the independence of the string. Through experiments on various corpora, they confirmed that the method based on accessor variety and adhesive characters performs word extraction efficiently. This method has since been frequently employed for the extraction of n-grams. The accessor variety of a multi-character string $s$ is defined as follows:
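A plausible form of the definition, following Feng et al. (Reference Feng, Chen, Deng and Zheng2004), is

$AV(s) = \min\left\{L_{av}(s),\ R_{av}(s)\right\},$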
where $L_{av}(s)$ is called the left AV, representing the number of distinct character types that can precede the n-gram $s$ (predecessors), and $R_{av}(s)$ denotes the number of distinct character types that can follow the n-gram $s$ (successors). When the AV value of a string is not less than a predefined threshold, the string appears in enough different contexts to be considered a meaningful unit.
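As an illustrative sketch (not the authors' code), AV-based n-gram extraction could be implemented as follows; the boundary marker and the threshold value are assumptions:

```python
from collections import defaultdict

def accessor_variety_ngrams(corpus, n=2, threshold=3):
    """Extract n-grams whose accessor variety (AV) reaches a threshold (illustrative sketch)."""
    left, right = defaultdict(set), defaultdict(set)
    for sent in corpus:                        # corpus: iterable of raw character strings
        padded = "#" + sent + "#"              # treat sentence boundaries as accessors
        for i in range(1, len(padded) - n):
            gram = padded[i:i + n]
            left[gram].add(padded[i - 1])      # distinct predecessor character types
            right[gram].add(padded[i + n])     # distinct successor character types
    # AV(s) = min(L_av(s), R_av(s)); keep n-grams whose AV reaches the threshold
    return {g for g in left if min(len(left[g]), len(right[g])) >= threshold}
```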
4. Experiment
4.1 Datasets and experimental setup
We conduct experiments on six CWS benchmarks to assess the effectiveness of the proposed model. These datasets include CITYU, PKU, MSR, and AS from SIGHAN 2005 (Emerson, Reference Emerson2005) and CTB and SXU from SIGHAN 2008 (Jin and Chen, Reference Jin and Chen2008). The statistics of the six datasets are listed in Table 1. For SIGHAN 2005, we generate a development set for model tuning by randomly selecting 10% of the data from the training set. The AS and CITYU data are in traditional Chinese, so we convert them to simplified Chinese.
We also evaluate the proposed model on six cross-domain corpora: two Chinese fantasy novel datasets, DL (DouLuoDaLu) and ZX (ZhuXian) (Qiu and Zhang, Reference Qiu and Zhang2015); a medicine dataset, DM (dermatology); a patent dataset, PT (Ye et al. Reference Ye, Zhang, Li, Qiu and Sun2019); and a literature dataset (Lit) and a computer dataset (Com) from SIGHAN 2010. The statistics of the six datasets are listed in Table 2. We use the PKU training set, a news corpus, as the source domain training set for the cross-domain experiments; its text type differs from those of the target datasets. For PT, DL, and DM, we also use the PKU development set as the development set.
Following the previous paper (Huang et al. Reference Huang, Huang, Liu and Mo2020), we replace all Latin letters, punctuation, and digits with a unique token. In our experiments, we use the development set to select the best model.
For the encoders, we follow the default settings of BERT (Devlin et al. Reference Devlin, Chang, Lee and Toutanova2019) and RoBERTa (Liu et al. Reference Liu, Ott, Goyal, Du, Joshi, Chen, Levy, Lewis, Zettlemoyer and Stoyanov2019). For the GCNs, we represent all character nodes with the PLM, and we randomly initialize the word nodes and n-gram nodes. To obtain the dependency trees mentioned above, we use two automatic toolkits, the Stanford CoreNLP Toolkit (SCT) (Manning et al. Reference Manning, Surdeanu, Bauer, Finkel, Bethard and McClosky2014) and LTP 4.0 (Che et al. Reference Che, Feng, Qin and Liu2021). Stanford CoreNLP is an NLP toolkit developed by Stanford University. Implemented in Java, it offers fundamental language analysis functions, including tokenization, POS tagging, NER, dependency parsing, sentiment analysis, and coreference resolution, and currently supports nine languages, such as English, Chinese, and Italian. LTP 4.0 is an open-source neural language technology platform supporting six essential Chinese NLP tasks. These tasks encompass lexical analysis, including CWS, POS tagging, and NER, as well as syntactic analysis involving dependency parsing. Moreover, LTP 4.0 extends to semantic analysis, covering semantic dependency analysis and semantic annotation. For the convolutional operation, we use two-layer GCNs. The key hyperparameter settings are listed in Table 3.
4.2 Results on benchmark datasets
Table 4 presents the experimental results on the six benchmark corpora, where the reported F1 scores and $R_{oov}$ values are averages over three experimental runs. The results are provided using different encoders, and comparisons are made with some previous SOTA works, including both single-criterion models (Zhang and Fu, Reference Zhang and Fu2016; Tian et al. Reference Tian, Song, Xia, Zhang and Wang2020b) and multi-criteria models (Chen et al. Reference Chen, Shi, Qiu and Huang2017; He et al. Reference He, Wu, Yan, Gao, Feng and Townsend2018; Qiu et al. Reference Qiu, Pei, Yan and Huang2020; Huang et al. Reference Huang, Huang, Liu and Mo2020). The evaluated models and approaches are as follows:
• Three existing CWS tools: LTP 4.0, Stanford CoreNLP, and THULAC (Li and Sun, Reference Li and Sun2009). LTP 4.0 utilized a PLM (ELECTRA) (Clark et al. Reference Clark, Luong, Le and Manning2020) as an encoder, trained on the People’s Daily corpus. Stanford CoreNLP employed CRF as the CWS model and was trained on the PKU dataset. THULAC is an efficient Chinese lexical analysis tool, with its CWS model trained on the People’s Daily corpus.
• Zhang and Fu (Reference Zhang and Fu2016) incorporated word embeddings into a recurrent neural network to improve CWS performance.
• Chen et al. (Reference Chen, Shi, Qiu and Huang2017) combined adversarial learning and Bi-LSTM to learn common knowledge for multi-criteria CWS.
• He et al. (Reference He, Wu, Yan, Gao, Feng and Townsend2018) used two special tokens at the beginning and end of the input sentence to mark the target criterion and then employed LSTM + CRF as the shared encoder for multi-criteria CWS.
• Qiu et al. (Reference Qiu, Pei, Yan and Huang2020) employed the Transformer as the encoder to extract criteria-aware contextual information for multi-criteria CWS.
• Huang et al. (Reference Huang, Huang, Liu and Mo2020) retained the special markers and replaced the LSTM + CRF backbone model used by He et al. (Reference He, Wu, Yan, Gao, Feng and Townsend2018) with RoBERTa + CRF for multi-criteria CWS.
• Tian et al. (Reference Tian, Song, Xia, Zhang and Wang2020b) incorporated wordhood information into the neural segmenter using key-value networks.
• Huang et al. (Reference Huang, Yu, Liu, Liu, Cao and Huang2021) incorporated a lexicon into the PLM with GCNs to improve the performance of CWS and strengthen the robustness of cross-domain CWS.
• Liu et al. (Reference Liu, Fu, Zhang and Xiao2021) used an external lexicon to enhance BERT through a lexicon adapter layer.
Firstly, comparing our model with existing CWS tools, it is evident that although we utilized LTP 4.0 and Stanford CoreNLP as dependency syntax analysis tools in our experiments, these off-the-shelf tools prove less effective for CWS because they were not trained on the corresponding datasets. The large gap between the experimental results of these tools and ours suggests that our improvements do not depend on additional CWS tools.
Subsequently, we validate the effectiveness of HGNSeg by comparing results across different encoders with and without the HGN module. As depicted in Table 4, models incorporating the HGN module consistently outperform their baseline counterparts without it in terms of F1 and $R_{oov}$ across the six datasets. Additionally, we observe a noteworthy improvement in the average $R_{oov}$ for RoBERTa-CRF with HGN compared to RoBERTa-CRF, with nearly a 1% enhancement on almost all datasets. These findings indicate that the HGN module contributes to enhancing both segmentation performance and $R_{oov}$ .
The improvement in F1 value achieved by pretrained encoders is consistently notable across different models. When compared with models (Zhang and Fu, Reference Zhang and Fu2016; Chen et al. Reference Chen, Shi, Qiu and Huang2017) utilizing Bi-LSTM as the encoder, our models show an approximate 2% increase in F1 value, and $R_{oov}$ sees an improvement of approximately 10% when RoBERTa or BERT is employed as the encoder. This improvement can be attributed to the external knowledge provided by the pretraining process in the case of CWS.
In comparison to multi-criteria scenarios implemented in models (Chen et al. Reference Chen, Shi, Qiu and Huang2017; He et al. Reference He, Wu, Yan, Gao, Feng and Townsend2018; Qiu et al. Reference Qiu, Pei, Yan and Huang2020; Huang et al. Reference Huang, Huang, Liu and Mo2020), despite these models being trained with more extensive datasets, our model consistently outperforms them in terms of both F1 scores and $R_{oov}$ . In summary, our model effectively mitigates the OOV issue through the integration of heterogeneous linguistic information with HGN.
Observing the results, it is evident that the approaches presented by Huang et al. (Reference Huang, Yu, Liu, Liu, Cao and Huang2021) and Liu et al. (Reference Liu, Fu, Zhang and Xiao2021) surpass our model on PKU and MSR datasets. A potential explanation for this could be that these models leverage external lexicons, such as the Jieba lexicon. This suggests that the model’s performance improves when incorporating a more extensive lexicon. As part of future work, we plan to augment the lexicon by integrating additional external lexicon sources.
4.3 Cross-domain performance
To evaluate the cross-domain ability of HGNSeg, we conduct a comparative analysis with previous cross-domain models, including:
• Three existing word segmentation tools: LTP 4.0, Stanford CoreNLP, and THULAC (Li and Sun, Reference Li and Sun2009).
• The work of Liu et al. (Reference Liu, Zhang, Che, Liu and Wu2014), which trained a CRF on both fully and partially annotated data to enhance cross-domain CWS.
• The approach of Zhang et al. (Reference Zhang, Liu and Fu2018), which integrated domain-specific dictionaries to address the OOV issue in cross-domain CWS.
• The model proposed by Ye et al. (Reference Ye, Zhang, Li, Qiu and Sun2019), which trained word embeddings on the target domain corpus, leading to performance improvements in cross-domain CWS.
• The strategy presented by Ding et al. (Reference Ding, Long, Xu, Zhu, Xie, Wang and Zheng2020), which constructed a target domain lexicon, used it to segment the target domain data to obtain labeled data, and subsequently employed adversarial training to minimize the dissimilarity between the source and target domain representations.
• RoBERTa + CRF and the classical sequence labeling model RoBERTa + Bi-LSTM + CRF, which we also use as baselines.
The main results are reported in Table 5. We use words from the PKU training set and n-grams from each cross-domain corpus to construct the character-word/n-gram sub-graphs.
Analyzing the results in Table 5, our model compares favorably to the other baseline models, achieving optimal results on most datasets.
However, the performance of our model is less satisfactory on the two fantasy novel datasets and the literature dataset. The method proposed by Ding et al. (Reference Ding, Long, Xu, Zhu, Xie, Wang and Zheng2020) demonstrates superior results on these novel datasets, suggesting that the lexicon specific to the target domain is a key factor in effective cross-domain CWS. Another factor contributing to this discrepancy could be the substantial differences between novel and news text. Novels and news articles often have different writing styles, vocabulary, and structures, which could impact the performance of algorithms or models trained on one type of text when applied to the other.
Specifically, compared to the four previous works, our model exhibits an average 5% improvement in F1 values for DM and PT. Additionally, the inclusion of HGN in our model significantly enhances $R_{oov}$ on all six datasets compared to using RoBERTa alone, with improvements of 1.49% (DM), 2.96% (PT), 1.7% (DL), 2.71% (ZX), 5.62% (Com), and 0.64% (Lit), respectively.
As we know, the OOV problem is the main challenge in cross-domain CWS. Segmenters trained on the newswire domain are limited in their ability to segment domain-specific words from other domains. We also investigate the influence of the domain-specific n-gram lexicon size on $R_{oov}$ . We randomly select 20%, 40%, 60%, and 80% of the n-grams from each target corpus to construct new n-gram lexicons of varying sizes. As illustrated in Fig. 2, the $R_{oov}$ values of HGNSeg consistently increase as the lexicon expands. Therefore, we can reasonably infer that HGNSeg is likely to achieve better performance when utilizing a lexicon containing a more extensive set of domain-specific n-grams.
In Fig. 2, the most substantial improvement in $R_{oov}$ is observed for PT. Upon analyzing cases in the PT test set, we found that our model can recognize some domain-specific entity words. As illustrated in Table 6, for the first sentence, the model segments “” as two words when not using n-grams from PT. In contrast, the model correctly segments this word when utilizing the n-gram lexicon from PT. Similarly, for the second sentence, “” represents a Chinese herbal medicine, constituting a domain-specific word in PT. The model, equipped with the n-gram lexicon, correctly recognizes this word. Several other examples, such as “”, “”, and “”, further support the effectiveness of domain-specific n-grams in improving $R_{oov}$ in the cross-domain scenario.
4.4 Ablation study
This section discusses the validity of each element in the heterogeneous GCNs through ablation experiments, and the results are presented in Table 7.
The first ablation experiment confirms the effectiveness of the character-word/n-gram sub-graph. The comparison between the first and the third lines in Table 7 shows an average improvement of about 1% in $R_{oov}$ values on the six datasets, with PKU and MSR being more sensitive to this type of sub-graph, exhibiting improvements of 3% and 1.62%, respectively.
The second ablation study evaluates the influence of the dependency tree sub-graph. In this experiment, the performance of HGNSeg is tested by removing the dependency tree sub-graph from the heterogeneous graph. As observed in the second line of Table 7, although this type of sub-graph also contributes to improvements in overall F1 and $R_{oov}$ values, it appears that the character-word/n-gram sub-graph has a more significant impact on the majority of datasets.
4.5 Effect of different lexicons
In this section, we also analyze the impact of different lexicons in the character-word/n-gram sub-graph, specifically, the use of lexicons $\mathscr{L}$ and $\mathscr{N}$ . The experimental results with different lexicons are presented in Table 8, indicating varying degrees of impact on different datasets. Notably, for the PKU dataset, $\mathscr{L}$ appears to be more important, whereas for the AS dataset, $\mathscr{N}$ is deemed more crucial. In general, both the lexicon from the training set and the n-gram lexicon have a significant positive impact, considerably boosting the performance.
4.6 Effect of different syntax parsing toolkits
The preceding analysis shows that our model effectively incorporates syntax information. However, existing syntax parsing toolkits are not flawless, and parsing mistakes are noticeable, particularly on long sentences. To compare the effects of the two dependency parsing toolkits, we present histograms of the F1 and $R_{oov}$ values obtained by HGNSeg with different parsing toolkits on the six datasets (yellow bars for SCT, green bars for LTP 4.0) in Fig. 3.
As illustrated in Fig. 3a and b, LTP 4.0 proves to be more suitable for HGNSeg, and the models using LTP 4.0 exhibit superior performance on all six datasets compared to those using SCT. Despite SCT providing rich dependency syntax labeling information, our model only utilizes the head and dependent information. Consequently, the dependency syntax parsing results from LTP 4.0 suffice to meet the needs of the CWS task in our framework. In future work, we will explore the integration of additional dependency labeling features from SCT to further enhance CWS.
As depicted in Fig. 3a, there is a substantial disparity in the F1 values on the AS dataset when using LTP 4.0 and SCT. To further investigate, we analyze specific examples from the AS test set, as presented in Table 9.
For the first sentence, “ (Items include train doors opening, platform parking offside, etc.)”, although “” could be considered a valid segmentation, segmenting this sequence into two words, “” and “”, aligns more closely with the word segmentation criteria of the AS dataset. The dependency tree parsing results in Fig. 4 highlight the evident difference between LTP 4.0 and SCT, leading to distinct segmentation outcomes. This observation reinforces the effectiveness of our model in incorporating dependency tree parsing results.
In the fourth example, a person’s name is treated as a single word in AS, whereas the model with SCT segments it into two words.
The model utilizing SCT often produces noticeable errors. For instance, in the third sentence in Fig. 4, it segments “” as “”, while the correct gold segmentation is “”. Further analysis reveals that some of these errors stem from incorrect dependency parsing results, as observed in the second and fifth sentences. Hence, the accuracy of the syntax parsing toolkit significantly impacts our model’s performance.
In Fig. 3b, the $R_{oov}$ values across all datasets using LTP 4.0 are higher compared to those using SCT. Notably, there is a substantial gap in the $R_{oov}$ values between the two syntax parsing toolkits, particularly evident in the CITYU dataset. Examining Table 10, it becomes apparent that the model with LTP 4.0 can recognize certain OOV words, such as “”, “”, and “”.
In summary, across the six benchmark datasets, models employing LTP 4.0 exhibit superior performance in terms of both F1 and $R_{oov}$ values compared to those utilizing SCT.
4.7 Statistical significance tests
In this subsection, we aim to assess the statistical significance of the improvements presented in this paper by conducting F-score tests on different models.
Following Wang et al. (Reference Wang, Zong and Su2010), we use the bootstrapping method proposed by Zhang et al. (Reference Zhang, Vogel and Waibel2004), which operates as follows.
Given a test set $T_0$ with $N$ test examples, we repeatedly sample from $T_0$ to create a new test set $T_1$ with $N$ examples, and repeat the process $M-1$ times, finally obtaining the test sets $\{T_0, T_1,\ldots, T_M\}$ . In our test procedure, $M$ is set to 1000.
Suppose system $A$ scores $a_0$ on $T_0$ and system $B$ scores $b_0$ ; the discrepancy between system $A$ and system $B$ is denoted as $\delta _0 = a_0 - b_0$ . Repeating this test on each test set yields the discrepancy scores $\{\delta _0, \delta _1,\ldots, \delta _M\}$ . Following Zhang et al. (Reference Zhang, Vogel and Waibel2004), we measure the 95% confidence interval of the discrepancy (i.e., the interval between the 2.5th and the 97.5th percentiles) between the two models. If the confidence interval does not contain zero, it can be asserted that the difference between systems A and B is statistically significant (Zhang et al. Reference Zhang, Vogel and Waibel2004).
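As an illustrative sketch (not the authors' script), the paired bootstrap described above could be implemented as follows; the per-sentence count representation is an assumption:

```python
import random

def bootstrap_f1_interval(counts_a, counts_b, m=1000, seed=0):
    """Paired bootstrap test: resample the test set m times and return the 95% confidence
    interval of the F1 discrepancy between systems A and B (illustrative sketch).
    counts_*: one (correct, predicted, gold) word-count triple per test sentence."""
    def f1(counts, idx):
        tp = sum(counts[i][0] for i in idx)
        pred = sum(counts[i][1] for i in idx)
        gold = sum(counts[i][2] for i in idx)
        p, r = tp / max(pred, 1), tp / max(gold, 1)
        return 2 * p * r / max(p + r, 1e-12)

    rng = random.Random(seed)
    n = len(counts_a)
    deltas = sorted(
        f1(counts_a, idx) - f1(counts_b, idx)
        for idx in ([rng.randrange(n) for _ in range(n)] for _ in range(m))
    )
    # if this interval (2.5th to 97.5th percentile) excludes zero, the difference is significant
    return deltas[int(0.025 * m)], deltas[int(0.975 * m)]
```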
Table 11 shows an example of the significance test between our system and the system of Liu et al. (Reference Liu, Fu, Zhang and Xiao2021), where “>” means that our system is significantly better than the system of Liu et al. (Reference Liu, Fu, Zhang and Xiao2021), demonstrating a steady improvement on both datasets.
5. Conclusion
In this paper, we introduce a novel framework for CWS named HGNSeg, focusing on the integration of heterogeneous features through GCNs. The diverse features, including words, n-grams, and dependency trees, along with their structural information, are effectively encoded using GCNs. Experimental results conducted on six benchmark datasets illustrate the efficacy and robustness of HGNSeg. Ablation experiments emphasize the model’s capability to seamlessly integrate syntax and lexicon features. The cross-domain experiments reveal that HGNSeg contributes to mitigating the OOV challenge. Additionally, a detailed analysis of various lexicons and syntax parsing toolkits suggests that a larger domain-specific n-gram lexicon and a superior parsing toolkit can significantly enhance the performance of HGNSeg.
In our proposed framework, HGNSeg did not leverage the label information of the dependency tree, which has the potential to provide additional valuable dependency information. Our future work will explore incorporating label information into the CWS framework to further enhance performance. It is worth noting that the HGNSeg framework is not limited to CWS; its applicability extends to other Chinese sequence labeling tasks, including NER and POS tagging.
Acknowledgments
This research is supported by the NSFC project “The Construction of the Knowledge Graph for the History of Chinese Confucianism” (Grant No. 72010107003).
Competing interests
All authors disclosed no relevant relationships.
Author contributions
Xuemei Tang: Conceptualization of this study, Methodology, Experiments, Writing, Original draft preparation.
Qi Su: Investigation, Data Curation, Supervision, Writing – Review and Editing.
Jun Wang: Methodology, Supervision, Project administration, Funding acquisition, Writing – Review and Editing.
Ethics statement
The datasets used in this paper are open datasets and do not involve any ethical issues.