1. Introduction
Metonymy is a type of figurative language that is pervasive in literature and in our daily conversation. It refers to an entity by the name of another entity closely associated with it (Lakoff and Johnson Reference Lakoff and Johnson1980; Lakoff Reference Lakoff1987; Fass Reference Fass1988; Lakoff Reference Lakoff1991, Reference Lakoff1993; Pustejovsky Reference Pustejovsky1991). For example, the following two text snippets show the same word used literally and metonymically:
(1) President Vladimir Putin arrived in Spain for a two-day visit.
(2) Spain recaptured the city in 1732.
In the first sentence, the word ‘Spain’ refers to the geographical location, a country located in extreme southwestern Europe. In the second sentence, however, the meaning of the same word is shifted to a non-literal denotation: ‘Spain’ is a metonym for ‘the Spanish Army’ rather than the literal reading as a location name.
In natural language processing (NLP), metonymy resolution (MR) (Markert and Nissim Reference Markert and Nissim2002; Nissim and Markert Reference Nissim and Markert2003; Nastase and Strube Reference Nastase and Strube2009; Gritta et al. Reference Gritta, Pilehvar, Limsopatham and Collier2017; Li et al. Reference Li, Vasardani, Tomko and Baldwin2020) is the task of resolving metonymy for named entities. Given a target word in an input sentence, typically a location or organisation name, MR attempts to decide whether that word is used metonymically or literally. MR has been shown to be potentially helpful for various NLP applications such as machine translation (Kamei and Wakao Reference Kamei and Wakao1992), relation extraction (RE) (Chan and Roth Reference Chan and Roth2011) and geographical parsing (Monteiro, Davis, and Fonseca Reference Monteiro, Davis and Fonseca2016; Gritta et al. Reference Gritta, Pilehvar, Limsopatham and Collier2017; Li et al. Reference Li, Vasardani, Tomko and Baldwin2020). While other types of metonymy exist, in this paper we are only interested in a specific type of conventional (regular) metonymy, namely location metonymy. The task of location metonymy resolution (Markert and Nissim Reference Markert and Nissim2002; Gritta et al. Reference Gritta, Pilehvar, Limsopatham and Collier2017; Li et al. Reference Li, Vasardani, Tomko and Baldwin2020) consists of classifying a location name within a given sentence into the metonymic or literal class.
Although many named entity recognition (NER) systems and word sense disambiguation (WSD) systems exist, these systems generally do not explicitly handle metonymies. NER systems only identify entity names in a sentence; they are not able to recognise whether a name is used metonymically. Existing WSD systems only determine which fixed ‘sense’ (interpretation) of a word is activated from a closed set of interpretations, whereas metonymy interpretation is an open problem: such systems cannot infer a metonymic reading of a word that falls outside the dictionary. Lakoff and Johnson (Reference Lakoff and Johnson1980) and Fass (Reference Fass1988) found that metonymic expressions mainly fall into several fixed patterns, most of which are quite regular. Therefore, recent methods for MR are mainly structured into two phases (Markert and Nissim Reference Markert and Nissim2002): metonymy detection Footnote a and metonymy interpretation (Nissim and Markert Reference Nissim and Markert2003). Metonymy detection first distinguishes between metonymic and literal usage of entity names. Metonymy interpretation then determines which fine-grained metonymic pattern is involved, such as place-for-people or place-for-event. The difference between metonymy detection and metonymy interpretation can be seen as that between a coarse-grained (binary, metonymic or literal) and a fine-grained (a particular type of metonymic expression) classification (Mathews and Strube Reference Mathews and Strube2020).
In computational linguistics, conventional feature-based methods for location MR (Nissim and Markert Reference Nissim and Markert2003; Farkas et al. Reference Farkas, Simon, Szarvas and Varga2007; Markert and Nissim Reference Markert and Nissim2007, Reference Markert and Nissim2009; Brun, Ehrmann, and Jacquet Reference Brun, Ehrmann and Jacquet2007; Nastase and Strube Reference Nastase and Strube2009; Nastase et al. Reference Nastase, Judea, Markert and Strube2012) rely heavily on handcrafted features derived from either linguistic resources or off-the-shelf taggers and dependency parsers. These methods struggle with data sparsity and heavy feature engineering. Later, deep neural network (DNN) models (Mikolov et al. Reference Mikolov, Sutskever, Chen, Corrado and Dean2013; Gritta et al. Reference Gritta, Pilehvar, Limsopatham and Collier2017; Mathews and Strube Reference Mathews and Strube2020) became mainstream for handling various NLP tasks, including MR. These models perform better because they take more contextual information into account. Although DNN models provide a giant leap forward compared to feature-based methods, training high-performance DNN models requires large-scale, high-quality datasets. However, existing datasets for MR are rather small because collecting and annotating data is expensive. This situation raises a need to transfer knowledge from existing large-scale datasets. Recently, pre-trained language models (PLMs), especially BERT (Devlin et al. Reference Devlin, Chang, Lee and Toutanova2019), have shown superior performance on various NLP downstream applications (Sun, Huang, and Qiu Reference Sun, Huang and Qiu2019; Qu et al. Reference Qu, Yang, Qiu, Croft, Zhang and Iyyer2019; Lin et al. Reference Lin, Miller, Dligach, Bethard and Savova2019b). The main advantage of PLMs is that they do not need to be trained from scratch: when applying a PLM to a specific dataset, only some additional fine-tuning is required, which is much cheaper. Benefiting from being pre-trained on large-scale data with efficient self-supervised learning objectives, PLMs can effectively capture the syntax and semantics in text (Tang et al. Reference Tang, Muller, Gonzales and Sennrich2018; Jawahar, Sagot, and Seddah Reference Jawahar, Sagot and Seddah2019). Therefore, it is natural to adopt BERT to generate entity representations for MR tasks.
However, directly adopting BERT for MR tasks might encounter problems. While BERT has a strong advantage in modelling lexical semantics and generates informative token embeddings, it has difficulty fully modelling complete syntactic structures, as it might need deeper layers to capture long-distance dependencies (Tang et al. Reference Tang, Muller, Gonzales and Sennrich2018; Zhang, Qi, and Manning Reference Zhang, Qi and Manning2018; Jawahar et al. Reference Jawahar, Sagot and Seddah2019). Given the sentence ‘He later went to manage Malaysia for one year’, BERT tends to focus more on the former verb ‘went’ and ignore the latter verb ‘manage’, which might lead to an incorrect prediction of the MR label for ‘Malaysia’.Footnote b As shown in Figure 1, dependency parse trees convey rich structural information that might help to recognise the metonymic usage. Therefore, syntactic knowledge is necessary for improving BERT-based MR models.
Previous studies (Nissim and Markert Reference Nissim and Markert2003; Nastase and Strube Reference Nastase and Strube2009; Nastase et al. Reference Nastase, Judea, Markert and Strube2012) suggested that syntax is a strong hint in constructing metonymy routes: lexical semantics and syntactic structure (specifically, dependency relations) jointly assist in recognising novel readings of a word. In a metonymic sentence, the conventional usage of the target entity is deliberately violated in order to introduce a novel metonymic reading, which has traditionally been treated as a syntactico-semantic violation (Hobbs and Martin Reference Hobbs and Martin1987; Pustejovsky Reference Pustejovsky1991; Chan and Roth Reference Chan and Roth2011). Generally, an entity is an argument of at least one predicate, and there exist explicit syntactic restrictions between the entity and the predicate. In other words, the inference of a metonymic reading primarily relies on the selectional preferences of verbs (Fass Reference Fass1988). As shown in Figure 1, ‘Malaysia’ refers to the national football team of Malaysia. The verbs and the dependency arcs among them (coloured in a dark colour) are a strong clue to that metonymy, while other words (coloured in grey) contribute less. This motivated us to explore an interesting question: can jointly leveraging lexical semantics and syntactic information bring benefits to MR?
As part of ongoing interest in introducing prior syntactic knowledge into DNNs and PLMs, this paper investigates different ways to incorporate hard and soft syntactic constraints into BERT-based location MR models, following the idea that syntactic information complements lexical semantics for MR. Firstly, we employ an entity-aware BERT encoder to obtain entity representations. To force the model to focus on the target entity for prediction, we leverage explicit entity location information by inserting special entity markers before and after the target entity in the input sentence. Then, to take advantage of relevant dependencies and eliminate the noise of irrelevant chunks, we adopt two kinds of graph convolutional networks to impose hard and soft syntactic constraints on BERT representations in appropriate ways. Finally, the model selectively aggregates syntactic and semantic features that are helpful for MR inference. As a result, the proposed approach achieves state-of-the-art (SOTA) performance on several MR benchmark datasets. To the best of our knowledge, this work is the first attempt to integrate syntactic knowledge and contextualised embeddings (BERT) for MR in an end-to-end deep learning framework.
2. Background: Metonymy resolution
Previous research in cognitive linguistics (Fundel, Küffner, and Zimmer Reference Fundel, Küffner and Zimmer2007; Janda Reference Janda2011; Pinango et al. Reference Pinango, Zhang, Foster-Hanson, Negishi, Lacadie and Constable2017) revealed that metonymic expressions are based on actual, well-established transfer relations between the source entity and the target referent, while those relations are not expected to be lexically encoded. Metonymy consists of a natural transfer of the meaning of concepts (Kvecses and Radden Reference Kvecses and Radden1998), which evokes in the reader or listener a deep ‘contiguity’ process. For instance, in a restaurant one waitress says to another, ‘Table 4 asked for more beer’, which involves a lesser-conventionalised circumstantial metonymy.Footnote c A large amount of knowledge is necessary to interpret this kind of metonymy (Zarcone, Utt, and Padó Reference Zarcone, Utt and Padó2012): Table 4 cannot ask for beer, but the guests occupying Table 4 can. In contrast to circumstantial metonymy, systematic metonymy (also called conventional metonymy) is more regular (Nissim and Markert Reference Nissim and Markert2003), for example, producer-for-product, place-for-event and place-for-inhabitant. Such reference shifts occur systematically with a wide variety of location names. Just as ‘Malaysia’ refers to the national football team in Figure 1, the name of a location often refers to one of its national sports teams. It is straightforward to apply supervised learning approaches to resolve systematic metonymies by distinguishing between literal readings and a pre-defined set of metonymic patterns (e.g., place-for-event, place-for-people and place-for-product). Since the metonymic patterns place-for-event and place-for-product are rather rare in natural language, the majority of MR work focuses on the prediction of place-for-people.
Empirical methods for MR mainly fall into feature-based and neural-based approaches. Most feature-based methods (Brun et al. Reference Brun, Ehrmann and Jacquet2007; Nastase and Strube Reference Nastase and Strube2009; Nastase et al. Reference Nastase, Judea, Markert and Strube2012), based on statistical models, are commonly evaluated on the SemEval 2007 Shared Task 8 benchmark (Markert and Nissim Reference Markert and Nissim2007). Markert and Nissim (Reference Markert and Nissim2002) were the first to treat MR as a classification task. Nissim and Markert (Reference Nissim and Markert2003) extracted more syntactic features and showed that syntactic head-modifier relations were important clues for metonymy recognition. Brun et al. (Reference Brun, Ehrmann and Jacquet2007) presented a hybrid system combining a symbolic and an unsupervised distributional approach to MR, relying on syntactic relations extracted by syntactic parsers. Farkas et al. (Reference Farkas, Simon, Szarvas and Varga2007) applied a maximum entropy classifier to the extracted feature set to improve the MR system. Nastase and Strube (Reference Nastase and Strube2009) expanded the feature set used in Markert and Nissim (Reference Markert and Nissim2007) with more sophisticated features derived from WordNet 3.0 and WikiNet (Wikipedia’s category network). Nastase et al. (Reference Nastase, Judea, Markert and Strube2012) explored the use of local and global contexts for MR in a probabilistic framework. Nastase and Strube (Reference Nastase and Strube2013) used a support vector machine with a large-scale knowledge base built from Wikipedia. These feature-based methods suffer from error propagation due to their heavy dependence on the extraction of handcrafted features. Furthermore, constructing the feature set requires external NLP tools and extra pre-processing costs.
Recently, the majority of MR models have incorporated DNNs. Gritta et al. (Reference Gritta, Pilehvar, Limsopatham and Collier2017) first applied a BiLSTM neural network, called PreWin, to extract useful features from the predicate window. During encoding, PreWin retains only the words, and corresponding dependency labels, within the predicate window to eliminate noise in the context. Rather than using BERT as a classifier after fine-tuning, Mathews and Strube (Reference Mathews and Strube2020) proposed to leverage BERT merely as an encoder to initialise word embeddings; they then fed the BERT embeddings into the PreWin system to perform MR.
PLMs have shown great success in many NLP tasks. Those models can produce context-aware and tokenwise pre-trained representations on a large-scale unlabelled dataset. They are fine-tuned on a downstream task and do not need to learn parameters from scratch. These models, such as ELMo (Peters et al. Reference Peters, Ammar, Bhagavatula and Power2017; Peters et al. Reference Peters, Neumann, Iyyer, Gardner, Clark, Lee and Zettlemoyer2018) and BERT (Devlin et al. Reference Devlin, Chang, Lee and Toutanova2019), considerably surpass competitive neural models in many NLP tasks (Socher et al. Reference Socher, Perelygin, Wu, Chuang, Manning, Ng and Potts2013; Rajpurkar et al. Reference Rajpurkar, Zhang, Lopyrev and Liang2016). PLMs obtained new SOTA results on language understanding tasks and are outstanding in capturing contextual or structural features (Glavaš and Vulić Reference Glavaš and Vulić2021). Intuitively, introducing pre-trained models to MR tasks is a natural step. Li et al. (Reference Li, Vasardani, Tomko and Baldwin2020) first attempted to use the BERT framework for MR tasks directly. Given the vast range of entities in the world, it is impossible to learn all entity mentions. To address data sparsity and force the model to make predictions based only on context, Li et al. (Reference Li, Vasardani, Tomko and Baldwin2020) proposed a word masking approach based on BERT by replacing all target entity names with an [X] token during training and inference. The masking approach substantially outperformed existing methods over a broad range of datasets.
Despite their successes, these studies did not investigate the role of syntax and how syntax affects MR. However, identifying the metonymic usage of an entity should rely on both the entity and the syntax. This issue motivated us to concentrate on modelling dependency associations among words, which may be helpful for MR, in order to enrich BERT representations.
3. Related work
Since entity names are often used in a metonymic manner, MR has a strong connection with other NLP tasks such as WSD and RE. These tasks share similar pre-processing techniques and neural network architectures for utilising syntactic information, and integrating dependency relations with DNN models has shown promising results across various NLP tasks (Joshi and Penstein-Rosé Reference Joshi and Penstein-Rosé2009; Li et al. Reference Li, Wei, Tan, Tang and Ke2014; Peng et al. Reference Peng, Poon, Quirk, Toutanova and Yih2017; Zhang et al. Reference Zhang, Qi and Manning2018; Fu, Li, and Ma Reference Fu, Li and Ma2019). However, the effect of dependency integration on neural MR models is still not well understood, and limited progress has been made so far.
With recent advances in RE (Zhang et al. Reference Zhang, Qi and Manning2018; Wu and He Reference Wu and He2019; Guo, Zhang, and Lu Reference Guo, Zhang and Lu2019), we investigate the use of dependency integration for MR. Our first concern is the integration approach: whether directly concatenating dependency embeddings with token embeddings or imposing dependency relations using a graph model is more appropriate. Extensive work has discussed this issue, and most of it treated dependency relations as features. For example, Kambhatla (Reference Kambhatla2004) trained a statistical classifier for RE by combining various lexical, syntactic and semantic features derived from the text in an early data pre-processing stage. Zhang, Zhang, and Su (Reference Zhang, Zhang and Su2006) studied embedding syntactic structure features from a parse tree to help RE. However, such models are sensitive to linguistic variations, which has hindered the wider application of this style of dependency integration.
Recent research employs graph-based models to integrate DNNs and dependency parse trees. A variety of hard pruning strategies relying on pre-defined rules have been proposed to distil dependency information and improve the performance of RE. For example, Xu et al. (Reference Xu, Mou, Li, Chen, Peng and Jin2015) used the shortest dependency path between the entities in the entire tree. Liu et al. (Reference Liu, Wei, Li, Ji, Zhou and Wang2015) modelled the shortest dependency path between the target entities with a recursive neural network and the subtrees attached to the shortest path with a convolutional neural network. To leverage hierarchy information in dependency parse trees, Miwa and Bansal (Reference Miwa and Bansal2016) performed bottom-up and top-down computations along the parse tree or the subtree below the lowest common ancestor (LCA) of the entities. Zhang et al. (Reference Zhang, Qi and Manning2018) pruned all words except the immediate ones around the shortest path, given that those words might hold vital information hinting at the relation between the two target entities, and applied a graph convolutional network (GCN) to model the resulting dependency tree structures. Although these hard pruning methods remove irrelevant relations efficiently, some useful information may also be eliminated. To resolve this conflict, Guo et al. (Reference Guo, Zhang and Lu2019) proposed a soft pruning method called AGGCN (attention-guided graph convolutional network), a model that pools information over dependency trees using GCNs. It transforms original dependency trees into fully connected edge-weighted graphs, weighting dependency relations instead of including or excluding them outright. Note that these dependency-guided approaches, such as Zhang et al. (Reference Zhang, Qi and Manning2018) and Guo et al. (Reference Guo, Zhang and Lu2019), target the RE task. To the best of our knowledge, we are the first to incorporate syntactic constraints into BERT-based models for MR.
4. Proposal
The task addressed in this paper is MR. Given an entity name E within a sentence S, MR predicts whether E has a metonymic or a literal usage. The critical insight of this paper is that incorporating syntactic constraints may help BERT-based MR. As shown in Figure 1, the closest governing verb in the dependency parse tree plays a dominant role in resolving metonymies. Therefore, we consider both lexical semantics and syntactic structure essential for identifying metonymies.
Figure 2 illustrates the overall architecture of the proposed model. We propose an end-to-end neural approach for MR and train the model based on recent advances in PLMs. Since BERT has shown superior performance on various NLP tasks, we employ BERT as an input encoder, passing the input sentences through it to produce tokenwise semantic representations. To enrich these tokenwise representations with syntactic knowledge from dependency parse trees, we propose two ways to incorporate syntactic constraints using different types of GCNs, namely a non-attentive GCN and an attentive GCN (AGCN). We first perform dependency parsing on the input sentences to extract the corresponding dependency parse trees and then convert those parse trees into dependency adjacency matrices. Then, we use the GCN to encode the dependency adjacency matrices explicitly. However, vanilla GCNs represent the adjacency edges among nodes using hard 0 and 1 labels. To learn these weights, following Guo et al. (Reference Guo, Zhang and Lu2019), we apply the self-attention mechanism (Vaswani et al. Reference Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser and Polosukhin2017) on top of the GCNs to tune the weights. As a result, the final representations combine rich syntactic knowledge with lexical semantics to make predictions.
4.1 Entity-aware encoding with BERT
BERT, consisting of multi-layer bidirectional transformers, is designed to produce deep bidirectional representations. The BERT encoding layer uses a multi-layer transformer encoder to generate a sentence-level representation and fine-tuned contextual representations for each token. We omit a detailed architectural description of BERT and only introduce the main part of the entity-aware BERT encoder. Concretely, we pack the input as $[CLS, S_t, SEP]$ , where [CLS] is a special token for classification, $S_t$ is the token sequence of S generated by the WordPiece tokenizer and [SEP] is the token indicating the end of the sentence. Our model takes the packed sentence as input and computes context-aware representations. Following Wu and He (Reference Wu and He2019), who enrich BERT with entity information for relation classification, we insert special [ENT] markers before and after the entity nominal. This simple approach lets BERT easily locate the target entity position. For each initial representation $h^0_x$ at index x, we combine the token embedding with the corresponding positional and segment embeddings.
After going through N successive transformer encoder blocks, the encoder yields the entity-aware BERT representation $h_x^N$ at the x-th position.
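To make the entity-aware encoding concrete, the following is a minimal sketch of how an input sentence could be packed with [ENT] markers and encoded with BERT, assuming the HuggingFace transformers library; the model name, variable names and marker-insertion logic are illustrative assumptions rather than the exact implementation used in this work.

```python
# Minimal sketch of entity-aware BERT encoding (illustrative; not the authors' exact code).
import torch
from transformers import BertTokenizerFast, BertModel

tokenizer = BertTokenizerFast.from_pretrained("bert-large-cased")
model = BertModel.from_pretrained("bert-large-cased")

# Register the entity marker as a special token and resize the embedding matrix.
tokenizer.add_special_tokens({"additional_special_tokens": ["[ENT]"]})
model.resize_token_embeddings(len(tokenizer))

sentence = "He later went to manage Malaysia for one year"
entity = "Malaysia"

# Insert [ENT] markers before and after the target entity span.
start = sentence.index(entity)
marked = f"{sentence[:start]}[ENT] {entity} [ENT]{sentence[start + len(entity):]}"

# [CLS] and [SEP] are added automatically; the packed input is [CLS, S_t, SEP].
inputs = tokenizer(marked, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

token_reprs = outputs.last_hidden_state  # (1, seq_len, hidden): entity-aware token representations h^N
cls_repr = token_reprs[:, 0]             # representation of [CLS], used later by the classifier
```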
4.2 Alignment
BERT applies the WordPiece tokenizer (a particular type of subword tokenizer) to further segment words into word pieces, for example, from ‘played’ to [‘play’, ‘##ed’]. However, dependency parsing operates on words and does not perform such further segmentation. Thus, we need to align BERT’s tokens with the input words and restore word representations by applying an average pooling operation to BERT’s token representations. Assuming $h_x,\dots,h_y$ are the BERT representations of the tokens of a word (x and y are the start and end indices of its token sequence), we obtain the embedding $\tilde{h}_i$ of the i-th word byFootnote d $\tilde{h}_i = \frac{1}{y-x+1}\sum_{j=x}^{y} h_j$ .
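As an illustration, the alignment step could be implemented as below, assuming a fast tokenizer that exposes a subword-to-word mapping (e.g., word_ids() in HuggingFace tokenizers); the helper name and tensor shapes are assumptions.

```python
import torch

def align_subwords_to_words(token_reprs, word_ids):
    """Average-pool BERT subword representations back to word level (illustrative sketch).

    token_reprs: (seq_len, hidden) tensor of BERT outputs for one sentence.
    word_ids:    list of length seq_len mapping each subword position to its word index
                 (None for special tokens such as [CLS]/[SEP]), e.g. from
                 BertTokenizerFast(...)(...).word_ids().
    """
    num_words = max(i for i in word_ids if i is not None) + 1
    word_reprs = torch.zeros(num_words, token_reprs.size(-1))
    counts = torch.zeros(num_words)
    for pos, widx in enumerate(word_ids):
        if widx is None:                 # skip [CLS], [SEP] and padding
            continue
        word_reprs[widx] += token_reprs[pos]
        counts[widx] += 1
    return word_reprs / counts.unsqueeze(-1)   # \tilde{h}_i = mean of h_x .. h_y
```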
4.3 Syntactic integration
Our model uses both GCN and AGCN layers. The non-attentive GCNs are placed before the AGCNs to impose the dependency graphs explicitly; the AGCNs then learn soft weights for the edges of the dependency graph. The syntactic integration layer thus enriches the final representations with dependency information, making them both context- and syntax-aware.
Although the GCN layer and the AGCN layer are similar in architecture, the main difference is how the attention matrix $\mathbf{A}$ is initialised: the GCN layer directly uses the dependency adjacency matrix A, whereas the AGCN layer computes multi-head self-attention scores from $\tilde{H}$ , that is, $\mathbf{A}=A$ for the GCN and $\mathbf{A}=\varphi(\tilde{H})$ for the AGCN.
Here, $\varphi$ is a soft attention function, such as additive (Bahdanau, Cho, and Bengio Reference Bahdanau, Cho and Bengio2015), general dot-product (Luong, Pham, and Manning Reference Luong, Pham and Manning2015) or scaled dot-product (Vaswani et al. Reference Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser and Polosukhin2017) attention. The syntactic integration layer therefore comprises both attentive and non-attentive modules. We use the scaled dot-product attention in our model for efficiency.
4.3.1 Dependency-guided layer
Many existing methods have adopted hard dependency relations (i.e., 1 or 0, indicating whether an association exists or not) to impose syntactic constraints. These methods require handcrafted rules based on expert experience. Moreover, hard rules assign zero weight (no attention) to dependency relations considered irrelevant, which may yield biased representations, especially for sparser dependency graphs.
We adapt the graph convolution operation to model dependency trees by converting each tree into its corresponding adjacency matrix A. In particular, $A_{ij}=A_{ji}=1$ if there exists a dependency edge between word i and word j; otherwise, $A_{ij}=A_{ji}=0$ . Empirically, adding a self-loop to each node is also necessary. Then, in a multi-layer GCN, the node (word) representation $\tilde{h}_i^{(l)}$ at layer l is produced by applying a graph convolution operation to the representations at layer $l-1$ : $\tilde{h}_i^{(l)} = \rho\big(\sum_{j} A_{ij} W^{(l)} \tilde{h}_j^{(l-1)} + b^{(l)}\big)$ ,
where $W^{(l)}$ represents the weight matrix, $b^{(l)}$ denotes the bias vector and $\rho$ is an activation function. $\tilde{h}^{(l-1)}$ and $\tilde{h}^{(l)}$ are the hidden states in the prior and current layers, respectively. Each node gathers and aggregates information from its neighbouring nodes during graph convolution.
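A minimal sketch of such a hard-constraint graph convolution layer is shown below; the self-loop handling and degree normalisation are common choices and are assumptions here, not necessarily identical to the configuration used in our experiments.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One hard-constraint GCN layer over a dependency adjacency matrix (illustrative sketch)."""

    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.linear = nn.Linear(dim_in, dim_out)   # W^{(l)} and b^{(l)}

    def forward(self, h, adj):
        # h:   (batch, n_words, dim_in)  word representations \tilde{h}^{(l-1)}
        # adj: (batch, n_words, n_words) symmetric 0/1 dependency adjacency matrix
        adj = adj + torch.eye(adj.size(-1), device=adj.device)   # add self-loops
        degree = adj.sum(dim=-1, keepdim=True).clamp(min=1.0)    # degree normalisation (an assumption)
        agg = torch.bmm(adj, h) / degree                         # gather information from neighbours
        return torch.relu(self.linear(agg))                      # rho(A W h + b)
```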
4.3.2 Attention-guided layer
Incorporating dependency information involves more than just imposing dependency edges: how to use relevant information while ignoring irrelevant information from the dependency trees remains a problem. Hard pruning methods (Zhang et al. Reference Zhang, Qi and Manning2018) are likely to prune away one of the two verbs in a long sentence, causing information loss. Guo et al. (Reference Guo, Zhang and Lu2019) proposed adopting the multi-head self-attention mechanism (Vaswani et al. Reference Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser and Polosukhin2017) as a soft pruning strategy to extract relevant features from dependency graphs, introducing the attention-guided GCN (called AGCN here) as an alternative to previous GCNs (Kipf and Welling Reference Kipf and Welling2017). Rather than hard-pruning the graph into a smaller or simpler structure, AGCN relies on a large, fully connected graph and reallocates the importance of each dependency relation. This soft pruning strategy, which distributes a weight to each word, can partly avoid the information loss above.
Generally, the AGCN layer models the dependency tree with a soft attention matrix $\tilde{A}$ , in which each cell weight ranges from 0 to 1. For convenience, the shape of $\tilde{A}$ is the same as that of the original dependency adjacency matrix A. We compute the attention weights in $\tilde{A}$ using multi-head attention (Vaswani et al. Reference Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser and Polosukhin2017). For the k-th head, $\tilde{A}^{(k)}$ is computed as $\tilde{A}^{(k)} = \mathrm{softmax}\big(\frac{Q W^Q_i \times (K W^K_i)^{\top}}{\sqrt{d}}\big)$ ,
where Q and K are the query and the key in multi-head attention, respectively, and both equal the input representation $\tilde{H}$ (i.e., the output of the previous module), d denotes the dimension of $\tilde{H}$ , $W^Q_i$ and $W^K_i$ are learnable parameters $\in \mathbb{R}^{d\times d}$ , and $\tilde{A}^{(k)}$ is the attention-guided adjacency matrix corresponding to the k-th head. Thus, we can replace the hard matrix A in the previous equation with the soft attention matrix $\tilde{A}^{(k)}$ . The multi-head mechanism allows dependency relations, especially indirect, multi-hop ones, to be modelled.
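The soft adjacency computation can be sketched as follows, assuming scaled dot-product attention with one pair of projection matrices per head; the class and variable names are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

class AttentionGuidedAdjacency(nn.Module):
    """Compute N soft attention-guided adjacency matrices \\tilde{A}^{(k)} (illustrative sketch)."""

    def __init__(self, dim, num_heads):
        super().__init__()
        self.dim = dim
        self.num_heads = num_heads
        self.w_q = nn.ModuleList([nn.Linear(dim, dim, bias=False) for _ in range(num_heads)])  # W^Q_i
        self.w_k = nn.ModuleList([nn.Linear(dim, dim, bias=False) for _ in range(num_heads)])  # W^K_i

    def forward(self, h):
        # h: (batch, n_words, dim), the output of the previous module (Q = K = \tilde{H})
        adjs = []
        for k in range(self.num_heads):
            q, key = self.w_q[k](h), self.w_k[k](h)
            scores = torch.bmm(q, key.transpose(1, 2)) / math.sqrt(self.dim)   # scaled dot-product
            adjs.append(torch.softmax(scores, dim=-1))                         # weights in [0, 1]
        return adjs   # list of N matrices, one per head
```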
4.3.3 Densely connected structure
Previous work (Huang et al. Reference Huang, Liu, Van Der Maaten and Weinberger2017) has shown that dense connections across GCN layers help to capture structural information. Deploying the densely connected structure allows the model to learn more non-local information and to train a deeper GCN. In our model, each densely connected structure has L sublayers placed in sequence. Each sublayer takes the outputs of all preceding sublayers as input, as shown in Figure 3. We define the output of the densely connected structure as $g^{(l)}_j = [x_j; \tilde{h}_j^{(1)}; \dots; \tilde{h}_j^{(l-1)}]$ ,
where $x_j$ is the initial representation output by the alignment layer and $\tilde{h}_j^{(1)},\dots,\tilde{h}_j^{(l-1)}$ are the representations produced by the preceding sublayers. In addition, the dimension of the representations in these sublayers shrinks to improve parameter efficiency, that is, $d_{hidden}=d/L$ , where L is the number of sublayers and d is the input dimension; with three sublayers and an input dimension of 768, $d_{hidden}=768/3=256$ . The block outputs a fresh representation of dimension 768 ( $=256 \times 3$ ) by concatenating all the sublayer outputs. Thus, the layer conserves considerable information at a low computational cost and helps the weight gradually flow to the determining tokens. N densely connected layers are constructed to process the N adjacency matrices produced by the attention-guided layer, where N denotes the number of heads. The GCN computation for each sublayer is modified to accommodate the multi-head attention as $\tilde{h}_j^{(l)} = \rho\big(\sum_{i} \tilde{A}^{(k)}_{ji} W_k^{(l)} g_i^{(l)} + b_k^{(l)}\big)$ ,
where k indexes the head, and $W_k^{(l)}$ and $b_k^{(l)}$ are the learnable weights and bias associated with the attention-guided adjacency matrix $\tilde{A}^{(k)}$ .
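The following sketch illustrates one densely connected block for a single head under the assumptions above (ReLU activation, d divisible by L); it shows the dense connectivity pattern rather than the exact implementation.

```python
import torch
import torch.nn as nn

class DenseGCNBlock(nn.Module):
    """Densely connected GCN sublayers for one attention head (illustrative sketch)."""

    def __init__(self, dim, num_sublayers):
        super().__init__()
        self.sub_dim = dim // num_sublayers            # d_hidden = d / L, e.g. 768 / 3 = 256
        self.layers = nn.ModuleList([
            nn.Linear(dim + l * self.sub_dim, self.sub_dim)   # W_k^{(l)}, b_k^{(l)}
            for l in range(num_sublayers)
        ])

    def forward(self, x, soft_adj):
        # x:        (batch, n_words, dim) output of the alignment layer
        # soft_adj: (batch, n_words, n_words) attention-guided adjacency \tilde{A}^{(k)}
        outputs = []
        for layer in self.layers:
            g = torch.cat([x] + outputs, dim=-1)                  # g^{(l)} = [x; h^{(1)}; ...; h^{(l-1)}]
            h = torch.relu(layer(torch.bmm(soft_adj, g)))         # graph convolution with soft weights
            outputs.append(h)
        return torch.cat(outputs, dim=-1)                         # back to dimension d (= sub_dim * L)
```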
4.3.4 Multi-layer combination
In this layer, we combine the representations output by the N densely connected layers, corresponding to the N heads, to generate the final latent representations: $\tilde{h}_{out} = W_{out}\,[\tilde{h}^{(1)}; \dots; \tilde{h}^{(N)}] + b_{out}$ ,
where $\tilde{h}_{out}\in \mathbb{R}^d$ is the aggregated representation of N heads. $W_{out}$ and $b_{out}$ are the weights and biases learned during training.
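A minimal sketch of this combination step, assuming a single linear projection over the concatenated head outputs (names are illustrative):

```python
import torch
import torch.nn as nn

class MultiHeadCombination(nn.Module):
    """Combine the outputs of N densely connected blocks, one per head (illustrative sketch)."""

    def __init__(self, dim, num_heads):
        super().__init__()
        self.linear = nn.Linear(num_heads * dim, dim)   # W_out and b_out

    def forward(self, head_outputs):
        # head_outputs: list of N tensors of shape (batch, n_words, dim)
        combined = torch.cat(head_outputs, dim=-1)      # concatenate along the feature dimension
        return self.linear(combined)                    # \tilde{h}_out in R^d
```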
4.4 Classifier
This layer maps the final hidden state sequence H to the class metonymic or literal. The representation $H_i$ corresponds to the token $t_i$ . Specifically, $H_0$ denotes ‘[CLS]’ at the head of the subword sequence after tokenisation, which serves as the pooled embedding to represent the aggregate sequence.
Suppose that $\tilde{h}_{x'}, \dots, \tilde{h}_{y'}$ are the word representations for the entity span E output by the syntactic integration layer, where $x'$ and $y'$ are the start and end indices of the words in the entity span. We apply average pooling to obtain the final entity encoding $H_e = \frac{1}{y'-x'+1}\sum_{i=x'}^{y'} \tilde{h}_i$ .
For classification, we concatenate $H_0$ and $H_e$ and apply two fully connected layers with an activation function, followed by a softmax layer to make the final prediction. The learning objective is to predict the metonymic or literal class for an entity within a given sentence: $p(\hat{\gamma} \mid S, E) = \mathrm{softmax}\big(W^{*}\rho\big(W^{'}[H_0; H_e]\big)\big)$ ,
where $\hat{\gamma}$ refers to a class type in the metonymy type set $\Gamma$ , $W^{'}\in\mathbb{R}^{d \times 2d}$ , $W^{*}\in\mathbb{R}^{|\Gamma| \times d}$ , $|{\Gamma}|$ is the number of classes, and d is the dimension of the hidden representation vector. While there are only two classes in this task, the approach generalises to multiple classes.
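A sketch of the classification head described above is given below; the tanh activation and the use of softmax at inference time are assumptions for illustration (in training, logits with a cross-entropy loss would typically be used).

```python
import torch
import torch.nn as nn

class MetonymyClassifier(nn.Module):
    """[CLS] + entity-span representation -> metonymic / literal (illustrative sketch)."""

    def __init__(self, dim, num_classes=2):
        super().__init__()
        self.fc1 = nn.Linear(2 * dim, dim)       # W' in R^{d x 2d}
        self.fc2 = nn.Linear(dim, num_classes)   # W* in R^{|Gamma| x d}

    def forward(self, cls_repr, word_reprs, ent_start, ent_end):
        # cls_repr:   (batch, dim) pooled [CLS] representation H_0
        # word_reprs: (batch, n_words, dim) syntax-aware word representations
        # ent_start, ent_end: inclusive word indices x', y' of the entity span
        h_e = word_reprs[:, ent_start:ent_end + 1].mean(dim=1)        # average pooling over the entity span
        hidden = torch.tanh(self.fc1(torch.cat([cls_repr, h_e], dim=-1)))
        return torch.softmax(self.fc2(hidden), dim=-1)                # p(metonymic / literal)
```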
5. Experiments
5.1 Datasets
We conducted our experiments mainly on three publicly available benchmarks: two small location metonymy datasets, SemEval (Markert and Nissim Reference Markert and Nissim2007) and ReLocaR (Gritta et al. Reference Gritta, Pilehvar, Limsopatham and Collier2017), and one large dataset, WiMCor (Mathews and Strube Reference Mathews and Strube2020). SemEval and ReLocaR were created to evaluate the ability of a classifier to distinguish literal (geographical territories and political entities), metonymic (place-for-people, place-for-product, place-for-event, capital-for-government or place-for-organisation) and mixed (metonymic and literal frames invoked simultaneously, or indistinguishable) location mentions.
SemEval: The SemEval datasetFootnote e focuses on locations retrieved from the British National Corpus. The distribution of categories in the SemEval dataset is approximately 80% literal, 18% metonymic and 2% mixed, simulating the natural distribution of location metonymy; a default literal tag therefore already yields 80% precision. Although the dataset contains finer-grained labels of metonymic patterns, such as place-for-people, place-for-event or place-for-product, we use only the coarse-level labels metonymic or literal in our experiments. We excluded the mixed class since it accounts for only 2% of the data. The resulting dataset comprises training (910 samples) and testing (888 samples) partitions.
ReLocaR: The ReLocaR datasetFootnote f was collected from sample data returned by Wikipedia’s Random Article API. The class distribution of ReLocaR (literal, metonymic and mixed) is approximately 49%, 49% and 2%, respectively. We excluded mixed-class instances. The processed dataset contains 1026 training and 982 testing instances and has a more balanced label distribution, which removes the need to sub-sample the majority class to balance the classes.
WiMCor: The above datasets are limited in size, so we also conduct experiments on a large harvested corpus of location metonymy called WiMCor.Footnote g WiMCor is composed of a variety of location names, such as names of towns (e.g., ‘Bath’), cities (e.g., ‘Freiburg’) and states (e.g., ‘Texas’). The average sentence length in WiMCor is 80 tokens. While the samples in WiMCor are annotated with coarse-grained, medium-grained and fine-grained labels, only the coarse labels (binary, i.e., metonymic or literal) are used in our experiments. The training set contains 92,563 literal instances and 31,037 metonymic instances.
5.2 Setup
5.2.1 Data pre-processing
This section introduces the way to obtain dependency relation matrices. We performed dependency parsing using the spaCy parserFootnote h and transformed all dependency trees (one parse tree per sentence) into symmetric adjacency matrices, ignoring the dependency directions and types for simplicity. In preliminary work, we conducted experiments using asymmetric matrices, but we did not observe any improvements.
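The pre-processing described here can be reproduced roughly as in the sketch below, which parses a sentence with spaCy and fills a symmetric 0/1 adjacency matrix from the head-child arcs; the pipeline name en_core_web_sm is an assumption, and dependency directions and types are ignored as stated above.

```python
import numpy as np
import spacy

nlp = spacy.load("en_core_web_sm")   # any English spaCy pipeline with a dependency parser

def dependency_adjacency(sentence):
    """Return the word list and a symmetric 0/1 dependency adjacency matrix (illustrative sketch)."""
    doc = nlp(sentence)
    n = len(doc)
    adj = np.zeros((n, n), dtype=np.float32)
    for token in doc:
        if token.i != token.head.i:            # skip the root's self-arc
            adj[token.i, token.head.i] = 1.0   # ignore dependency direction ...
            adj[token.head.i, token.i] = 1.0   # ... and dependency type
    return [t.text for t in doc], adj

words, adj = dependency_adjacency("He later went to manage Malaysia for one year")
```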
For BERT variants, we followed Devlin et al. (Reference Devlin, Chang, Lee and Toutanova2019) and used the tokenizer in BERT to segment words into word pieces as discussed in Section 4.1. We inserted the special [ENT] indicator before and after the entity spans as Wu and He (Reference Wu and He2019) did for E-BERT experiments. To adapt the sequence length distribution corresponding to each dataset, we set the max sequence length to 256 for SemEval, and 128 for ReLocaR and WiMCor.
5.2.2 Comparison setting
To evaluate our approach, we compared our model with different previous models: SVM, BiLSTM and PLMs, for example, BERT and ELMo. We took the current SOTA MR system BERT+MASK (Li et al. Reference Li, Vasardani, Tomko and Baldwin2020) as the baseline, which did not include additional entity information. Following the best practices in Devlin et al. (Reference Devlin, Chang, Lee and Toutanova2019) and Guo et al. (Reference Guo, Zhang and Lu2019), we constructed the baseline and GCN models and set up the hyperparameters.
5.2.3 Training details
For all BERT-based models, we initialised the parameters of the BERT encoder using the pre-trained models released by Huggingface.Footnote i These models are pre-trained on a large corpus with two self-supervised tasks, masked language modelling and next sentence prediction (Devlin et al. Reference Devlin, Chang, Lee and Toutanova2019), so the sentence tokens are already well represented. We empirically found that the BERT-large-cased model, which is case-sensitive and contains 24 transformer encoding blocks, each with 16 self-attention heads and 1024 hidden units, provides the best performance on the experimental datasets. The numbers of sublayers $L_p$ for the GCN block and $L_n$ for the AGCN block are 2 and 4, respectively.
For the SemEval and ReLocaR datasets, we set the batch size to 8 and the number of training epochs to 20. For the WiMCor dataset, we trained for one epoch, which was sufficient to reach convergence. We chose the number of heads N for the multi-head attention from the set $\{1, 2, 4, 6, 8\}$ , and the initial learning rate lr for AdamW from the set $\{5\times10^{-6},1\times10^{-5},2\times10^{-5}\}$ . The smallest learning rate yields more stable convergence but tends to underfit during training; a proper value is $1\times10^{-5}$ or $2\times10^{-5}$ . We chose these hyperparameters based on extensive preliminary experiments, given the trade-off between time cost and performance on each dataset. The combinations ( $N=8,lr=1\times10^{-5}$ ), ( $N=4,lr=2\times10^{-5}$ ) and ( $N=4,lr=2\times10^{-5}$ ) provided the best results on the SemEval, ReLocaR and WiMCor datasets, respectively. E-BERT+AGCN requires approximately 1.5 times the GPU memory of BERT when training on a Tesla V100-16GB GPU.
5.3 Results
5.3.1 Models
We evaluated our proposed method by comparing it against a range of MR methods. The task of location MR here is to distinguish locations with a literal reading from those with any other (metonymic) reading. Following Gritta, Pilehvar and Collier (Reference Gritta, Pilehvar and Collier2020), we classify the entity phrase as either literal or metonymic. The baseline models used in our experiments are listed below.
SVM+Wiki: SVM+Wiki is the previous SOTA statistical model. It applies SVM with Wikipedia’s network of categories and articles, enabling the model to automatically discover new relations and their instances.
LSTM and BiLSTM: LSTM is one of the most powerful dynamic classifiers publicly known (Sundermeyer, Schlüter, and Ney Reference Sundermeyer, Schlüter and Ney2012). Thanks to its memory mechanism for retaining previous hidden states, it achieves decent results and is widely used for various NLP tasks (Gao et al. Reference Gao, Choi, Choi and Zettlemoyer2018; Si et al. Reference Si, Chen, Wang, Wang and Tan2019). Moreover, BiLSTM improves the token representation by taking the context from both directions into account (Hochreiter and Schmidhuber Reference Hochreiter and Schmidhuber1997), making true contextual reasoning available. Additionally, two kinds of representations, GloVe (Pennington, Socher and Manning Reference Pennington, Socher and Manning2014) and ELMo, are tested separately to ensure model reliability.
Paragraph, Immediate and PreWin: These three models are built upon BiLSTMs. They encode tokens as word vectors and dependency relation labels as one-hot vectors (generally, 5–10 tokens selected from the left and right of the entity work best). The three models differ in how tokens are selected. Immediate-x takes the x words immediately to the left and right of the entity as input to the model (Collobert et al. Reference Collobert, Weston, Bottou, Karlen, Kavukcuoglu and Kuksa2011; Mesnil et al. Reference Mesnil, He, Deng and Bengio2013; Mikolov et al. Reference Mikolov, Sutskever, Chen, Corrado and Dean2013; Baroni, Dinu, and Kruszewski Reference Baroni, Dinu and Kruszewski2014); for example, Immediate-5/10 takes the 5/10 words immediately to the left and right of the entity. The Paragraph model extends the Immediate model by taking more words (50) from each side of the entity as input. PreWin selects the words near the local predicate to eliminate long-distance noise in the input.
PreWin (BERT) is the reimplementation of the PreWin system with BERT embeddings as the input. Instead of deploying BERT as a classifier, we replace the original GloVe embeddings with BERT embeddings used in the PreWin model and initialise word embeddings using BERT embeddings. Word embeddings are combined by summing subword embeddings to generate GloVe-like word embeddings.
BERT, +AUG, +MASK: Three BERT-based MR models are described in Li et al. (Reference Li, Vasardani, Tomko and Baldwin2020). The vanilla BERT model (Devlin et al. Reference Devlin, Chang, Lee and Toutanova2019) can be directly used to detect metonymies by performing sentential classification. BERT encodes the input tokens into distributed vector representations after fine-tuning over datasets. BERT+AUG is fine-tuned with data augmentation (Li et al. Reference Li, Vasardani, Tomko and Baldwin2020). This method generates new samples by randomly substituting the target entity nominal with one from all the extracted target words. BERT+MASK fine-tunes the BERT model with target word masking that replaces the input target word with the single token [ENT] during training and evaluation.
E-BERT (sent) and E-BERT (sent+ent): Entity-aware BERT, namely E-BERT, enriches the semantic representations by incorporating entity information. The input to the E-BERT (sent) model differs slightly from the original dataset in that we insert [ENT] markers before and after the entity spans, making BERT aware of the entity position. The E-BERT (sent) model represents the sentence using the encoding at the [CLS] position. The E-BERT (sent+ent) model shares the same network structure as the R-BERT model (Wu and He Reference Wu and He2019) for RE, but it operates on a single entity. Concretely, this variant concatenates the sentential encoding with the target entity’s encoding.
E-BERT+GCN: This model applies a hard pruning strategy using GCN computation to integrate syntactic information into BERT representations. The [ENT] markers are inserted before and after the target entity span in the input sentences.
E-BERT+AGCN: We build the fully attentive system E-BERT+AGCN based on E-BERT+GCN. The attention-guided layer in E-BERT+AGCN employs a soft attention mechanism to assign proper weights to all dependencies. Figure 4 illustrates all BERT variants used in this paper, including BERT, E-BERT, E-BERT+GCN and E-BERT+AGCN.
5.3.2 Overall evaluation
We compared the averaged F1 and accuracy scores by running each model 10 times (see Table 2). In the accuracy comparison, the performance of the feature-based model SVM+Wiki is still superior to most of the recent DNN models. LSTMs yielded better results due to operations, such as structure variation modelling, word representation improvement and feature integration. Of note, NER and part-of-speech (POS) features have less of an effect on BiLSTM (ELMo). The semantics provided by POS/NER features might be redundant for ELMo representations. PreWin surpasses the baseline LSTM (GloVe) by a large margin on both ReLocaR and SemEval datasets. The result indicates the significance of syntactic information. In addition, the process of choosing and retaining related tokens is also a major contributor to PreWin, resulting in at least 1.2 and 2.2 points higher than Paragraph and Immediate on SemEval and ReLocaR, respectively.
The results of E-BERT+GCN and E-BERT+AGCN show that our model is able to leverage syntactic information that is useful for MR and demonstrates its advantages over previous works with SOTA results. Specifically, E-BERT+AGCN considerably outperforms the previous model based on heavy feature engineering, that is, SVM+Wiki. Our model also surpassed previous DNN models, including LSTM, BiLSTM and PreWin, even when enriched with POS and NER features. Furthermore, we compared E-BERT+AGCN with two baseline models: E-BERT (the entity-aware BERT model without syntactic integration) and E-BERT+GCN (imposing hard syntactic constraints with GCN).
Moreover, E-BERT+GCN achieves accuracy 0.3% and 0.2% higher than E-BERT (sent+ent) on the SemEval and ReLocaR datasets, respectively; the GCN improves performance by capturing useful information from the syntax. The hard pruning behaviour of Immediate 5 applied to E-BERT+GCN has little effect, which suggests that pruning graphs crudely may be counterproductive. E-BERT+AGCN obtains further improvements of 0.7% and 0.2% on the SemEval and ReLocaR datasets, respectively, compared with E-BERT+GCN. Therefore, introducing a multi-head attention mechanism that assists GCNs in information aggregation appears successful. The standard deviation of E-BERT+AGCN is also lower than that of E-BERT+GCN, indicating more robust performance. Our approach effectively incorporates soft dependency constraints into MR models by pruning irrelevant information and emphasising the dominant relations that serve as indicators.
We also report F1 scores for the literal class and the metonymic class separately. ReLocaR is a class-balanced dataset, with literal and metonymic each accounting for 50% of the examples in the training set. The F1 scores on ReLocaR are relatively higher than those on the SemEval dataset due to the shorter sentence length. In the ReLocaR rows, the F1 of both classes shows a slight improvement over the baseline E-BERT and E-BERT+GCN, although the standard deviations are relatively high. Conversely, SemEval is a benchmark with literal and metonymic accounting for 80% and 20%, respectively. This imbalance causes a lack of metonymic evidence, making model learning insufficient. As reflected in Table 2, earlier models such as LSTM have inferior F1 performance on the metonymic class compared with the literal class. The considerable performance gaps of 3.4% and 4.0% in F1-M between BERT and E-BERT+AGCN show that E-BERT+AGCN is more powerful in capturing syntactic clues to overcome the sample limitation. To summarise, E-BERT+AGCN achieves the highest F1 scores on both SemEval and ReLocaR and is able to adapt to various class distributions.
In addition, to verify the effectiveness of our model on a larger dataset, we ran the experiment on the WiMCor dataset; Table 3 gives the results. Although the increase in accuracy and F1 is not substantial, our model yields a 0.2 percentage point improvement over E-BERT, which is meaningful given that the WiMCor test set contains 41,200 instances.
5.3.3 Cross-domain evaluation
Since the metonymy datasets were created using different annotation guidelines, it is informative to study the generalisation ability of the developed models. Thus, we ran cross-domain experiments under the following configurations: WiMCor $\rightarrow$ ReLocaR, WiMCor $\rightarrow$ SemEval, SemEval $\rightarrow$ ReLocaR and ReLocaR $\rightarrow$ SemEval. For each configuration, we trained the models on one dataset and tested them on another, comparing the three models E-BERT, E-BERT+GCN and E-BERT+AGCN. As shown in Table 4, the results for the WiMCor training set indicate the robustness of E-BERT+AGCN on a large benchmark. Although incorporating hard syntactic constraints improves the MR results slightly, the soft constraint is more effective than the hard constraint in terms of accuracy.
5.3.4 Sentence length
We further compared the accuracy of E-BERT+AGCN and E-BERT with different sentence lengths, as shown in Figure 5. The experiment is conducted on both SemEval and the ReLocaR datasets. Note that the average sentence length of ReLocaR is shorter than that of the SemEval.
To highlight the issues, we primarily discuss the SemEval dataset. Long sentences are likely to hurt classification accuracy for two reasons: (1) contextual meanings of long sentences are more difficult to capture and represent; and (2) key tokens, such as the predicate, can be far from the entity in a sentence and are therefore difficult to identify. Thus, it is challenging for sequence-based models to retain adequate performance when tested on long sequences. BERT fails to utilise structural information such as dependency trees, which has been proven to benefit NLP tasks, and previous studies (Tang et al. Reference Tang, Muller, Gonzales and Sennrich2018) have shown that BERT handles non-local (long-distance) syntactic relations poorly. As shown in Figure 5, BERT’s accuracy drops as the sentence length grows. In this case, a dependency-based model is more suitable for handling long-distance relations while reducing computational complexity. E-BERT+AGCN alleviates such performance degradation and outperforms the two baselines in all buckets, and the improvement becomes more pronounced as the sentence length increases ( $\geq 30$ on ReLocaR and $\geq 66$ on SemEval). The results in Figure 5 confirm that E-BERT+AGCN produces better entity and contextual representations for MR, especially for longer sentences.
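For completeness, a bucketed accuracy analysis of this kind can be computed with a short helper such as the sketch below; the bucket boundaries and data format are illustrative assumptions, not the exact buckets of Figure 5.

```python
from collections import defaultdict

def accuracy_by_length(examples, predictions, boundaries=(15, 30, 45, 66)):
    """Group (tokens, gold) pairs into sentence-length buckets and report accuracy per bucket.

    examples:    list of (tokens, gold_label); predictions: list of predicted labels.
    boundaries:  illustrative upper bounds for the length buckets.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for (tokens, gold), pred in zip(examples, predictions):
        bucket = next((b for b in boundaries if len(tokens) <= b), f">{boundaries[-1]}")
        total[bucket] += 1
        correct[bucket] += int(pred == gold)
    return {b: correct[b] / total[b] for b in total}
```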
5.3.5 Case study
This section presents a case study using a motivating example that is correctly classified by E-BERT+AGCN but misclassified by E-BERT, to show the effectiveness of our model. Given the sentence ‘He later went to manage Malaysia for one year’, a native speaker can easily identify ‘Malaysia’ as metonymic by linking ‘Malaysia’ to the concept of ‘the national football team of Malaysia’. In this sentence, since the verb phrase ‘went to’ is a strong indicator, E-BERT is prone to overlook the second predicate ‘manage’. As a result, E-BERT falsely recognises ‘Malaysia’ as a literal territory (owing to the customary usage of ‘went to somewhere’). We can explain how this problem is resolved in E-BERT+AGCN by visualising the attention weights of the model. We first examine the attention matrices of the transformer encoder blocks in E-BERT to check the contribution of syntactic integration to the whole model. Figure 6(a) shows that the token weights in BERT are decentralised. Intuitively, E-BERT provides insufficient semantic knowledge for MR, losing useful information about which target words should be attended to. In Figure 6(b), the attention in E-BERT+AGCN concentrates on the predicate ‘manage’ and the entity ‘Malaysia’ rather than on ‘went to’ and other tokens. With the integration of syntactic components, the dependency information helps the model become aware of the sentence structure. As a result, our model selects relevant tokens and discards irrelevant and misleading ones.
Furthermore, the example sentence can be divided into the main clause ‘He later went to manage Malaysia’ and the prepositional phrase ‘for one year’. The main clause contains the predicate and the entity that dominate the MR inference. However, conventional methods give the modifier relation between ‘one’ and ‘year’, as well as other irrelevant connections, the same weight, which introduces massive noise into feature extraction.
As shown in Figure 6(b), the prepositional phrase ‘for one year’ is irrelevant to the MR task. Despite the existence of dependency relations for the prepositional phrase, the weights of those relations are relatively lower compared with the main clause, which includes the verb and its dependent words. After launching the multi-head attention mechanism, the model is free from fixed pruning rules and flexibly learns the connections among tokens.
The syntactic component (GCN block) first selects relevant syntactic features efficiently given the hard dependency adjacency matrix (see Figure 6(c)). Then, the attention-guided layer learns the soft attention matrix. To demonstrate the benefit of soft dependency relations, Figure 6(d) visualises the attention weights of the attention-guided layer. Unlike the attention in the BERT encoding layer, the attention matrix of the attention-guided layer reflects more information about dependency relationships. The GCN is prone to trust information from all of its one-hop neighbours in the dependency graph while overlooking more distant nodes. In contrast, the AGCN uses multi-head attention to jointly attend to different representation subspaces, reducing information loss.
6. Error analysis
In most cases shown in Table 5, E-BERT+AGCN makes correct predictions. However, several typical issues remain unsolved. We discuss three types of such errors here.
6.1 Error propagation
E-BERT+AGCN predicts the entity type given dependency relations and contextualised embeddings. Concretely, S1 shows an example in which the phrase ‘Marseille 2015’ is isolated in the dependency parse tree: the subtree of ‘Marseille 2015’ and the remaining parts are split by the parentheses. In this case, E-BERT+AGCN fails to recognise ‘Marseille’ as ‘a sport event in Marseille’. We found that the connection between ‘beats’ and ‘Marseille’ is missing in the parse tree.
6.2 Multiple predicates
The predicate plays a crucial role in understanding a sentence (Shibata, Kawahara, and Kurohashi Reference Shibata, Kawahara and Kurohashi2016). In MR tasks, if an entity exerts an action on others, it is probably a metonym. S2 is a typical example with a long distance between the predicate and the entity. In this situation, BERT fails to classify the entity correctly because it cannot capture non-local features over long distances. Meanwhile, E-BERT+AGCN correctly predicts the metonymy by relying on the entity and the related predicate ‘engaged’. In other words, E-BERT+AGCN can mine strong connections in sentences with relatively little effort. This observation again shows that the syntactic component is effective at locating key words.
In more complex cases, our method might fail to detect metonymy. For example, in S3, conventional models might easily find the predicate ‘uprising’. The event participant is ‘Schleswig-Holstein’s large German majority’ rather than ‘Schleswig-Holstein’. E-BERT+AGCN could not trace the predicate and made an incorrect prediction even though it was aware of syntactic structural knowledge.
6.3 Knowledge deficiency
Many metonymies are proper nouns that refer to existing well-known works or events. Previous models struggle because they lack the real-world and common-sense knowledge needed to access the referents, which results in poor interpretability. For instance, in sentence S4, ‘South Pacific’ refers to a film from 1958. E-BERT fails to recognise this metonym, while E-BERT+AGCN successfully extends the reading of ‘South Pacific’ to the film called South Pacific thanks to the dependency between ‘film’ and ‘South Pacific’. S5 is one of the failed cases: E-BERT+AGCN fails to detect the metonymy, even though an explanation of ‘Hillsborough’ as referring to the ‘Fever Ship’ has been mentioned in the discourse. In fact, ‘Hillsborough’ is a ship name involved in an unconventional metonymy. As discussed in Section 2, identifying such a logical metonymy is difficult since the interpretation requires additional, inferred information.
7. Conclusion and future work
This paper shows the success of a neural architecture deploying a dependency-guided network to capture non-local clues for MR. This approach incorporates hard and soft dependency constraints into MR, enabling context- and syntax-aware representations. Experimental results and analyses on MR benchmark datasets showed that our proposed architecture surpasses previous approaches. Our work also demonstrates the importance of syntax for NLP applications.
There are several potential directions for future research. For example, further introducing dependency types (Tian et al. Reference Tian, Chen, Song and Wan2021) into the GCN variations and using external knowledge bases (Mihaylov and Frank Reference Mihaylov and Frank2018; Lin et al. Reference Lin, Chen, Chen and Ren2019a; Yang et al. Reference Yang, Wang, Liu, Liu, Lyu, Wu, She and Li2019) to mine latent relations appear to be of interest. To make full use of the prior knowledge for MR, we also plan to replace BERT with knowledge-enhanced PLMs (Zhang et al. Reference Zhang, Han, Liu, Jiang, Sun and Liu2019).
Acknowledgement
We thank the anonymous reviewers for their valuable comments. This work was supported by Shanghai Science and Technology Young Talents Sailing Program 21YF1413900, Fundamental Research Funds for the Central Universities (43800-20101-222340) and in part by the National Natural Science Foundation of China under grants 91746203, 61991410 and the National Key R&D Program of China under grant 2018AAA0102804.
A. Comparison of data size
Figure A.1 shows the performance of E-BERT, E-BERT+GCN and E-BERT+AGCN under different settings of training data size. Since the SemEval dataset has fewer metonymic instances, we conducted this experiment on ReLocaR only. Using only 20% of the training data, both syntax-enhanced models achieve F1 scores near 90, which demonstrates the robustness of E-BERT+GCN and E-BERT+AGCN. Under the same data size setting, E-BERT+AGCN substantially outperforms E-BERT, and the performance gap between them is always larger than 0.4%. This observation suggests that E-BERT+AGCN generalises better than E-BERT, especially on small datasets.