1 Introduction to the special issue
Machine learning, together with techniques borrowed from other related fields, such as statistics and optimization, has become indispensable for large part of Computational Linguistics and Natural Language Processing (NLP) research and applications. On the one hand, these techniques have enhanced systems' performance and have significantly sped up some development and knowledge-acquisition phases. On the other hand, their use requires careful parameter tuning and, above all, engineering of machine-based representations of natural language phenomena, e.g., by means of features that sometimes detach from the common sense interpretation of such phenomena. These difficulties become more marked when the input/output data have a structured and relational form: The designer has both to engineer features for representing the system input, e.g., the syntactic parse-tree of a sentence, and devise methods for generating the output, e.g., by building a set of classifiers that provide boundaries and types (semantic argument, function, or concept type) of some of the parse-tree constituents.
Research in empirical NLP has been tackling these complexities since the early work in the field. For instance, word sense tagging is a problem in which the input, i.e., word sequences, and output, i.e., sense tag sequences, are structured. However, the designed models were mainly based on local information, e.g., the model in Yarowsky (Reference Yarowsky1992) for word sense disambiguation. Other approaches attempted to improve locality by encoding rules capturing more global information. For example, Brill's POS tagger (Brill Reference Brill1992) learned and applied correcting rules expressing some global view of the sequence. More recently, the idea of global model is encoded by post-processing heuristics, which aim at correcting incompatible output – see some semantic role-labeling systems in Carreras and Màrquez (Reference Carreras and Màrquez2005). The use of such ad hoc solutions was mainly due to the lack of statistical and machine learning theory suggesting how models should be designed for capturing dependencies among the items in the structured data.
In contrast, recent work in machine learning has provided different paradigms to globally represent and process such data. The key ingredients of all these approaches are as follows: (i) They define how to score globally the appropriateness of complete output candidate structures (or hypotheses) for a given input; (ii) learning is performed to optimize a global objective function; (iii) features used to score the hypotheses can encode any aspect of the input and output structures (usually with some limitations on the output); and (iv) a search procedure (also known as decoding or inference) is defined over the space of possible output structures for finding the one with the highest score. For instance, we may find the following:
• Graphical models, which encode observations and labels as nodes of graphs, where the edges define dependencies among them, weighted with probabilities. Inference algorithms are defined over these graphs in order to find the most probable assignment to variables (i.e., defining the output structure) for a given input example. Learning consists in acquiring the weights of all model parameters. Conditional Random Fields (CRFs) are one of the most representative examples of such models (Lafferty, McCallum and Pereira Reference Lafferty, McCallum and Pereira2001).
• Non-probabilistic discriminative approaches, including global linear models such as the simple structured perceptron (Collins and Duffy Reference Collins and Duffy2002) and other more sophisticated Max-Margin learning algorithms, such as SVMstruct (Tsochantaridis et al. Reference Tsochantaridis, Joachims, Hofmann and Altun2005) and Max-Margin Markov Networks (Taskar, Guestrin and Koller Reference Taskar, Guestrin, Koller, Thrun, Saul and Schölkopf2004), which are extensions of the well-known Support Vector Machines (SVMs) (Vapnik Reference Vapnik1998) into the structure prediction scenario. Interestingly, all these methods are instances of kernel machines. By using structural kernels (e.g., convolution kernels) one can encode complex input data, such as the examples appearing in NLP problems (Shawe-Taylor and Cristianini Reference Shawe-Taylor and Cristianini2004).
However, the learning paradigms above may have drawbacks related to complexity and efficiency, which can limit their use in practical situations. For instance, parameter estimation and exact decoding can be computationally prohibitive for some real world models and applications.
In parallel with the previous research, alternative approaches have been studied. In particular, constrained conditional models and re-ranking. In the former the predictions of several locally trained classifiers are combined to score complete candidate solutions for a given input. Integer Linear Programming is usually used to efficiently find the best scored global solution under a set of structural and problem-specific constraints (e.g., Punyakanok et al. Reference Punyakanok, Roth, Yih and Zimak2005; Chang et al. Reference Chang, Ratinov, Rizzolo and Roth2008). The latter are based on the efficient coupling of basic local models for generating lists of hypotheses for every input and classifiers (re-rankers), which are used to select the best output structure from the list of hypotheses. Noticeably, such classifiers can exploit the global information (the dependencies among observations and labels) available in the hypotheses to learn the ranking task.
In summary, on the one hand, there is an interesting theory that needs to be further explored both to improve accuracy and efficiency of the algorithms. On the other hand, there are more practical approaches, based on simple classifiers, which are typically very fast and can achieve state-of-the-art results when accurately designed for specific tasks (Collins Reference Collins2000; Shen, Sarkar and Och Reference Shen, Sarkar and Och2004; Carreras and Màrquez Reference Carreras and Màrquez2005; Charniak and Johnson Reference Charniak and Johnson2005; Koo and Collins Reference Koo and Collins2005; Kudo, Suzuki and Isozaki Reference Kudo, Suzuki and Isozaki2005; Ge and Mooney Reference Ge and Mooney2006; Roark et al. Reference Roark, Liu, Harper, Stewart, Lease, Snover, Shafran, Dorr, Hale, Krasnyanskaya and Yung2006; Moschitti et al. Reference Moschitti, Quarteroni, Basili and Manandhar2007; Huang Reference Huang2008; Punyakanok, Roth and Yih Reference Punyakanok, Roth and Yih2008; Dinarelli, Moschitti and Riccardi Reference Dinarelli, Moschitti and Riccardi2009; Nguyen, Moschitti and Riccardi Reference Nguyen, Moschitti and Riccardi2009).
This special issue aimed at collecting and presenting extensions, findings (both empirical and theoretical), and new directions within the framework of the above research. In addition, we were also interested in the comparison of different learning paradigms for dealing with structured NLP data, as well as in the adaptation or generalization of already existing methods to novel NLP tasks or usages. As we will see in the following section, some of these aspects are effectively addressed by the papers presented in this special issue.
2 Content overview
The special issue callFootnote 1 received thirteen high-quality papers, out of which only five (for space reasons) could be included in the journal. Most of the excluded papers were recommended for publication in the Journal of Natural Language Engineering following its standard editorial process. Interestingly, the special issue call was coordinated with the TextGraphs 2011 workshop,Footnote 2 which also included a special theme on ‘Graphs in Structured Input/Output Learning’. Out of the five accepted papers, two of them are expanded versions of papers previously published at TextGraphs. These two were selected as the two best contributions from the workshop, according to the special issue topic, and invited to submit to the special issue. The acceptance process consisted of two-review rounds,Footnote 3 in which every paper was reviewed by three members of the guest editorial board, and one more round for preparing the camera-ready versions. The five papers selected by this highly accurate review process, address various aspects of NLP and machine learning. Document ranking, syntactic parsing and sequence labeling, and relation extraction between terms and entities were tackled with different structured input/output approaches. These ranged from re-ranking to joint constraint-based global models as described in what follows.
In the first paper, Villatoro et al. introduce a novel method to re-rank the list of documents returned by an Information Retrieval (IR) system. This approach is based on a Markov Random Field (MRF) model, which classifies the retrieved documents as relevant or irrelevant. The proposed MRF combines (i) information provided by the base IR system, (ii) similarities among documents in the retrieved list, and (iii) relevance feedback. The experiments show that the joint model can improve on the mean average precision used by state-of-the-art retrieval engines.
In the second paper, Søgaard illustrates an innovative approach to unsupervised dependency parsing. This is based on the assumption that a dependency structure is also a partial order on the nodes in terms of centrality or saliency. By exploiting this property, the author sorts input words by means of a ranking model and obtains the dependency parse. For this purpose, a simple deterministic parsing algorithm is applied, which relies on the universal dependency rules defined by Naseem et al. (Reference Naseem, Chen, Barzilay and Johnson2010). The resulting unsupervised parser is shown to be competitive to state-of-the-art unsupervised dependency parsers in a variety of languages.
In the third paper, Zhang et al. approach a rich sequence-labeling task, namely, supertagging by exploiting, in addition to the standard adjacent label features, long-range information. Capturing these long-distance dependencies proves to be very important for the task. Authors propose word-to-word dependencies derived from a dependency parser and long-range features encoded as soft constraints in the training. The tagger is learnt in a grammar-satisfying space and uses a CFG filter to impose grammar constraints for the model parameter's update. The experiments show that the proposed structure-guided supertaggers improve on the baseline models. The structured model also results in an increase of the F-score of the final parser, achieving a competitive performance at higher parsing speed than state-of-the-art HPSG parsers.
The fourth paper by Do and Roth proposes a machine learning approach, which, by exploiting existing resources, such as Wikipedia, determines the taxonomic relation between pairs of terms, e.g., car is an ancestor of Toyota Camry. The model is based on a global constraint-based inference process that leverages an existing knowledge base to enforce relational constraints among terms. The experiments show that the global model significantly outperforms other systems built upon existing knowledge sources.
Finally, Li et al. describe topic models jointly used with entity relation detection. The text between pairs of named entities is used to define mini-documents, which are characterized by topic distributions. The resulting system is based on Maximum Entropy Discriminant Latent Dirichlet Allocation (MedLDA) with mixed membership for relation detection. Such membership formulation enables the system to incorporate heterogeneous features. The experiments on standard relation extraction corpus show that the proposed system improves the baselines, i.e., SVMs and LDA, respectively.
Acknowledgments
The guest editors would like to express their gratitude to all contributing authors, the executive editor, Ruslan Mitkov, the editors and the staff of the Journal of Natural Language Engineering. We are especially grateful to Stephen Clark and Hwee Tou Ng, who agreed to be the representatives of the JNLE editorial board in our guest editorial board for the special issue, and to Irina Temnikova and Natalia Konstantinova, who greatly helped us during the whole process. Last but not least, the high quality reviewing and selection process could not have been possible without the help of our expert guest editorial board, composed by Roberto Basili, Ulf Brefeld, Razvan Bunescu, Nicola Cancedda, Xavier Carreras, Stephen Clark, Trevor Cohn, Walter Daelemans, Hal Daumé, Jason Eisner, James Henderson, Liang Huang, Richard Johansson, Terry Koo, Mirella Lapata, Yuji Matsumoto, Ryan McDonald, Raymond Mooney, Hwee Tou Ng, Sebastian Riedel, Dan Roth, Mihai Surdeanu, Ivan Titov, Kristina Toutanova, Jun'ichi Tsujii, Antal van den Bosch, Scott Yih, Fabio Massimo Zanzotto, and Min Zhang.
Guest editors' work is partially supported by the European Community's Seventh Framework Programme (FP7/2007-2013) under grant numbers 247758 (EternalS), 288024 (LiMoSINe), 247762 (FAUST), and 247914 (MOLTO).