Natural Language Engineering: Volume 22 - Issue 6

Revisiting the ontologising of semantic relation arguments in wordnet synsets
HUGO GONÇALO OLIVEIRA, PAULO GOMES
Published online by Cambridge University Press: 22 July 2015, pp. 819-848
Ontologising is the task of associating terms in text with a representation of their meaning in an ontology. In this article, we revisit algorithms that have previously been used to ontologise the arguments of semantic relations in a relationless thesaurus, resulting in a wordnet. For increased flexibility, the algorithms do not use the extraction context when selecting the most adequate synsets for each term argument. Instead, they exploit a term-based lexical network, which can be built from automatically extracted knowledge or obtained from the resource to which the relations are being ontologised. Building on the latter idea, we ran several experiments and conclude that the algorithms can be used both for creating wordnets and for enriching them. Besides describing the algorithms in some detail, we report and discuss these experiments, which target both English and Portuguese, and their results.

ISO standard modeling of a large Arabic dictionary
AIDA KHEMAKHEM, BILEL GARGOURI, ABDELMAJID BEN HAMADOU, GIL FRANCOPOULO
Published online by Cambridge University Press: 07 September 2015, pp. 849-879
In this paper, we address the problem of large-coverage dictionaries of the Arabic language that are usable both for direct human reading and for automatic Natural Language Processing. For these purposes, we propose a normalized and implemented model, based on the Lexical Markup Framework (LMF, ISO 24613) and the Data Category Registry (DCR, ISO 12620), which allows stable and well-defined interoperability of lexical resources through a unification of linguistic concepts. Starting from the features of the Arabic language, and because a large range of details and refinements need to be described specifically for Arabic, we follow a fine-grained structuring strategy. Besides being rich in morphological, syntactic and semantic knowledge, our model includes all the Arabic morphological patterns needed to generate the inflected forms from a given lemma, and it highlights the syntactic–semantic relations. In addition, an appropriate codification has been designed for managing all types of relationships among lexical entries and their related knowledge. According to this model, a dictionary named El Madar has been built and is now publicly available online. The data are managed by a user-friendly Web-based lexicographical workstation. This work has not been done in isolation, but is the result of a collaborative effort by an international team, mainly within the ISO network, over a period of eight years.

Modernising historical Slovene words
YVES SCHERRER, TOMAŽ ERJAVEC
Published online by Cambridge University Press: 03 August 2015, pp. 881-905
We propose a language-independent word normalisation method and illustrate it by modernising historical Slovene words. Our method relies on character-level statistical machine translation (CSMT) and uses only shallow knowledge. We present relevant data on historical Slovene, consisting of two (partially) manually annotated corpora and the lexicons derived from these corpora, containing historical word–modern word pairs. The two lexicons are disjoint, with one serving as the training set containing 40,000 entries, and the other as a test set with 20,000 entries. The data span the years 1750–1900, and the lexicons are split into fifty-year slices, with all the experiments carried out separately on the three time periods. We perform two sets of experiments. In the first one – a supervised setting – we build a CSMT system using the lexicon of word pairs as training data. In the second one – an unsupervised setting – we simulate a scenario in which word pairs are not available. We propose a two-step method where we first extract a noisy list of word pairs by matching historical words with cognate modern words, and then train a CSMT system on these pairs. In both sets of experiments, we also optionally make use of a lexicon of modern words to filter the modernisation hypotheses. While we show that both methods produce significantly better results than the baselines, their accuracy and the question of which method works best correlate strongly with the age of the texts, meaning that the choice of the best method will depend on the properties of the historical language which is to be modernised. As an extrinsic evaluation, we also compare the quality of part-of-speech tagging and lemmatisation directly on historical text and on its modernised words. We show that, depending on the age of the text, annotation on modernised words also produces significantly better results than annotation on the original text.

Phonetisaurus: Exploring grapheme-to-phoneme conversion with joint n-gram models in the WFST framework
JOSEF ROBERT NOVAK, NOBUAKI MINEMATSU, KEIKICHI HIROSE
Published online by Cambridge University Press: 07 September 2015, pp. 907-938
This paper provides an analysis of several practical issues related to the theory and implementation of Grapheme-to-Phoneme (G2P) conversion systems utilizing the Weighted Finite-State Transducer paradigm. The paper addresses issues related to system accuracy, training time and practical implementation. The focus is on joint n-gram models which have proven to provide an excellent trade-off between system accuracy and training complexity. The paper argues in favor of simple, productive approaches to G2P, which favor a balance between training time, accuracy and model complexity. The paper also introduces the first instance of using joint sequence RnnLMs directly for G2P conversion, and achieves new state-of-the-art performance via ensemble methods combining RnnLMs and n-gram based models. In addition to detailed descriptions of the approach, minor yet novel implementation solutions, and experimental results, the paper introduces Phonetisaurus, a fully-functional, flexible, open-source, BSD-licensed G2P conversion toolkit, which leverages the OpenFst library. The work is intended to be accessible to a broad range of readers.

Data-driven deep-syntactic dependency parsing
MIGUEL BALLESTEROS, BERND BOHNET, SIMON MILLE, LEO WANNER
Published online by Cambridge University Press: 18 August 2015, pp. 939-974
‘Deep-syntactic’ dependency structures that capture the argumentative, attributive and coordinative relations between the full words of a sentence have great potential for a number of NLP applications. The degree of abstraction of these structures lies between the output of a syntactic dependency parser (connected trees defined over all words of a sentence and language-specific grammatical functions) and the output of a semantic parser (forests of trees defined over individual lexemes or phrasal chunks and abstract semantic role labels, which capture the frame structures of predicative elements and drop all attributive and coordinative dependencies). We propose a parser that provides deep-syntactic structures. The parser has been tested on Spanish, English and Chinese.

Editor’s Preface to Emerging Trends
Published online by Cambridge University Press: 13 October 2016, p. 975

The next generation
KENNETH WARD CHURCH
Published online by Cambridge University Press: 13 October 2016, pp. 977-980
Open access
I’m sure you want me to tell you about the next new emerging trend, but I’m not going to do that. It is much easier to suggest where trends come from (the next generation), and how to distinguish passing fads (bubbles) from emerging trends. Young people are often the early adopters, the first to see what is about to happen, but most people don’t see what’s coming until well after the fact. Those with the most to lose (the establishment) tend to be the most resistant to change.

NLE volume 22 issue 6 Cover and Front matter
Published online by Cambridge University Press: 13 October 2016, pp. f1-f2

NLE volume 22 issue 6 Cover and Back matter
Published online by Cambridge University Press: 13 October 2016, pp. b1-b7

Volume 22 - Issue 6 - November 2016

Articles

Revisiting the ontologising of semantic relation arguments in wordnet synsets

ISO standard modeling of a large Arabic dictionary

Modernising historical Slovene words

Phonetisaurus: Exploring grapheme-to-phoneme conversion with joint n-gram models in the WFST framework

Data-driven deep-syntactic dependency parsing

Editorial Note

Editor’s Preface to Emerging Trends

Emerging Trends

The next generation

Front Cover (OFC, IFC) and matter

NLE volume 22 issue 6 Cover and Front matter

Back Cover (IBC, OBC) and matter

NLE volume 22 issue 6 Cover and Back matter
