Published online by Cambridge University Press: 27 September 2016
A key challenge of the Information Extraction in Natural Language Processing is the ability to recognise and classify temporal expressions (timexes). It is a crucial source of information about when something happens, how often something occurs or how long something lasts. Timexes extracted automatically from text, play a major role in many Information Extraction systems, such as question answering or event recognition. We prepared a broad specification of Polish timexes – PLIMEX. It is based on the state-of-the-art annotation guidelines for English, mainly TIMEX2 and TIMEX3 (a part of TimeML – Markup Language for Temporal and Event Expressions). We have expanded our specification for a description of the local meaning of timexes, based on LTIMEX annotation guidelines for English. Temporal description supports further event identification and extends event description model, focussing on anchoring events in time, events ordering and reasoning about the persistence of events. We prepared the specification, which is designed to address these issues, and we annotated all documents in Polish Corpus of Wroclaw University of Technology (KPWr) using our annotation guidelines. We also adapted our Liner2 machine learning system to recognise Polish timexes and we propose two-phase method to select a subset of features for Conditional Random Fields sequence labelling method. This article presents the whole process of corpus annotation, evaluation of inter-annotator agreement, extending Liner2 system with new features and evaluation of the recognition models before and after feature selection with the analysis of statistical significance of differences. Liner2 with presented models is available as open source software under the GNU General Public License.
Work financed as part of the investment in the CLARIN-PL research infrastructure funded by the Polish Ministry of Science and Higher Education.