Supervised approach to recognise Polish temporal expressions and rule-based interpretation of timexes†

JAN KOCOŃ; MICHAŁ MARCIŃCZUK

doi:10.1017/S1351324916000255

Published online by Cambridge University Press: 27 September 2016

JAN KOCOŃ: Affiliation:
Department of Computational Intelligence, Wrocław University of Technology, Wybrzeże Wyspiańskiego 27, Wrocław, Poland
MICHAŁ MARCIŃCZUK: Affiliation:
Department of Computational Intelligence, Wrocław University of Technology, Wybrzeże Wyspiańskiego 27, Wrocław, Poland

Abstract

A key challenge of Information Extraction in Natural Language Processing is recognising and classifying temporal expressions (timexes). Timexes are a crucial source of information about when something happens, how often it occurs or how long it lasts. Timexes extracted automatically from text play a major role in many Information Extraction systems, such as those for question answering or event recognition. We prepared a broad specification of Polish timexes, PLIMEX. It is based on the state-of-the-art annotation guidelines for English, mainly TIMEX2 and TIMEX3 (a part of TimeML, the Markup Language for Temporal and Event Expressions). We extended the specification with a description of the local meaning of timexes, based on the LTIMEX annotation guidelines for English. Temporal description supports further event identification and extends the event description model, focussing on anchoring events in time, ordering events and reasoning about the persistence of events. The specification is designed to address these issues, and we annotated all documents in the Polish Corpus of Wrocław University of Technology (KPWr) according to our annotation guidelines. We also adapted our Liner2 machine learning system to recognise Polish timexes, and we propose a two-phase method for selecting a subset of features for the Conditional Random Fields sequence labelling method. This article presents the whole process: corpus annotation, evaluation of inter-annotator agreement, extension of the Liner2 system with new features, and evaluation of the recognition models before and after feature selection, including an analysis of the statistical significance of the differences. Liner2, together with the presented models, is available as open-source software under the GNU General Public License.

Type: Articles
Information: Natural Language Engineering, Volume 23, Issue 3, May 2017, pp. 385–418

DOI: https://doi.org/10.1017/S1351324916000255
Copyright: Copyright © Cambridge University Press 2016

Footnotes

† Work financed as part of the investment in the CLARIN-PL research infrastructure funded by the Polish Ministry of Science and Higher Education.
