A Semi-automatic and low-cost method to learn patterns for named entity recognition*

M. MARRERO; J. URBANO

doi:10.1017/S135132491700016X

Published online by Cambridge University Press: 15 June 2017

M. MARRERO: Affiliation:
Barcelona Supercomputing Center, Carrer de Jordi Girona, 29-31, 08034 Barcelona, Spain
J. URBANO: Affiliation:
Delft University of Technology, Mekelweg 4, 2628 CD Delft, The Netherlands

Abstract

Named Entity Recognition is a basic task in Information Extraction that aims at identifying entities of interest within full text documents. The patterns used to recognize entities can be rule based, as in the popular JAPE system. However, hand-crafting effective patterns is often difficult, and yet there is little research devoted to methods capable of learning human-readable patterns, possibly with arbitrary sets of features. In this paper, we present a semi-automatic method to generate both regular expressions and a subset of the JAPE language. It does not need a corpus annotated beforehand. Instead, it employs active learning and combines clustering with an algorithm that finds alignments between symbols present in the entities discovered during the learning process. The method currently supports a fixed set of character features and an arbitrary set of token features, but it can incorporate other kinds of features as well. Through several experiments with an English corpus, we show the ability of the method to generate effective patterns at a low annotation cost, and how it can successfully help in the annotation of brand new corpora.
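To make the alignment idea concrete, the following is a minimal sketch and not the authors' algorithm. It assumes a hypothetical set of three character classes, maps two example entities to sequences of symbols, aligns the sequences with Python's standard `difflib.SequenceMatcher`, and marks symbols that appear in only one example as optional. The names `CLASSES`, `symbols`, `align` and `learn_pattern` are illustrative only, and the active learning and clustering steps are left out.

```python
import re
from difflib import SequenceMatcher

# Hypothetical character classes; the abstract does not list the actual feature set.
CLASSES = ["[A-Z]", "[a-z]", "[0-9]"]


def symbols(text):
    """Map a string to regex fragments: one per run of same-class characters."""
    out = []
    for ch in text:
        cls = next((rx for rx in CLASSES if re.fullmatch(rx, ch)), None)
        if cls is None:
            out.append(re.escape(ch))       # punctuation and spaces stay literal
        elif not out or out[-1] != cls + "+":
            out.append(cls + "+")           # start a new run of this class
    return out


def align(a, b):
    """Merge two symbol sequences; symbols present in only one become optional."""
    merged = []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            merged.extend(a[i1:i2])
        else:
            merged.extend(f"(?:{s})?" for s in a[i1:i2] + b[j1:j2])
    return merged


def learn_pattern(first, second):
    """Build one regular expression that covers both example entities."""
    return "".join(align(symbols(first), symbols(second)))


if __name__ == "__main__":
    pattern = learn_pattern("BA 117", "IB3254")
    for candidate in ["BA 117", "IB3254", "KL 1001", "klm"]:
        print(candidate, bool(re.fullmatch(pattern, candidate)))
```

A pattern built this way is deliberately over-general: unaligned symbols are simply made optional, so it may accept strings that neither example resembles. The sketch also handles only two examples at a time.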

Type: Articles
Information: Natural Language Engineering, Volume 24, Issue 1, January 2018, pp. 39-75

DOI: https://doi.org/10.1017/S135132491700016X
Copyright: Copyright © Cambridge University Press 2017

Footnotes

This work was partially supported by the Spanish Government through a Juan de la Cierva fellowship and project MDM-2015-0502. We especially thank Jorge Morato and Sonia Sánchez for their advice, as well as the anonymous reviewers for their suggestions.
