
A scalable architecture for data-intensive natural language processing

Published online by Cambridge University Press:  09 May 2017

ZUHAITZ BELOKI, XABIER ARTOLA and AITOR SOROA
Affiliation:
IXA NLP Group, University of the Basque Country (UPV/EHU), Donostia-San Sebastián

Abstract

Demand for computational power has grown considerably in recent years, and Natural Language Processing (NLP) is no exception: thousands of documents must be processed, i.e., linguistically analyzed, within a reasonable time frame. These computing needs have brought about a radical change in the computing architectures and large-scale text processing techniques used in NLP. In this paper, we present a scalable architecture for distributed language processing. The architecture uses Storm to combine diverse NLP modules into a processing chain, which carries out the linguistic analysis of documents. Scalability requires designing solutions that can run distributed programs in parallel across large machine clusters. With the architecture presented here, a set of third-party NLP modules can be integrated into a single processing chain and deployed onto a distributed environment, i.e., a cluster of machines, thus allowing the language-processing modules to run in parallel. No restrictions are placed a priori on the NLP modules, apart from being able to consume and produce linguistic annotations following a given format. We show the feasibility of our approach by integrating two linguistic processing chains, one for English and one for Spanish. Moreover, we provide several scripts for building a whole distributed architecture from scratch, which can then be easily installed and deployed onto a cluster of machines. The scripts and the NLP modules used in the paper are publicly available and distributed under free licenses. We also describe a series of experiments, carried out in the context of the NewsReader project, that test how the system behaves in different scenarios.
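The idea of a processing chain whose modules only share an annotation format can be illustrated with a minimal sketch. It is not the authors' implementation and does not use Storm: the `Document` class, the two toy modules, and the `process_corpus` helper are hypothetical names introduced here, and a local thread pool merely stands in for a cluster of machines.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Document:
    """Hypothetical stand-in for an annotated document: raw text plus
    named annotation layers that modules read from and add to."""
    text: str
    layers: Dict[str, list] = field(default_factory=dict)


# A module is any callable that consumes and produces an annotated document.
Module = Callable[[Document], Document]


def tokenizer(doc: Document) -> Document:
    # Toy module: adds a "tokens" layer by splitting on whitespace.
    doc.layers["tokens"] = doc.text.split()
    return doc


def capitalized_words(doc: Document) -> Document:
    # Toy module: reads the "tokens" layer written by the previous module
    # and adds an "entities" layer with the capitalized tokens.
    doc.layers["entities"] = [t for t in doc.layers["tokens"] if t[:1].isupper()]
    return doc


def run_chain(doc: Document, chain: List[Module]) -> Document:
    # Apply the modules in order; each one sees the layers added so far.
    for module in chain:
        doc = module(doc)
    return doc


def process_corpus(texts: List[str], chain: List[Module], workers: int = 4) -> List[Document]:
    # Documents are independent, so they can be analyzed in parallel.
    # Assumption: threads here are only a local stand-in for cluster nodes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda t: run_chain(Document(t), chain), texts))


CHAIN: List[Module] = [tokenizer, capitalized_words]
```

Calling `process_corpus(["Storm runs in Donostia"], CHAIN)` returns one `Document` whose layers hold the tokens and the capitalized words. Because the modules only agree on the layer names they read and write, a module could be swapped for a third-party tool without changing the rest of the chain.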

Type
Articles
Copyright
Copyright © Cambridge University Press 2017 


Footnotes

This work has been partially funded by the NewsReader (FP7-ICT-2011-8-316404) project. Zuhaitz Beloki’s work is funded by a PhD grant from the University of the Basque Country.
