DEXTER: A workbench for automatic term extraction with specialized corpora

Published online by Cambridge University Press: 05 October 2017

CARLOS PERIÑÁN-PASCUAL*
Affiliation:
Applied Linguistics Department, Universitat Politècnica de València, Paranimf, 1; 46730 Gandia, Valencia, Spain

Abstract

Automatic term extraction has become a priority area of research within corpus processing. Despite the extensive literature in this field, there are still some outstanding issues that should be dealt with during the construction of term extractors, particularly those oriented to support research in terminology and terminography. In this regard, this article describes the design and development of DEXTER, an online workbench for the extraction of simple and complex terms from domain-specific corpora in English, French, Italian and Spanish. In this framework, three issues contribute to placing the most important terms in the foreground. First, unlike the elaborate morphosyntactic patterns proposed by most previous research, shallow lexical filters have been constructed to discard term candidates. Second, a large number of common stopwords are automatically detected by means of a method that relies on the IATE database together with the frequency distribution of the domain-specific corpus and a general corpus. Third, the term-ranking metric, which is grounded on the notions of salience, relevance and cohesion, is guided by the IATE database to display an adequate distribution of terms.
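The second issue, detecting common stopwords by comparing the frequency distribution of the domain-specific corpus with that of a general corpus, can be illustrated with a minimal sketch. It is not DEXTER's actual method, which the abstract does not specify in detail. The function names, the `max_ratio` threshold and the `known_terms` set (a stand-in for a lookup in a terminology database such as IATE) are all assumptions made for illustration.

```python
from collections import Counter


def relative_frequencies(tokens):
    """Map each token to its share of the corpus (count / total tokens)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def stopword_candidates(domain_tokens, general_tokens, known_terms, max_ratio=1.5):
    """Return domain words that are not markedly more frequent in the domain
    corpus than in the general corpus and are not attested as terms.

    max_ratio and known_terms are hypothetical: the threshold is arbitrary and
    known_terms only stands in for a terminology-database lookup.
    """
    domain_rf = relative_frequencies(domain_tokens)
    general_rf = relative_frequencies(general_tokens)
    candidates = set()
    for word, rf in domain_rf.items():
        if word in known_terms:
            continue  # attested as a term: never treat as a stopword
        if word not in general_rf:
            continue  # unseen in the general corpus: possibly domain-specific
        if rf / general_rf[word] <= max_ratio:
            candidates.add(word)
    return candidates


# Toy corpora, far too small for real use; they only exercise the logic.
domain = "the voltage of the circuit and the current in the resistor".split()
general = "the cat and the dog sat on the mat of the house and the garden".split()
result = stopword_candidates(domain, general, known_terms={"voltage", "circuit"})
print(sorted(result))
```

In this sketch a word is flagged only when it occurs in both corpora at a comparable relative frequency, so words absent from the general corpus are left for the term-ranking stage rather than discarded.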

Type
Articles
Copyright
Copyright © Cambridge University Press 2017 

Footnotes

Financial support for this research has been provided by the DGI, Spanish Ministry of Education and Science, grant FFI2014-53788-C3-1-P.
