ELHISA: An architecture for the integration of heterogeneous lexical information

XABIER ARTOLA; AITOR SOROA

doi:10.1017/S1351324907004615

Published online by Cambridge University Press: 01 April 2008

XABIER ARTOLA: University of the Basque Country
AITOR SOROA: University of the Basque Country

Abstract

The design and construction of lexical resources is a critical issue in Natural Language Processing (NLP). Real-world NLP systems need large-scale lexica that provide rich information about words and word senses at all levels (morphological, syntactic, lexical-semantic, etc.), but constructing such resources is a difficult and costly task. The last decade has been strongly influenced by the notion of reusability, that is, the use of information from existing lexical resources in constructing new ones. It is unrealistic, however, to expect that the great variety of available lexical resources could be converted into a single, standard representation schema in the near future. The purpose of this article is to present the ELHISA system, a software architecture for the integration of heterogeneous lexical information. From the point of view of the information integration area, we address the problem of querying very different existing lexical information sources using a single, common query language. Integration in ELHISA is performed logically, so the lexical resources are not modified when they are integrated into the system. ELHISA is primarily defined as a consultation system for accessing structured lexical information, and therefore it cannot modify or update the underlying information. To support this integration, a General Conceptual Model (GCM) for describing diverse lexical data has been conceived. The GCM establishes a fixed vocabulary describing objects in the lexical information domain, their attributes, and the relationships among them. To integrate the lexical resources into the federation, a Source Conceptual Model (SCM) is built on top of each one, representing the lexical objects that occur in that particular source.
To answer user queries, ELHISA must access the integrated resources and hence must translate each query expressed in GCM terms into queries formulated in terms of the SCM of each source. The relation between the GCM and the SCMs is explicitly described by mapping rules called Content Description Rules. Data integration at the extensional level is achieved through a data cleansing process, which is needed to compare data arriving from different sources and which includes the object identification step. Based on this architecture, a prototype named ELHISA has been built, and five resources covering a broad scope have been integrated into it so far for testing purposes. In the authors' opinion, the ease with which such heterogeneous resources were integrated shows the suitability of the approach taken.
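
The query flow described in the abstract can be illustrated with a small sketch. This is not the ELHISA implementation: the class and function names (`Source`, `federated_query`, `normalize`), the attribute names, and the toy records are all assumptions made for illustration. A plain dictionary stands in for the Content Description Rules, and lower-casing the lemma stands in for data cleansing and object identification.

```python
from dataclasses import dataclass


def normalize(value: str) -> str:
    """Toy data-cleansing step: trim whitespace and lower-case."""
    return value.strip().lower()


@dataclass
class Source:
    """A read-only lexical source with its own (hypothetical) schema."""
    name: str
    rules: dict[str, str]          # GCM attribute -> source attribute
    records: list[dict[str, str]]  # source data, never modified

    def translate(self, gcm_query: dict[str, str]):
        """Rewrite a GCM query in source terms, or None if not answerable."""
        if not set(gcm_query) <= set(self.rules):
            return None
        return {self.rules[attr]: value for attr, value in gcm_query.items()}

    def answer(self, gcm_query: dict[str, str]) -> list[dict[str, str]]:
        """Run the translated query and return the hits in GCM terms."""
        local_query = self.translate(gcm_query)
        if local_query is None:
            return []
        inverse = {src: gcm for gcm, src in self.rules.items()}
        hits = [
            record for record in self.records
            if all(normalize(record.get(key, "")) == normalize(value)
                   for key, value in local_query.items())
        ]
        return [{inverse[k]: v for k, v in record.items() if k in inverse}
                for record in hits]


def federated_query(sources: list[Source], gcm_query: dict[str, str]):
    """Query every source and merge answers that denote the same object."""
    merged: dict[str, dict[str, str]] = {}
    for source in sources:
        for obj in source.answer(gcm_query):
            # Toy object identification: same normalized lemma, same object.
            key = normalize(obj.get("lemma", ""))
            entry = merged.setdefault(key, {"lemma": key})
            for attr, value in obj.items():
                if attr != "lemma":
                    entry.setdefault(attr, value)
    return merged


# Two invented sources with different schemas for the same lexical data.
dictionary = Source(
    name="dictionary",
    rules={"lemma": "headword", "definition": "gloss"},
    records=[{"headword": "Etxe", "gloss": "house"}],
)
morph_db = Source(
    name="morph_db",
    rules={"lemma": "form", "pos": "cat"},
    records=[{"form": "etxe", "cat": "noun"}],
)

by_lemma = federated_query([dictionary, morph_db], {"lemma": "etxe"})
by_pos = federated_query([dictionary, morph_db], {"pos": "noun"})
```

In this sketch a query on `lemma` reaches both sources and the two answers are merged into one object, while a query on `pos` is answered only by the source whose rules mention that attribute. The sources themselves are never written to, which mirrors the consultation-only design stated in the abstract.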

Type: Papers
Information: Natural Language Engineering, Volume 14, Issue 2, April 2008, pp. 253–281

DOI: https://doi.org/10.1017/S1351324907004615
Copyright: Copyright © Cambridge University Press 2007

