
ELHISA: An architecture for the integration of heterogeneous lexical information

Published online by Cambridge University Press:  01 April 2008

XABIER ARTOLA
Affiliation: University of the Basque Country
AITOR SOROA
Affiliation: University of the Basque Country

Abstract

The design and construction of lexical resources is a critical issue in Natural Language Processing (NLP). Real-world NLP systems need large-scale lexica that provide rich information about words and word senses at all levels (morphological, syntactic, lexical-semantic, etc.), but constructing such resources is difficult and costly. The last decade has been strongly influenced by the notion of reusability, that is, the use of information from existing lexical resources to construct new ones. It is unrealistic, however, to expect that the great variety of available lexical resources could be converted into a single, standard representation schema in the near future. The purpose of this article is to present ELHISA, a software architecture for the integration of heterogeneous lexical information. From the point of view of the information integration field, we address the problem of querying very different existing lexical information sources through a single, common query language. Integration in ELHISA is performed at the logical level, so the lexical resources are not modified when they are integrated into the system. ELHISA is primarily defined as a consultation system for accessing structured lexical information, and therefore it cannot modify or update the underlying information. To support the integration, a General Conceptual Model (GCM) for describing diverse lexical data has been conceived. The GCM establishes a fixed vocabulary describing objects in the lexical information domain, their attributes, and the relationships among them. To integrate the lexical resources into the federation, a Source Conceptual Model (SCM) is built on top of each one, representing the lexical objects that occur in that particular source.
To answer user queries, ELHISA must access the integrated resources; hence, it must translate each query expressed in GCM terms into queries formulated in terms of the SCM of each source. The relation between the GCM and the SCMs is described explicitly by mapping rules called Content Description Rules. Data integration at the extensional level is achieved through a data cleansing process, which is needed to compare data arriving from different sources and which includes the object identification step. Based on this architecture, a prototype, also named ELHISA, has been built, and five resources covering a broad scope have been integrated into it so far for testing purposes. In the authors' opinion, the ease with which such heterogeneous resources have been integrated into the system shows the suitability of the approach taken.
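
The mediation idea summarised in the abstract can be illustrated with a minimal Python sketch. It is not the ELHISA implementation. The `Source` class, the dictionary-based mapping (a stand-in for Content Description Rules, whose actual formalism is not given here), the attribute names, and the sample records are all assumptions made for illustration. A query stated in shared (GCM-like) attribute names is rewritten per source, evaluated without modifying the source, and the results are merged by a deliberately naive object identification step that treats records with the same normalised lemma as the same object.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Source:
    """One federated resource: its data plus a hypothetical mapping rule set."""
    name: str
    mapping: dict   # GCM attribute -> attribute name in this source's SCM
    records: tuple  # raw records, keyed by the source's own attribute names


def translate(query: dict, source: Source) -> Optional[dict]:
    """Rewrite a GCM-level query in the source's terms, or None if it cannot answer."""
    if any(attr not in source.mapping for attr in query):
        return None
    return {source.mapping[attr]: value for attr, value in query.items()}


def run(source: Source, scm_query: dict) -> list:
    """Evaluate the translated query and lift the hits back into GCM terms."""
    to_gcm = {scm: gcm for gcm, scm in source.mapping.items()}
    hits = [r for r in source.records
            if all(r.get(k) == v for k, v in scm_query.items())]
    return [{to_gcm[k]: v for k, v in r.items() if k in to_gcm} for r in hits]


def answer(query: dict, sources: list) -> list:
    """Query every source that can answer, then merge records naming the same lemma."""
    merged = {}
    for source in sources:
        scm_query = translate(query, source)
        if scm_query is None:
            continue  # this source has no mapping for some queried attribute
        for record in run(source, scm_query):
            if "lemma" not in record:
                continue  # nothing to identify the object by
            key = record["lemma"].strip().casefold()  # naive object identification
            merged.setdefault(key, {}).update(record)
    return list(merged.values())


# Invented sample data: two sources with different attribute names.
SOURCES = [
    Source("dictionary", {"lemma": "headword", "definition": "gloss"},
           ({"headword": "etxe", "gloss": "house"},)),
    Source("morph_lexicon", {"lemma": "entry", "pos": "category"},
           ({"entry": "etxe", "category": "noun"},
            {"entry": "ibili", "category": "verb"})),
]

print(answer({"lemma": "etxe"}, SOURCES))
print(answer({"pos": "verb"}, SOURCES))
```

In this sketch the first query is answered by both sources and yields one merged record, while the second is answered only by the source whose mapping covers `pos`. A real system would need far richer mapping rules and a more careful identification step than string normalisation.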

Type: Papers
Copyright © Cambridge University Press 2007

