
Using Latent Semantic Analysis and the Predication Algorithm to Improve Extraction of Meanings from a Diagnostic Corpus

Published online by Cambridge University Press:  10 January 2013

Guillermo Jorge-Botana
Affiliation:
Universidad Autónoma de Madrid (Spain)
Ricardo Olmos
Affiliation:
Universidad Autónoma de Madrid (Spain)
José Antonio León*
Affiliation:
Universidad Autónoma de Madrid (Spain)
* Correspondence concerning this article should be addressed to José Antonio León, Departamento de Psicología Básica, Facultad de Psicología, Universidad Autónoma de Madrid, Campus de Cantoblanco, 28049 Madrid (Spain). Phone: +34-914975226. Fax: +34-914975215. E-mail: [email protected].

Abstract

There is currently widespread interest in indexing and extracting taxonomic information from large text collections. One example is the automatic categorization of informally written medical or psychological diagnoses, followed by the extraction of epidemiological information or even of the terms and structures needed to formulate guiding questions as a heuristic tool for helping doctors. Vector space models have been used successfully to this end (Lee, Cimino, Zhu, Sable, Shanker, Ely & Yu, 2006; Pakhomov, Buntrock & Chute, 2006). In this study we apply a computational model known as Latent Semantic Analysis (LSA) to a diagnostic corpus with the aim of retrieving definitions (in the form of lists of semantic neighbors) of common structures it contains (e.g. “storm phobia”, “dog phobia”) or of less common structures that might be formed by logical combinations of categories and diagnostic symptoms (e.g. “gun personality” or “germ personality”). When vector space models are used to recover content, various problems commonly arise in bringing definitions into line with the meaning of structures and making them representative. We propose some approaches that bypass these problems, such as Kintsch's (2001) predication algorithm and some corrections to the way lists of neighbors are obtained, which have already been tested on semantic spaces in a non-specific domain (Jorge-Botana, León, Olmos & Hassan-Montero, under review). The results support the idea that the predication algorithm may also be useful for extracting more precise meanings of certain structures from scientific corpora, and that introducing some corrections based on vector length may increase its efficiency on non-representative terms.

There is currently broad interest in indexing and extracting information from large banks of taxonomic texts. One example is the automatic categorization of informally written medical or psychological diagnoses and the subsequent extraction of epidemiological information, or even the extraction of terms and structures for creating guiding questions that heuristically assist doctors in their search for information. Vector space models have been used successfully for these purposes (Lee, Cimino, Zhu, Sable, Shanker, Ely, & Yu, 2006; Pakhomov, Buntrock, & Chute, 2006). In this study we apply a computational model known as Latent Semantic Analysis (LSA) to a diagnostic corpus in order to retrieve definitions (in the form of lists of semantic neighbors) of structures that are common in it (e.g., “storm phobia”, “dog phobia”) or of less common structures that can nevertheless be formed by logical combinations of diagnostic categories and symptoms (e.g., “gun personality” or “germ personality”). To ensure that the definitions fit the meaning of the structures and are minimally representative, we discuss some problems that tend to arise when retrieving content with vector space models and propose some ways of avoiding them, such as Kintsch's (2001) predication algorithm and some corrections to the way lists of neighbors are extracted, already tried on general-domain semantic spaces (Jorge-Botana, León, Olmos & Hassan-Montero, in review). The results support the idea that the predication algorithm may also be useful for extracting more precise senses of certain structures from scientific corpora, and that introducing some corrections based on vector length can increase its effectiveness for poorly represented terms.
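The procedure the abstract describes (semantic neighbors in an LSA space, plus predication for structures such as “storm phobia”) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the toy term-by-document counts, the function names, the parameters `m` and `k`, and in particular the `length_weighted` option, which is only one hypothetical way of using vector length and is not the correction the authors tested. The predication step is the commonly used simplification of Kintsch (2001): take the `m` nearest neighbors of the predicate, keep the `k` most similar to the argument, and sum their vectors with those of the predicate and the argument.

```python
import numpy as np

def build_lsa_space(counts, dims):
    """Truncated SVD of a term-by-document matrix; each row is a term vector."""
    u, s, _ = np.linalg.svd(counts, full_matrices=False)
    return u[:, :dims] * s[:dims]

def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector has zero length."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def neighbors(vec, space, terms, n, length_weighted=False):
    """Return the n (term, score) pairs closest to vec by cosine.

    length_weighted is a hypothetical correction (an assumption, not the
    authors' formula): the cosine is scaled by log(1 + vector length) so that
    short, poorly represented term vectors rank lower.
    """
    scored = []
    for term, tvec in zip(terms, space):
        score = cosine(vec, tvec)
        if length_weighted:
            score *= float(np.log1p(np.linalg.norm(tvec)))
        scored.append((term, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]

def predication(pred, arg, space, terms, m=5, k=2):
    """Simplified predication: combine predicate, argument and k selected neighbors."""
    p = space[terms.index(pred)]
    a = space[terms.index(arg)]
    # The m nearest neighbors of the predicate, excluding the two input terms.
    ranked = [t for t, _ in neighbors(p, space, terms, len(terms))]
    candidates = [t for t in ranked if t not in (pred, arg)][:m]
    # Of those, keep the k that are most similar to the argument.
    candidates.sort(key=lambda t: cosine(space[terms.index(t)], a), reverse=True)
    chosen = candidates[:k]
    vec = p + a + sum(space[terms.index(t)] for t in chosen)
    return vec, chosen

# Toy corpus (invented): rows are terms, columns are documents.
terms = ["phobia", "storm", "dog", "fear", "thunder", "bite", "anxiety"]
counts = np.array([
    [1, 1, 1, 1, 0],  # phobia
    [1, 1, 0, 0, 0],  # storm
    [0, 0, 1, 1, 0],  # dog
    [1, 0, 1, 0, 1],  # fear
    [1, 1, 0, 0, 0],  # thunder
    [0, 0, 1, 1, 0],  # bite
    [0, 1, 0, 1, 1],  # anxiety
], dtype=float)
space = build_lsa_space(counts, dims=3)

storm_phobia, used = predication("phobia", "storm", space, terms)
print("neighbors used for 'storm phobia':", used)
print(neighbors(storm_phobia, space, terms, n=4, length_weighted=True))
```

In this toy space the predicate “phobia” is pulled toward the sense that matches its argument, which is the effect predication is meant to achieve; a real diagnostic corpus would need a far larger matrix, a weighting scheme, and several hundred dimensions.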

Type
Research Article
Copyright
Copyright © Cambridge University Press 2009


References

Blackmon, M. H., Polson, P. G., Kitajima, M., & Lewis, C. (2002). Cognitive Walkthrough for the Web. In CHI 2002: Proceedings of the conference on Human Factors in Computing Systems (pp. 463–470).
Blackmon, M. H. (2004). Cognitive Walkthrough. In Bainbridge, W. S. (Ed.), Encyclopedia of Human-Computer Interaction, 2 volumes. Great Barrington, MA: Berkshire Publishing.
Burek, G., Vargas-Vera, M., & Moreale, E. (2004). Document retrieval based on intelligent query formulation. Techreport ID: kmi-04-13 [Previously known as KMI-TR-148].
Burgess, C. (2000). Theory and operational definitions in computational memory models: A response to Glenberg and Robertson. Journal of Memory and Language, 43, 402–408.
Cederberg, S., & Widdows, D. (2003). Using LSA and noun coordination information to improve the precision and recall of automatic hyponymy extraction. Human Language Technology Conference archive. In Proceedings of the seventh conference on Natural language learning at HLT-NAACL. Edmonton, Canada, 4.
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing By Latent Semantic Analysis. Journal of the American Society For Information Science, 41, 391–407.
Denhière, G., Lemaire, B., Bellissens, C., & Jhean-Larose, S. (2007). A Semantic Space Modelling Children's Semantic Memory. In Landauer, T. K., McNamara, D., Dennis, S., & Kintsch, W. (Eds.), The handbook of Latent Semantic Analysis (pp. 143–167). Mahwah, NJ: Erlbaum.
Dumais, S. (2003). Data-Driven approaches to information access. Cognitive Science, 2, 491–524.
Glenberg, A. M., & Robertson, D. A. (2000). Symbol grounding and meaning: A comparison of high-dimensional and embodied theories of meaning. Journal of Memory and Language, 43(3), 379–401.
Jorge-Botana, G., León, J. A., Olmos, R., & Hassan-Montero, Y. (under review). Visualizing polysemic structures using LSA and the predication algorithm. Journal of the American Society for Information Science and Technology.
Juvina, I., & van Oostendorp, H. (2005). Bringing cognitive models into the domain of web accessibility. In Proceedings of the HCII2005 Conference, Las Vegas, USA.
Juvina, I., van Oostendorp, H., Karbor, P., & Pauw, B. (2005). Towards modeling contextual information in web navigation. In Bara, B. G., Barsalou, L., & Bucciarelli, M. (Eds.), Proceedings of the 27th Annual Meeting of the Cognitive Science Society, CogSci2005 (pp. 1078–1083). Austin, Texas: The Cognitive Science Society, Inc.
Kintsch, W. (1998). Comprehension: A paradigm for cognition. New York: Cambridge University Press.
Kintsch, W. (2000). Metaphor comprehension: A computational theory. Psychonomic Bulletin and Review, 7, 257–266.
Kintsch, W. (2001). Predication. Cognitive Science, 25, 173–202.
Kintsch, W. (2002). On the notion of theme and topic in psychological process models of text comprehension. In Louwerse, M., & Peer, W. van (Eds.), Thematics, Interdisciplinary Studies (pp. 157–170). Amsterdam: John Benjamins B.V.
Kintsch, W., & Bowles, A. (2002). Metaphor comprehension: What makes a metaphor difficult to understand? Metaphor and Symbol, 17, 249–262.
Kurby, C. A., Wiemer-Hastings, K., Ganduri, N., Magliano, J. P., Millis, K. K., & McNamara, D. S. (2003). Computerizing reading training: Evaluation of a latent semantic analysis space for science text. Behavior Research Methods, Instruments & Computers, 35, 244–250.
Landauer, T. K. (2002). On the computational basis of learning and cognition: Arguments from LSA. In Ross, N. (Ed.), The Psychology of Learning and Motivation: Advances in research and theory (pp. 43–84). San Diego: Academic Press.
Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato's problem: The Latent Semantic Analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review, 104, 211–240.
Landauer, T. K., Foltz, P. W., & Laham, D. (1998). An introduction to latent semantic analysis. Discourse Processes, 25, 259–284.
Lemaire, B., & Denhière, G. (2006). Effects of High-Order Co-occurrences on Word Semantic Similarity. Current Psychology Letters, 18, 1.
Lee, M., Cimino, J., Zhu, H., Sable, C., Shanker, V., Ely, J., et al. (2006). Beyond information retrieval – Medical question answering. In Proceedings of the American Medical Informatics Association. Washington DC, USA.
Lemaire, B., Denhière, G., Bellissens, C., & Jhean-Larose, S. (2006). A Computational Model for Simulating Text Comprehension. Behavior Research Methods, 38(4), 628–637.
Mandl, T. (1999). Efficient Preprocessing for Information Retrieval with Neural Networks. In Zimmermann, H.-J. (Ed.), Proceedings of the EUFIT '99. 7th European Congress on Intelligent Techniques and Soft Computing. Aachen, Germany, 13.
Mill, W., & Kontostathis, A. (2004). Analysis of the values in the LSI term-term matrix. Technical Report. http://webpages.ursinus.edu/akontostathis/MillPaper.pdf
Nakov, P., Popova, A., & Mateev, P. (2001). Weight functions impact on LSA performance. In Proceedings of the EuroConference RANLP'2001 (Recent Advances in NLP). Tzigov Chark, Bulgaria, 187–193.
Pakhomov, S., Buntrock, J. D., & Chute, C. G. (2006). Automating the assignment of diagnosis codes to patient encounters using example-based and machine learning techniques. Journal of the American Medical Informatics Association, 13(5), 516–525.
Quesada, J. (2007). Creating Your Own LSA Spaces. In Landauer, T. K., McNamara, D., Dennis, S., & Kintsch, W. (Eds.), The handbook of Latent Semantic Analysis (pp. 71–88). Mahwah, NJ: Erlbaum.
Quesada, J. F., Kintsch, W., & Gomez-Milán, E. (2001). A Computational Theory of Complex Problem Solving Using the Vector Space Model (part II): Latent Semantic Analysis Applied to Empirical Results from Adaptation Experiments. In Cañas (Ed.), Cognitive research with Microworlds (pp. 147–158).
Rehder, B., Schreiner, M. E., Wolfe, M. B., Laham, D., Landauer, T. K., & Kintsch, W. (1998). Using Latent Semantic Analysis to assess knowledge: Some technical considerations. Discourse Processes, 25, 337–354.
Rosch, E., & Mervis, C. B. (1975). Family resemblances: Studies in the internal structures of categories. Cognitive Psychology, 7, 573–605.
Rumelhart, D. E., & McClelland, J. L. (1992). Introducción al procesamiento distribuido en paralelo. Madrid: Alianza Editorial.
Chen, Rung-Ching, Lee, Ya-Ching, & Pan, Ren-Hao (2006). Adding New Concepts On The Domain Ontology Based on Semantic Similarity. In Proceedings of the International Conference on Business and Information. July 12–14, 2006, Singapore.
Skoyles, J. R. (1999). Autistic language abnormality: Is it a second-order context learning defect?: The view from Latent Semantic Analysis. In Barriere, I., Chiat, Morgan S. G., & Woll, B. (Eds.), Proceedings of Child Language Seminar. London, pp 1.
Seidenberg, M. S., & McClelland, J. L. (1989). A Distributed, Developmental Model of Word Recognition and Naming. Psychological Review, 96, 523–568.
Serafin, R., & Di Eugenio, B. (2003). FLSA: Extending Latent Semantic Analysis with features for dialogue act classification. In Proceedings of ACL04, 42nd Annual Meeting of the Association for Computational Linguistics. Barcelona, Spain, July (pp 692-es).
Schunn, C. D. (1999). The presence and absence of category knowledge in LSA. In Proceedings of the 21st Annual Conference of the Cognitive Science Society. Mahwah, NJ: Erlbaum.
Turney, P. (2001). Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL. In De Raedt, L., & Flach, P. (Eds.), Proceedings of the Twelfth European Conference on Machine Learning (ECML-2001), Freiburg, Germany (pp. 491–502).
Wiemer-Hastings, P., Wiemer-Hastings, K., & Graesser, A. (1999). Improving an intelligent tutor's comprehension of students with Latent Semantic Analysis. In Lajoie, S. P., & Vivet, M. (Eds.), Artificial Intelligence in Education (pp. 535–542). Amsterdam: IOS Press.
Wiemer-Hastings, P. (2000). Adding syntactic information to LSA. In Proceedings of the 22nd Annual Conference of the Cognitive Science Society (pp. 989–993). Mahwah, NJ: Erlbaum.
Wiemer-Hastings, P., & Zipitria, I. (2001). Rules for syntax, vectors for semantics. In Proceedings of the 23rd Cognitive Science Conference. Mahwah, NJ: Lawrence Erlbaum Associates.
Wild, F., Stahl, C., Stermsek, G., & Neumann, G. (2005). Parameters Driving Effectiveness of Automated Essay Scoring with LSA. In Proceedings of the 9th International Computer Assisted Assessment Conference. Loughborough, UK (pp. 485–494).