Densification: Semantic document analysis using Wikipedia

IUSTIN DORNESCU; CONSTANTIN ORĂSAN

doi:10.1017/S1351324913000296

Published online by Cambridge University Press: 14 October 2013

IUSTIN DORNESCU: Affiliation:
Research Institute in Information and Language Processing, University of Wolverhampton, Wolverhampton, UK
CONSTANTIN ORĂSAN: Affiliation:
Research Institute in Information and Language Processing, University of Wolverhampton, Wolverhampton, UK

Abstract

This paper proposes a new method for semantic document analysis, densification, which identifies and ranks Wikipedia pages relevant to a given document. Although it resembles established tasks such as wikification and entity linking, the method does not aim for strict disambiguation of named entity mentions. Instead, densification uses a document's existing links to rank additional articles that are relevant to it. This is a form of explicit semantic indexing that enables higher-level semantic retrieval procedures, which can benefit a wide range of NLP applications. Because no gold standard exists for evaluating densification, a study is carried out to investigate the level of agreement achievable by humans; its results call into question the feasibility of creating an annotated data set. As a result, a semi-supervised approach is employed to develop a two-stage densification system, which first filters out unlikely candidate links and then ranks the remaining ones. In a first evaluation experiment, Wikipedia articles are used to automatically estimate performance in terms of recall. Results show that the proposed densification approach outperforms several wikification systems. A second experiment measures the impact of integrating the links predicted by the densification system into a semantic question answering (QA) system that relies on Wikipedia links to answer complex questions. Densification enables the QA system to find twice as many additional answers as it does when using a state-of-the-art wikification system.
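
The filter-then-rank idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' system: the in-memory link graph `outlinks`, the `min_support` threshold, the support-count ranking and the function names `densify` and `recall` are all assumptions made purely for illustration, and the abstract does not specify the actual features or models used.

```python
from collections import Counter


def densify(seed_links, outlinks, min_support=2, top_k=5):
    """Hypothetical two-stage sketch: filter candidates, then rank them.

    seed_links: set of Wikipedia titles already linked from the document.
    outlinks:   dict mapping a title to the set of titles it links to
                (an assumed toy link graph, not a real Wikipedia dump).
    """
    # Candidate generation: articles linked from any seed article.
    support = Counter()
    for seed in seed_links:
        for target in outlinks.get(seed, set()):
            if target not in seed_links:
                support[target] += 1

    # Stage 1 (filter): drop candidates supported by too few seed articles.
    candidates = [t for t, n in support.items() if n >= min_support]

    # Stage 2 (rank): most-supported first, ties broken alphabetically.
    ranked = sorted(candidates, key=lambda t: (-support[t], t))
    return ranked[:top_k]


def recall(predicted, gold):
    """Fraction of gold links that appear among the predicted links."""
    gold = set(gold)
    if not gold:
        return 0.0
    return len(gold & set(predicted)) / len(gold)


# Toy data, invented for this example only.
outlinks = {
    "Wolverhampton": {"England", "West Midlands", "University of Wolverhampton"},
    "Birmingham": {"England", "West Midlands"},
    "Coventry": {"England", "Warwickshire"},
}
seeds = {"Wolverhampton", "Birmingham", "Coventry"}

predicted = densify(seeds, outlinks)
score = recall(predicted, ["England", "West Midlands", "Warwickshire"])
```

With this toy graph, `England` is linked from three seed articles and `West Midlands` from two, so both survive the filter, while the candidates linked from only one seed are dropped. The `recall` helper reflects one plausible reading of the recall-based evaluation, in which links known to be relevant are compared against the predicted ones.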

Type: Articles
Information: Natural Language Engineering, Volume 20, Issue 4, October 2014, pp. 469–500

DOI: https://doi.org/10.1017/S1351324913000296
Copyright: © Cambridge University Press 2013
