
Automatic summarisation of discussion fora

Published online by Cambridge University Press: 24 March 2010

ALMER S. TIGELAAR
RIEKS OP DEN AKKER
DJOERD HIEMSTRA
Affiliation:
Database and Human Media Interaction Groups, University of Twente, The Netherlands

Abstract

Web-based discussion fora proliferate on the Internet. These fora consist of threads about specific matters. Existing forum search facilities provide an easy way to find threads of interest. However, understanding the content of threads is not always trivial. This problem becomes more pressing as threads become longer. It frustrates users who are looking for specific information and also makes it more difficult to make valuable contributions to a discussion. We postulate that having a concise summary of a thread would greatly help forum users. But how would we best create such summaries? In this paper, we present an automated method of summarising threads in discussion fora. Compared with summarisation of unstructured texts and spoken dialogues, the structural characteristics of threads give important advantages. We studied how best to exploit these characteristics. Messages in threads are structured and contain both explicit and implicit references to each other. Therefore, we term the threads hierarchical dialogues. Our proposed summarisation algorithm produces one summary of a hierarchical dialogue by ‘cherry-picking’ sentences out of the original messages that make up a thread. We try to select sentences usable for obtaining an overview of the discussion. Our method is built around a set of heuristics based on observations of real forum discussions. The data used for this research was in Dutch, but the developed method applies equally to other languages. We evaluated our approach using a prototype. Users judged our summariser as very useful, half of them indicating they would use it regularly or always when visiting fora.
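To make the idea of extractive, structure-aware thread summarisation concrete, the following is a minimal sketch and not the method of the paper. The abstract does not list the actual heuristics, so the three used here (thread-wide term frequency, a bonus per direct reply, and a bonus for the opening post) are illustrative assumptions, as are the names `Message`, `split_sentences`, `tokenize` and `summarise_thread`.

```python
import re
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Message:
    """One post in a thread (hypothetical structure for this sketch)."""
    msg_id: int
    parent_id: Optional[int]  # None marks the opening post
    text: str


def split_sentences(text: str) -> List[str]:
    # Naive split after '.', '!' or '?'; a real system would use a trained splitter.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence: str) -> List[str]:
    return re.findall(r"\w+", sentence.lower())


def summarise_thread(messages: List[Message], max_sentences: int = 3) -> List[str]:
    """Pick the highest-scoring sentences and return them in thread order."""
    # Assumed heuristic: messages that attract direct replies matter more.
    reply_counts = Counter(m.parent_id for m in messages if m.parent_id is not None)
    # Assumed heuristic: words frequent across the thread signal its topic.
    term_freq = Counter(
        tok
        for m in messages
        for s in split_sentences(m.text)
        for tok in tokenize(s)
    )

    scored = []  # (score, position in thread, sentence)
    position = 0
    for m in messages:
        for sentence in split_sentences(m.text):
            tokens = tokenize(sentence)
            if tokens:
                score = sum(term_freq[t] for t in tokens) / len(tokens)
                score += reply_counts[m.msg_id]
                if m.parent_id is None:
                    score += 1.0  # assumed bonus for the opening post
                scored.append((score, position, sentence))
            position += 1

    # Take the best sentences, then restore their original order.
    top = sorted(scored, key=lambda item: (-item[0], item[1]))[:max_sentences]
    return [sentence for _, _, sentence in sorted(top, key=lambda item: item[1])]
```

Calling `summarise_thread(thread, max_sentences=2)` on a list of `Message` objects returns at most two sentences copied verbatim from the thread. The weights are arbitrary and would need tuning against real forum data.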

Type
Papers
Copyright
Copyright © Cambridge University Press 2010

