Using patterns of thematic progression for building a table of contents of a text

Published online by Cambridge University Press: 01 April 2008

MARIE-FRANCINE MOENS
Affiliation:
Interdisciplinary Centre for Law and Information Technology, Katholieke Universiteit Leuven, Tienstraat 41, B-3000 Leuven, Belgium

Abstract

A text usually contains one or a few main topics, which are split into subtopics, which in turn can be further described by more detailed topics. In this article we describe a system that segments a text into topics and subtopics. Each segment is characterized by important key terms extracted from it and by its begin and end positions in the text. A table of contents is built from the hierarchical and sequential relationships between the topical segments identified in the text. The table of contents generator relies on universal linguistic theories of the topic and comment of a sentence and on patterns of thematic progression in text. The linguistic theories of topic and comment are modeled both deterministically and probabilistically. The system is applied to English texts (news, World Wide Web and encyclopedia texts) and is evaluated.
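The output described in the abstract (segments with key terms and begin and end positions, arranged hierarchically and sequentially) can be pictured with a minimal sketch. This is not the article's method: it assumes the topical segments have already been detected, and it only illustrates how such segments could be nested by span containment and printed as a table of contents. The names `Segment`, `build_hierarchy`, and `render_toc` are hypothetical, as is the choice of character offsets for positions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Segment:
    """Hypothetical topical segment: key terms plus offsets into the text."""
    key_terms: List[str]
    start: int  # begin position in the text (inclusive), assumed character offset
    end: int    # end position in the text (exclusive)
    children: List["Segment"] = field(default_factory=list)


def build_hierarchy(segments: List[Segment]) -> List[Segment]:
    """Nest each segment under the innermost earlier segment whose span contains it."""
    roots: List[Segment] = []
    stack: List[Segment] = []
    # Sort by start position; at equal starts, longer spans come first,
    # so a topic is visited before its subtopics.
    for seg in sorted(segments, key=lambda s: (s.start, -s.end)):
        # Pop stack entries whose span does not contain this segment.
        while stack and not (stack[-1].start <= seg.start and seg.end <= stack[-1].end):
            stack.pop()
        (stack[-1].children if stack else roots).append(seg)
        stack.append(seg)
    return roots


def render_toc(roots: List[Segment], depth: int = 0) -> List[str]:
    """Return one indented line per segment, in text order."""
    lines: List[str] = []
    for seg in roots:
        label = ", ".join(seg.key_terms)
        lines.append("  " * depth + f"{label} [{seg.start}-{seg.end}]")
        lines.extend(render_toc(seg.children, depth + 1))
    return lines


if __name__ == "__main__":
    # Made-up segments for illustration only.
    demo = [
        Segment(["energy"], 0, 100),
        Segment(["solar"], 0, 40),
        Segment(["wind"], 40, 100),
        Segment(["turbines"], 50, 70),
    ]
    print("\n".join(render_toc(build_hierarchy(demo))))
```

Sorting by start position preserves the sequential order of topics, while the containment stack recovers the hierarchical relationships. How the segments and their key terms are actually identified is the subject of the article itself and is not modeled here.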

Type
Papers
Copyright
Copyright © Cambridge University Press 2007
