Discourse analysis based segregation of relevant document segments for knowledge acquisition

Published online by Cambridge University Press:  04 October 2016

N. Madhusudanan*
Affiliation:
Virtual Reality Laboratory, Centre for Product Design and Manufacturing, Indian Institute of Science, Bangalore, India
Amaresh Chakrabarti
Affiliation:
Virtual Reality Laboratory, Centre for Product Design and Manufacturing, Indian Institute of Science, Bangalore, India
B. Gurumoorthy
Affiliation:
Virtual Reality Laboratory, Centre for Product Design and Manufacturing, Indian Institute of Science, Bangalore, India
* Reprint requests to: N. Madhusudanan, Virtual Reality Laboratory, Centre for Product Design and Manufacturing, Indian Institute of Science, Bangalore 560 012, India.

Abstract

Documents are a useful source of expert knowledge in organizations. They can be used at an earlier stage of a product's life cycle to foresee potential issues, and their solutions, that might arise in later stages. In this research, the earlier and later stages are design and assembly, respectively. Even when these documents are available online, it is difficult for users to access the knowledge they contain. It is therefore desirable to extract this knowledge automatically and store it in a form that a computer can access and manipulate. This paper describes an approach for the first step in this acquisition process: automatically identifying segments of documents that are relevant to aircraft assembly, so that they can be further processed for acquiring expert knowledge. Identifying relevant segments avoids processing unrelated information, which is costly and may detract from domain relevance. The approach to extracting relevant segments has two steps. The first step identifies sentences that form a coherent segment of text, within which the topic does not shift. The second step identifies the segments that fall within the topics of interest for knowledge acquisition, that is, aircraft assembly in this instance. Together, these steps filter out unrelated segments, which need not be processed for subsequent knowledge acquisition. Both steps rely on understanding the contents of documents. Using methods of discourse analysis, in particular discourse representation theory, a list of discourse entities is obtained. The difference in discourse entities between sentences is used to distinguish between segments. The list of discourse entities in a segment is then compared against a domain ontology for classification. The implementation of these steps and the results of validating them on sample texts are described.
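The two steps can be illustrated with a minimal sketch in Python. It is not the authors' implementation. It assumes that the discourse entities of each sentence have already been extracted (for example, by a discourse representation theory parser) and are supplied as sets of strings. The Jaccard overlap measure, the two thresholds, the flat set of ontology terms, and the function names `segment_by_entities` and `is_relevant` are all assumptions made for illustration only.

```python
from typing import List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap of two entity sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def segment_by_entities(sentence_entities: List[Set[str]],
                        min_overlap: float = 0.2) -> List[List[int]]:
    """Group consecutive sentence indices into segments.

    A new segment starts when the entity overlap between a sentence and
    the previous one drops below min_overlap (an assumed threshold).
    """
    segments: List[List[int]] = []
    for idx, entities in enumerate(sentence_entities):
        if idx > 0 and jaccard(sentence_entities[idx - 1], entities) >= min_overlap:
            segments[-1].append(idx)   # topic continues
        else:
            segments.append([idx])     # first sentence or topic shift
    return segments


def is_relevant(segment_entities: Set[str],
                ontology_terms: Set[str],
                min_fraction: float = 0.5) -> bool:
    """Classify a segment as relevant when enough of its entities
    appear among the domain ontology terms (assumed criterion)."""
    if not segment_entities:
        return False
    matched = len(segment_entities & ontology_terms)
    return matched / len(segment_entities) >= min_fraction


# Hypothetical input: one entity set per sentence.
entities_per_sentence = [
    {"wing", "fuselage", "rivet"},
    {"rivet", "wing", "jig"},
    {"cafeteria", "menu"},
    {"menu", "lunch"},
]
# Hypothetical stand-in for an aircraft assembly ontology.
assembly_terms = {"wing", "fuselage", "rivet", "jig", "fastener"}

segments = segment_by_entities(entities_per_sentence)
relevance = [
    is_relevant(set().union(*(entities_per_sentence[i] for i in seg)), assembly_terms)
    for seg in segments
]
```

In this toy input, the first two sentences share entities and form one segment that matches the assumed ontology terms, while the last two form a second segment that does not. A real ontology would also carry relations between concepts, which this flat set ignores.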

Type
Special Issue Articles
Copyright
Copyright © Cambridge University Press 2016 
