
A new approach for textual feature selection based on N-composite isolated labels

Published online by Cambridge University Press: 29 April 2019

Samir Elloumi*
Affiliation:
University of Tunis El Manar, Faculty of Sciences of Tunis, Computer Science Department, Tunis, Tunisia
*Corresponding author. Email: [email protected]

Abstract

Textual Feature Selection (TFS) aims to extract the parts or segments of a text that are most relevant to the information it expresses. The selected features are useful for automatic indexing, summarization, document categorization, knowledge discovery, and so on. Given the huge amount of electronic textual data published daily, many challenges arise concerning both the semantic aspect and processing efficiency. In this paper, we propose a new approach for TFS based on Formal Concept Analysis. Specifically, we propose to extract textual features by exploring the regularities in a formal context where isolated points exist. We introduce the notion of N-composite isolated points as a set of N words to be considered as a single textual feature. We show that a small value of N (between 1 and 3) allows extracting significant textual features compared with existing approaches, even when the initial formal context is not completely covered.
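
To make the Formal Concept Analysis vocabulary concrete, the following is a minimal sketch of a formal context built from documents and words, with a brute-force enumeration of its formal concepts. It is not the algorithm of the paper. The toy data, the function names, and the reading of an "isolated point" as a (document, word) pair covered by exactly one formal concept are all assumptions made for illustration.

```python
from itertools import combinations

# Toy formal context: documents (objects) mapped to the words (attributes)
# they contain. The data is invented for illustration only.
context = {
    "d1": {"bank", "loan"},
    "d2": {"bank", "loan", "rate"},
    "d3": {"rate", "market"},
}

def common_words(docs):
    """Words shared by every document in docs (all words if docs is empty)."""
    if not docs:
        return set().union(*context.values())
    return set.intersection(*(context[d] for d in docs))

def docs_containing(words):
    """Documents that contain every word in words."""
    return {d for d, ws in context.items() if words <= ws}

def formal_concepts():
    """Enumerate all (extent, intent) pairs by closing every subset of documents.

    Exponential in the number of documents; acceptable only for toy contexts.
    """
    concepts = set()
    docs = sorted(context)
    for r in range(len(docs) + 1):
        for subset in combinations(docs, r):
            intent = common_words(set(subset))
            extent = docs_containing(intent)
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts

def concepts_covering(doc, word):
    """Formal concepts whose extent contains doc and whose intent contains word."""
    return [(e, i) for e, i in formal_concepts() if doc in e and word in i]

# Assumed reading of "isolated point": a pair covered by exactly one concept.
isolated = [
    (d, w)
    for d, ws in sorted(context.items())
    for w in sorted(ws)
    if len(concepts_covering(d, w)) == 1
]
print(isolated)
```

Under that assumed reading, an N-composite isolated point would group N words into a single feature instead of relying on one word alone; the sketch covers only the single-word case.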

Type
Article
Copyright
© Cambridge University Press 2019

