
Learning from noisy out-of-domain corpus using dataless classification

Published online by Cambridge University Press: 17 June 2020

Yiping Jin
Affiliation:
Department of Mathematics and Computer Science, Faculty of Science, Chulalongkorn University, Bangkok 10300, Thailand
Dittaya Wanvarie*
Affiliation:
Department of Mathematics and Computer Science, Faculty of Science, Chulalongkorn University, Bangkok 10300, Thailand
Phu T. V. Le
Affiliation:
Knorex Pte. Ltd., 8 Cross St, Singapore 048424, Singapore
*Corresponding author. E-mail: [email protected]

Abstract

In real-world applications, text classification models often suffer from a lack of accurately labelled documents. The available labelled documents may also be out of domain, so a model trained on them performs poorly in the target domain. In this work, we mitigate this data problem with a two-stage approach. First, we mine representative keywords from a noisy out-of-domain data set using statistical methods. We then apply a dataless classification method that learns from the automatically selected keywords and unlabelled in-domain data. The proposed approach outperforms various supervised learning and dataless classification baselines by a large margin. We evaluate different keyword selection methods both intrinsically and extrinsically, measuring their impact on dataless classification accuracy. Finally, we conduct an in-depth analysis of the classifier's behaviour and explain why the proposed dataless classification method outperforms its supervised learning counterparts.
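To make the two-stage pipeline concrete, below is a minimal Python sketch on toy corpora. It is an illustration, not the authors' exact method: a chi-square test stands in for the paper's unspecified "statistical methods" for keyword mining, and cosine similarity between TF-IDF document vectors and per-class keyword pseudo-documents stands in for the dataless classifier; all data, labels, and names are assumptions made for the example.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2

# Stage 1 input: a noisy out-of-domain labelled corpus (toy stand-in).
ood_docs = [
    "the striker scored a late goal in the match",
    "the midfielder was booked after a rough tackle",
    "the central bank raised interest rates again",
    "quarterly earnings beat analyst forecasts",
]
ood_labels = np.array([0, 0, 1, 1])  # 0 = sports, 1 = finance

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(ood_docs)
vocab = np.array(vec.get_feature_names_out())
scores = np.nan_to_num(chi2(X, ood_labels)[0])  # term-label association strength

def top_keywords(label, k=6):
    # Chi-square measures association strength but not direction, so keep
    # only terms that are more frequent in `label` than in the other class.
    in_cls = np.asarray(X[ood_labels == label].mean(axis=0)).ravel()
    rest = np.asarray(X[ood_labels != label].mean(axis=0)).ravel()
    ranked = np.argsort(scores * (in_cls > rest))[::-1]
    return vocab[ranked[:k]].tolist()

# Stage 2: represent each class by a pseudo-document of its mined keywords
# and label unlabelled in-domain documents by similarity in TF-IDF space
# (rows are L2-normalised, so the dot product equals cosine similarity).
class_vecs = vec.transform([" ".join(top_keywords(c)) for c in (0, 1)])
in_domain_docs = [
    "fans cheered as the team scored twice",
    "investors reacted to the interest rate decision",
]
sims = (vec.transform(in_domain_docs) @ class_vecs.T).toarray()
for doc, row in zip(in_domain_docs, sims):
    print(f"{doc!r} -> {['sports', 'finance'][int(np.argmax(row))]}")

The directional mask inside top_keywords is the one non-obvious step: without it, the symmetric chi-square statistic would return the same keyword list for every class.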

Type: Article
Copyright: © The Author(s), 2020. Published by Cambridge University Press

