Improving speech emotion recognition based on acoustic words emotion dictionary

Published online by Cambridge University Press:  10 June 2020

Wang Wei
Affiliation:
School of Education Science, Nanjing Normal University, Nanjing, 210097, China
Xinyi Cao
Affiliation:
School of Education Science, Nanjing Normal University, Nanjing, 210097, China
He Li
Affiliation:
School of Education Science, Nanjing Normal University, Nanjing, 210097, China; Department of Computer Science and Information Technology, La Trobe University, Melbourne, Australia
Lingjie Shen
Affiliation:
School of Education Science, Nanjing Normal University, Nanjing, 210097, China
Yaqin Feng
Affiliation:
School of Education Science, Nanjing Normal University, Nanjing, 210097, China
Paul A. Watters*
Affiliation:
Department of Computer Science and Information Technology, La Trobe University, Melbourne, Australia
*Corresponding author. E-mail: [email protected]

Abstract

To improve speech emotion recognition, we propose a U-AWED features model based on an acoustic words emotion dictionary (AWED). The method models emotional information at the acoustic-word level for each emotion class. The top-ranked words for each emotion are selected to generate the AWED vector. The U-AWED model is then constructed by combining utterance-level acoustic features with the AWED features. A support vector machine and a convolutional neural network are employed as the classifiers in our experiments. The results show that the proposed method provides a significant improvement in unweighted average recall on all four emotion classification tasks.
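The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how acoustic words are obtained or how the top words are ranked. The sketch assumes each utterance is already a sequence of integer acoustic-word ids (for example, from quantizing frame-level features) and that words are ranked by raw frequency within each emotion class. All function names (`build_awed`, `awed_vector`, `u_awed_features`, `unweighted_average_recall`) are hypothetical.

```python
from collections import Counter


def build_awed(train_utts, top_k):
    """Build a per-emotion dictionary of top acoustic words.

    train_utts: list of (acoustic_word_ids, emotion_label) pairs.
    Assumption: words are ranked by raw frequency within each class,
    with ties broken by word id. The paper may rank them differently.
    """
    counts = {}
    for words, label in train_utts:
        counts.setdefault(label, Counter()).update(words)
    awed = {}
    for label in sorted(counts):
        ranked = sorted(counts[label].items(), key=lambda kv: (-kv[1], kv[0]))
        awed[label] = [word for word, _ in ranked[:top_k]]
    return awed


def awed_vector(words, awed):
    """One value per emotion: fraction of the utterance's acoustic words
    that appear in that emotion's top-word list (an assumed encoding)."""
    total = max(len(words), 1)
    vector = []
    for label in sorted(awed):
        top_words = set(awed[label])
        vector.append(sum(1 for w in words if w in top_words) / total)
    return vector


def u_awed_features(utterance_feats, words, awed):
    """Concatenate utterance-level acoustic features with the AWED vector."""
    return list(utterance_feats) + awed_vector(words, awed)


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls, so every class counts equally."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        recalls.append(sum(1 for i in idx if y_pred[i] == cls) / len(idx))
    return sum(recalls) / len(recalls)


# Toy data: four training utterances, two emotion classes.
train = [
    ([1, 1, 2, 3], "angry"),
    ([1, 2, 2, 4], "angry"),
    ([5, 5, 6, 3], "sad"),
    ([5, 6, 6, 7], "sad"),
]
awed = build_awed(train, top_k=2)
features = u_awed_features([0.1, 0.2], [1, 2, 5, 9], awed)
uar = unweighted_average_recall(["a", "a", "a", "b"], ["a", "a", "b", "b"])
```

The combined feature vector would then be passed to a classifier such as a support vector machine. Unweighted average recall averages the recall of each class without weighting by class size, which is why it suits class-imbalanced emotion data.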

Type
Article
Copyright
© The Author(s), 2020. Published by Cambridge University Press
