Voice conversion aims to change a source speaker's voice so that it sounds like that of a target speaker while preserving the linguistic information. Despite the rapid advance of voice conversion algorithms in the last decade, most of them are still too complicated to be accessible to the public. With the popularity of mobile devices, especially smartphones, mobile voice conversion applications are highly desirable, so that everyone can enjoy the pleasure of high-quality voice mimicry and people with speech disorders can also potentially benefit from them. Given the limited computing resources on mobile phones, the major concern for such an application is time efficiency, which determines whether the user experience is positive. In this paper, we detail the development of a mobile voice conversion system based on the Gaussian mixture model (GMM) and weighted frequency warping methods. We boost computational efficiency by making the best use of the hardware characteristics of today's mobile phones, such as parallel computing on multiple cores and advanced vectorization support. Experimental evaluation results indicate that our system achieves acceptable voice conversion performance, while converting a five-second sentence takes only slightly more than one second on an iPhone 7.
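To make the per-frame computation concrete, the following is a minimal sketch of a vectorized joint-density GMM spectral mapping, the classical formulation behind GMM-based conversion; the function name, array shapes, and the use of NumPy are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def gmm_convert(x, weights, mu_x, mu_y, cov_xx_inv, cov_yx):
    """Vectorized joint-density GMM mapping of source frames to target frames.

    x           : (T, D)    source spectral frames
    weights     : (M,)      GMM mixture weights
    mu_x, mu_y  : (M, D)    source/target mixture means
    cov_xx_inv  : (M, D, D) inverses of the source covariance blocks
    cov_yx      : (M, D, D) target-source cross-covariance blocks
    """
    diff = x[None, :, :] - mu_x[:, None, :]                     # (M, T, D)
    # Mahalanobis term for each mixture and frame
    maha = np.einsum('mtd,mde,mte->mt', diff, cov_xx_inv, diff)
    logdet = np.linalg.slogdet(cov_xx_inv)[1]                   # log|Sigma_xx^-1|
    log_resp = np.log(weights)[:, None] + 0.5 * logdet[:, None] - 0.5 * maha
    log_resp -= log_resp.max(axis=0, keepdims=True)
    resp = np.exp(log_resp)
    resp /= resp.sum(axis=0, keepdims=True)                     # posteriors p(m | x_t)
    # Conditional mean E[y | x, m] = mu_y + Sigma_yx Sigma_xx^-1 (x - mu_x)
    cond = mu_y[:, None, :] + np.einsum('mde,mef,mtf->mtd', cov_yx, cov_xx_inv, diff)
    return np.einsum('mt,mtd->td', resp, cond)                  # (T, D) converted frames
```

Batched operations of this kind map naturally onto the vectorized (SIMD) math routines available on modern phones, and independent blocks of frames can be converted on separate cores, which is the kind of hardware exploitation the system relies on.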
Laughter commonly occurs in daily interactions; it is not simply related to funny situations but also expresses attitudes and serves important social functions in communication. The background of the present work is generating natural motions in a humanoid robot, where miscommunication may be caused by a mismatch between the audio and visual modalities, especially during laughter events. In the present work, we used a multimodal dialogue database and analyzed facial, head, and body motion during laughing speech. Based on the analysis of human behaviors during laughing speech, we propose a motion generation method that takes the speech signal and the laughing speech intervals as input. Subjective experiments were conducted with our android robot by generating five different motion types covering several modalities. The evaluation results showed the effectiveness of controlling different parts of the face, head, and upper body (eyelid narrowing, lip corner/cheek raising, eye blinking, head motion, and upper body motion control).
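As a purely hypothetical illustration of how laughing-speech intervals and a speech intensity envelope could drive time-aligned control curves for the face, head, and upper body (this rule-based sketch is not the motion generation method proposed in the paper; all names and constants are assumptions):

```python
import numpy as np

def laughter_motion_curves(intensity, laugh_intervals, fps=30):
    """Hypothetical rule-based sketch: map laughing-speech intervals and a
    frame-level speech intensity envelope onto smooth 0..1 control curves.

    intensity       : (T,) speech intensity, one value per video frame
    laugh_intervals : list of (start_sec, end_sec) laughing-speech intervals
    """
    T = len(intensity)
    laugh_mask = np.zeros(T)
    for start, end in laugh_intervals:
        laugh_mask[int(start * fps):int(end * fps)] = 1.0
    # Smooth the binary mask so motions ramp up and decay naturally
    kernel = np.hanning(int(0.5 * fps))
    kernel /= kernel.sum()
    envelope = np.convolve(laugh_mask, kernel, mode='same')
    norm_int = intensity / (intensity.max() + 1e-8)
    return {
        'eyelid_narrowing': envelope,
        'lip_corner_raise': envelope * (0.5 + 0.5 * norm_int),
        'head_pitch':       0.3 * envelope * norm_int,   # nodding-like bounce
        'upper_body_lean':  0.2 * envelope,
    }
```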
Engagement represents how interested a user is in the current dialogue and how willing they are to continue it. Engagement recognition provides an important clue for dialogue systems to generate behaviors adapted to the user. This paper addresses engagement recognition based on multimodal listener behaviors: backchannels, laughing, head nodding, and eye gaze. In the annotation of engagement, the ground-truth data often differ from one annotator to another because the perception of engagement is subjective. To deal with this, we assume that each annotator has a latent character that affects his or her perception of engagement. We propose a hierarchical Bayesian model that estimates both the engagement and the character of each annotator as latent variables. Furthermore, we integrate the engagement recognition model with automatic detection of the listener behaviors to realize online engagement recognition. Experimental results show that the proposed model improves recognition accuracy compared with methods that do not consider the annotator character, such as majority voting. We also achieve online engagement recognition without degrading accuracy.
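A minimal sketch of the underlying idea, assuming binary per-annotator labels, a continuous latent engagement level per segment, and an annotator-specific perception bias as the "character"; the priors, parameterization, and use of PyMC are assumptions rather than the paper's exact model:

```python
import numpy as np
import pymc as pm

# Hypothetical data: N dialogue segments, A annotators, binary labels y[k];
# instance_idx[k] / annotator_idx[k] say which segment/annotator produced label k.
N, A = 100, 5
rng = np.random.default_rng(0)
instance_idx = np.repeat(np.arange(N), A)
annotator_idx = np.tile(np.arange(A), N)
labels = rng.integers(0, 2, size=N * A)        # replace with real annotations

with pm.Model():
    # Latent engagement level of each segment
    engagement = pm.Beta("engagement", alpha=2.0, beta=2.0, shape=N)
    # Latent "character": how readily each annotator perceives engagement
    character = pm.Normal("character", mu=0.0, sigma=1.0, shape=A)
    sharpness = pm.HalfNormal("sharpness", sigma=5.0)
    # Probability that annotator a labels segment i as engaged
    logit_p = sharpness * (engagement[instance_idx] - 0.5) + character[annotator_idx]
    pm.Bernoulli("y", logit_p=logit_p, observed=labels)
    trace = pm.sample(1000, tune=1000, target_accept=0.9)
```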
In this paper, we propose a novel neutral-to-emotional voice conversion (VC) model that can effectively learn a mapping from neutral to emotional speech with limited emotional voice data. Although conventional VC techniques have achieved tremendous success in spectral conversion, the lack of rich representations of the fundamental frequency (F0), which explicitly conveys prosody information, remains a major limiting factor for emotional VC. To overcome this limitation, our proposed model outlines the practical elements of the cross-wavelet transform (XWT) method, highlighting how it is applied to synthesize diverse representations of F0 features in emotional VC. The idea is (1) to decompose F0 into representations at different temporal levels using the continuous wavelet transform (CWT); (2) to use the XWT to combine different CWT-F0 features into interaction XWT-F0 features; and (3) to train the emotional VC model on both the CWT-F0 and the corresponding XWT-F0 features. Moreover, to better measure similarities between the converted and real F0 features, we apply a VA-GAN training model, which combines a variational autoencoder (VAE) with a generative adversarial network (GAN). In the VA-GAN model, the VAE learns latent representations of the high-dimensional features (CWT-F0, XWT-F0), while the discriminator of the GAN uses the learned feature representations as the basis for the VAE reconstruction objective.
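A minimal NumPy sketch of the decomposition and combination steps, using a Mexican-hat mother wavelet and treating the cross-wavelet interaction as an element-wise product of two decompositions; the scales, normalization, and interpolation details are assumptions, not the paper's exact configuration:

```python
import numpy as np

def mexican_hat(t, scale):
    """Mexican-hat (Ricker) wavelet at a given scale."""
    x = t / scale
    return (2 / (np.sqrt(3 * scale) * np.pi ** 0.25)) * (1 - x ** 2) * np.exp(-x ** 2 / 2)

def cwt_f0(logf0, scales):
    """Decompose a (continuously interpolated) log-F0 contour into one
    band per scale by convolving with scaled Mexican-hat wavelets."""
    t = np.arange(len(logf0), dtype=float) - len(logf0) // 2
    bands = [np.convolve(logf0, mexican_hat(t, s), mode='same') for s in scales]
    return np.stack(bands)                     # (num_scales, T)

def xwt_features(cwt_a, cwt_b):
    """Cross-wavelet style interaction features: element-wise product of two
    CWT decompositions (for a real wavelet the conjugate is a no-op)."""
    return cwt_a * np.conj(cwt_b)

# Example: dyadic scales, as commonly used for CWT-based F0 modelling
scales = [2 ** k for k in range(10)]
```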
Our aim is to develop a smartphone-based life-logging system, for which human activity recognition (HAR) is one of the core techniques. Recent studies have reported the effectiveness of feed-forward neural networks (FF-NNs) and recurrent neural networks (RNNs) as classifiers for the HAR task. However, several problems remain unresolved in those studies: (1) a life-logging system that uses only a smartphone as the recording device has not been developed, (2) only indoor activities have been used for evaluation, and (3) RNNs have not been investigated and evaluated sufficiently. In this study, we address these problems as follows. (1) We build a prototype life-logging system and conduct a data recording experiment with it that covers both indoor and outdoor activities; the HAR results on this new dataset show that an RNN-based classifier is still effective. (2) The HAR experiment demonstrates that a multi-layered Simple Recurrent Unit (SRU) with a non-linear transform at the bottom layer and a highway connection is the most effective configuration. (3) We explain why the RNN improves over the FF-NN by observing the posterior probabilities over the test data.
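To make the recurrent architecture concrete, the following is a minimal Simple Recurrent Unit layer with its highway connection, following the standard SRU formulation, stacked under a non-linear bottom transform; the layer sizes, number of layers, and framework (PyTorch) are assumptions rather than the configuration reported in the paper:

```python
import torch
import torch.nn as nn

class SRULayer(nn.Module):
    """Minimal Simple Recurrent Unit (SRU) layer with a highway connection."""

    def __init__(self, dim):
        super().__init__()
        # One projection produces the candidate, forget gate, and reset gate inputs
        self.proj = nn.Linear(dim, 3 * dim)

    def forward(self, x):                       # x: (T, B, dim)
        xt, f_in, r_in = self.proj(x).chunk(3, dim=-1)
        f = torch.sigmoid(f_in)
        r = torch.sigmoid(r_in)
        c = torch.zeros_like(x[0])
        outputs = []
        for t in range(x.size(0)):
            c = f[t] * c + (1 - f[t]) * xt[t]                  # internal cell state
            h = r[t] * torch.tanh(c) + (1 - r[t]) * x[t]       # highway connection
            outputs.append(h)
        return torch.stack(outputs)

class HARClassifier(nn.Module):
    """Bottom non-linear transform + stacked SRU layers + class logits."""

    def __init__(self, in_dim, hidden_dim, num_classes, num_layers=3):
        super().__init__()
        self.bottom = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
        self.srus = nn.ModuleList(SRULayer(hidden_dim) for _ in range(num_layers))
        self.out = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):                       # x: (T, B, in_dim) sensor frames
        h = self.bottom(x)
        for layer in self.srus:
            h = layer(h)
        return self.out(h[-1])                  # logits from the last time step
```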