Voice conversion aims to change a source speaker's voice to make it sound like that of a target speaker while preserving the linguistic information. Despite the rapid advance of voice conversion algorithms in the last decade, most of them are still too complicated to be accessible to the public. With the popularity of mobile devices, especially smartphones, mobile voice conversion applications are highly desirable, so that everyone can enjoy the pleasure of high-quality voice mimicry and people with speech disorders can also potentially benefit from them. Due to the limited computing resources on mobile phones, the major concern for such an application is time efficiency, which is essential to a positive user experience. In this paper, we detail the development of a mobile voice conversion system based on the Gaussian mixture model (GMM) and weighted frequency warping methods. We boost computational efficiency by exploiting the hardware characteristics of today's mobile phones, such as parallel computing on multiple cores and advanced vectorization support. Experimental evaluation results indicate that our system achieves acceptable voice conversion performance, while converting a five-second sentence takes only slightly more than one second on an iPhone 7.
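As a rough illustration of the core mapping in GMM-based conversion, the sketch below applies a joint-density GMM regression to one source spectral frame. All names, the full-covariance formulation, and the use of NumPy are assumptions made for illustration only; the authors' system additionally applies weighted frequency warping and platform-specific parallelization/vectorization, which are not shown here.

```python
import numpy as np

def gmm_convert(x, weights, mu_x, mu_y, cov_xx, cov_yx):
    """Map one source spectral frame x toward the target speaker with a
    joint-density GMM (a generic sketch, not the authors' exact code).

    x       : (D,)        source feature frame
    weights : (M,)        mixture weights
    mu_x    : (M, D)      source means
    mu_y    : (M, D)      target means
    cov_xx  : (M, D, D)   source covariances
    cov_yx  : (M, D, D)   cross covariances (target-source)
    """
    M, D = mu_x.shape
    # Posterior probability of each mixture component given x
    # (the constant term cancels in the softmax and is omitted).
    log_post = np.empty(M)
    for m in range(M):
        diff = x - mu_x[m]
        inv = np.linalg.inv(cov_xx[m])
        log_post[m] = (np.log(weights[m])
                       - 0.5 * diff @ inv @ diff
                       - 0.5 * np.linalg.slogdet(cov_xx[m])[1])
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    # Minimum mean-square-error estimate of the target frame.
    y = np.zeros(D)
    for m in range(M):
        y += post[m] * (mu_y[m]
                        + cov_yx[m] @ np.linalg.inv(cov_xx[m]) @ (x - mu_x[m]))
    return y
```

Evaluating this mapping over all frames of an utterance and all mixture components is exactly the kind of workload that benefits from the multi-core parallelism and SIMD vectorization mentioned above.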
Laughter commonly occurs in daily interactions; it is not simply related to funny situations but also expresses certain attitudes and serves important social functions in communication. The goal of the present work is to generate natural motions in a humanoid robot, since mismatches between the audio and visual modalities, especially during laughter events, can cause miscommunication. In the present work, we used a multimodal dialogue database and analyzed facial, head, and body motion during laughing speech. Based on the analysis of human behavior during laughing speech, we propose a motion generation method that takes the speech signal and the laughing speech intervals as input. Subjective experiments were conducted with our android robot by generating five different motion types that consider several modalities. Evaluation results showed the effectiveness of controlling different parts of the face, head, and upper body (eyelid narrowing, lip corner/cheek raising, eye blinking, head motion, and upper body motion).
Engagement represents how much a user is interested in and willing to continue the current dialogue. Engagement recognition provides an important clue for dialogue systems to generate adaptive behaviors for the user. This paper addresses engagement recognition based on multimodal listener behaviors: backchannels, laughing, head nodding, and eye gaze. In the annotation of engagement, the ground-truth data often differ from one annotator to another due to the subjectivity of the perception of engagement. To deal with this, we assume that each annotator has a latent character that affects his/her perception of engagement. We propose a hierarchical Bayesian model that estimates both the engagement and the character of each annotator as latent variables. Furthermore, we integrate the engagement recognition model with automatic detection of the listener behaviors to realize online engagement recognition. Experimental results show that the proposed model improves recognition accuracy compared with methods that do not consider annotator character, such as majority voting. We also achieve online engagement recognition without degrading accuracy.
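To make the latent-character idea concrete, here is a toy generative sketch in which each annotator's character acts as a bias on how readily engagement is perceived. The logistic parameterization, priors, and dimensions are purely hypothetical and are not the hierarchical Bayesian model used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_annotations(n_sessions=100, n_annotators=5):
    """Toy generative sketch of the latent-character idea (hypothetical
    parameterization): each annotator has a latent bias that shifts how
    readily he/she labels a session as engaged."""
    # Latent engagement level of each session and latent character of each annotator.
    engagement = rng.beta(2, 2, size=n_sessions)
    character = rng.normal(0.0, 1.0, size=n_annotators)
    # An annotator's label probability is the session's engagement shifted by the character bias.
    logits = np.log(engagement / (1 - engagement))[:, None] + character[None, :]
    p_engaged = 1.0 / (1.0 + np.exp(-logits))
    labels = rng.random((n_sessions, n_annotators)) < p_engaged
    return engagement, character, labels
```

Inference in the paper's model works in the opposite direction: given the observed labels (and listener behaviors), both the session-level engagement and the per-annotator character are estimated as latent variables.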
In this paper, we propose a novel neutral-to-emotional voice conversion (VC) model that can effectively learn a mapping from neutral to emotional speech with limited emotional voice data. Although conventional VC techniques have achieved tremendous success in spectral conversion, the lack of adequate representations of the fundamental frequency (F0), which explicitly encodes prosody information, is still a major limiting factor for emotional VC. To overcome this limitation, our proposed model outlines the practical elements of the cross-wavelet transform (XWT) method and shows how it can be applied to synthesize diverse representations of F0 features in emotional VC. The idea is (1) to decompose F0 into representations at different temporal levels using the continuous wavelet transform (CWT); (2) to use the XWT to combine different CWT-F0 features into interaction XWT-F0 features; and (3) to use both the CWT-F0 and the corresponding XWT-F0 features to train the emotional VC model. Moreover, to better measure similarities between the converted and real F0 features, we apply a VA-GAN training model, which combines a variational autoencoder (VAE) with a generative adversarial network (GAN). In the VA-GAN model, the VAE learns latent representations of the high-dimensional features (CWT-F0, XWT-F0), while the discriminator of the GAN uses the learned feature representations as a basis for the VAE reconstruction objective.
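The sketch below illustrates the CWT/XWT building blocks on a log-F0 contour using PyWavelets. The wavelet choice, the number of scales, and the particular pair of coefficient sets being combined are assumptions made for illustration; they are not the paper's configuration.

```python
import numpy as np
import pywt

def cwt_f0(logf0, scales=None, wavelet="cmor1.5-1.0"):
    """Decompose an (interpolated, zero-mean) log-F0 contour into
    multi-scale CWT coefficients. Scales and wavelet are illustrative
    assumptions, not the paper's exact configuration."""
    if scales is None:
        scales = np.arange(1, 11)                 # 10 temporal levels
    coeffs, _ = pywt.cwt(logf0, scales, wavelet)  # (n_scales, n_frames), complex
    return coeffs

def xwt(coeffs_a, coeffs_b):
    """Cross-wavelet transform: element-wise product of one set of CWT
    coefficients with the complex conjugate of the other, capturing their
    interaction across scales and time."""
    return coeffs_a * np.conj(coeffs_b)

# Usage sketch (hypothetical): combine two CWT-F0 representations into an
# interaction XWT-F0 feature used alongside the CWT-F0 features.
# cwt_a = cwt_f0(logf0_a)
# cwt_b = cwt_f0(logf0_b)
# interaction = xwt(cwt_a, cwt_b)
```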
Our aim is to develop a smartphone-based life-logging system. Human activity recognition (HAR) is one of the core techniques needed to realize it. Recent studies have reported the effectiveness of feed-forward neural networks (FF-NNs) and recurrent neural networks (RNNs) as classifiers for the HAR task. However, there are still unresolved problems in those studies: (1) a life-logging system that uses only a smartphone as the recording device has not been developed; (2) only indoor activities have been used for evaluation; and (3) RNNs have not been sufficiently investigated and evaluated. In this study, we address these problems as follows: (1) we build a prototype life-logging system and conduct a data recording experiment on it that includes both indoor and outdoor activities, and the HAR results on this new dataset show that an RNN-based classifier remains effective; (2) the HAR experiments demonstrate that a multi-layered Simple Recurrent Unit (SRU) with a non-linear transform at the bottom layer and a highway connection is the most effective; and (3) we explain the improvement of the RNN over the FF-NN by observing the posterior probabilities over the test data.
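As a hypothetical sketch of the classifier family described above, the following PyTorch code stacks minimal SRU-style layers with highway connections on top of a non-linear input transform; the exact SRU variant, layer sizes, and pooling used in the study may differ.

```python
import torch
import torch.nn as nn

class SRULayer(nn.Module):
    """Minimal Simple Recurrent Unit layer with a highway connection
    (a sketch of the published SRU formulation; the study's exact variant
    and hyper-parameters may differ)."""
    def __init__(self, dim):
        super().__init__()
        self.lin = nn.Linear(dim, 3 * dim)  # candidate, forget gate, reset gate

    def forward(self, x):                   # x: (T, B, dim)
        T, B, D = x.shape
        z, f, r = self.lin(x).chunk(3, dim=-1)
        f, r = torch.sigmoid(f), torch.sigmoid(r)
        c = torch.zeros(B, D, device=x.device)
        hs = []
        for t in range(T):
            c = f[t] * c + (1 - f[t]) * z[t]                     # recurrent state
            hs.append(r[t] * torch.tanh(c) + (1 - r[t]) * x[t])  # highway skip
        return torch.stack(hs)

class HARClassifier(nn.Module):
    """Hypothetical HAR classifier: non-linear transform at the bottom,
    stacked SRU layers, mean pooling over time, logits over activities."""
    def __init__(self, in_dim, hid_dim, n_classes, n_layers=3):
        super().__init__()
        self.bottom = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.Tanh())
        self.layers = nn.ModuleList(SRULayer(hid_dim) for _ in range(n_layers))
        self.out = nn.Linear(hid_dim, n_classes)

    def forward(self, x):                   # x: (T, B, in_dim) sensor frames
        h = self.bottom(x)
        for layer in self.layers:
            h = layer(h)
        return self.out(h.mean(dim=0))      # (B, n_classes)
```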