Eye tracking for the online evaluation of prosody in speech synthesis

doi:10.1017/CBO9780511844492.012

12 - Eye tracking for the online evaluation of prosody in speech synthesis

from Part V - Evaluation and shared tasks

Published online by Cambridge University Press: 05 July 2014

Michael White, Rajakrishnan Rajkumar, Kiwako Ito and Shari R. Speer

Edited by Amanda Stent and Srinivas Bangalore


Michael White: Ohio State University
Rajakrishnan Rajkumar: Ohio State University
Kiwako Ito: Ohio State University
Shari R. Speer: Ohio State University
Amanda Stent: AT&T Research, Florham Park, New Jersey
Srinivas Bangalore: AT&T Research, Florham Park, New Jersey

Summary

Introduction

The past decade has witnessed remarkable progress in speech synthesis research, to the point where synthetic voices can be hard to distinguish from natural ones, at least for utterances with neutral, declarative prosody. Neutral intonation often does not suffice, however, in interactive systems: instead it can sound disengaged or “dead,” and can be misleading as to the intended meaning.

For concept-to-speech systems, especially interactive ones, natural language generation researchers have developed a variety of methods for making contextually appropriate prosodic choices, depending on discourse-related factors such as givenness, parallelism, or theme/rheme alternative sets, as well as information-theoretic considerations (Prevost, 1995; Hitzeman et al., 1998; Pan et al., 2002; Bulyko and Ostendorf, 2002; Theune, 2002; Kruijff-Korbayová et al., 2003; Nakatsu and White, 2006; Brenier et al., 2006; White et al., 2010). In this setting, it is possible to adapt limited-domain synthesis techniques to produce utterances with perceptually distinguishable, contextually varied intonation (see Black and Lenzo, 2000; Baker, 2003; van Santen et al., 2005; Clark et al., 2007, for example). To evaluate these utterances, listening tests have typically been employed, sometimes augmented with expert evaluations. For example, an evaluation of the limited-domain voice used in the FLIGHTS concept-to-speech system (Moore et al., 2004; White et al., 2010) demonstrated that the prosodic specifications produced by the system's natural language generation component yielded significantly more natural synthetic speech than two baseline voices, both in listening tests and in an expert evaluation.

Type: Chapter
Information: Natural Language Generation in Interactive Systems, pp. 281-301

DOI: https://doi.org/10.1017/CBO9780511844492.012

Publisher: Cambridge University Press

Print publication year: 2014

References

Baker, R. E. (2003). Using Unit Selection to Synthesise Contextually Appropriate Intonation in Limited Domain Synthesis. Master's thesis, Department of Linguistics, University of Edinburgh.

Beckman, M. E., Hirschberg, J., and Shattuck-Hufnagel, S. (2005). The original ToBI system and the evolution of the ToBI framework. In Jun, S.-A., editor, Prosodic Typology: The Phonology of Intonation and Phrasing. Oxford University Press, Oxford, UK.

Black, A. and Lenzo, K. (2000). Limited domain synthesis. In Proceedings of the International Conference on Spoken Language Processing (INTERSPEECH), pages 411-414, Beijing, China. International Speech Communication Association.

Brenier, J., Nenkova, A., Kothari, A., Whitton, L., Beaver, D., and Jurafsky, D. (2006). The (non)utility of linguistic features for predicting prominence in spontaneous speech. In Proceedings of the IEEE Spoken Language Technology Workshop, pages 54-57, Palm Beach, FL. Institute of Electrical and Electronics Engineers.

Bulyko, I. and Ostendorf, M. (2002). Efficient integrated response generation from multiple targets using weighted finite state transducers. Computer Speech and Language, 16(3-4):533-550.

Clark, R. A. J., Richmond, K., and King, S. (2007). Multisyn: Open-domain unit selection for the Festival speech synthesis system. Speech Communication, 49(4):317-330.

De Carolis, B., Pelachaud, C., Poggi, I., and Steedman, M. (2004). APML, a markup language for believable behavior generation. In Prendinger, H. and Ishizuka, M., editors, Life-like Characters. Tools, Affective Functions and Applications, pages 65-85. Springer, Berlin, Germany.

Espinosa, D., White, M., Fosler-Lussier, E., and Brew, C. (2010). Machine learning for text selection with expressive unit-selection voices. In Proceedings of the International Conference on Spoken Language Processing (INTERSPEECH), pages 1125-1128, Makuhari, Chiba, Japan. International Speech Communication Association.

Hitzeman, J., Black, A. W., Mellish, C., Oberlander, J., and Taylor, P. (1998). On the use of automatically generated discourse-level information in a concept-to-speech synthesis system. In Proceedings of the International Conference on Spoken Language Processing (ICSLP), Sydney, Australia. International Speech Communication Association.

Ito, K. and Speer, S. R. (2008). Use of L + H* for immediate contrast resolution. In Proceedings of Speech Prosody, Campinas, Brazil. Speech Prosody Special Interest Group.

Ito, K. and Speer, S. R. (2011). Semantically-independent but contextually-dependent interpretation of contrastive accent. In Frota, S., Elordieta, G., and Prieto, P., editors, Prosodic Categories: Production, Perception and Comprehension, pages 69-92. Springer, Dordrecht, The Netherlands.

Karaiskos, V., King, S., Clark, R. A. J., and Mayo, C. (2008). The Blizzard Challenge 2008. In Proceedings of the Blizzard Challenge (in conjunction with the ISCA Workshop on Speech Synthesis), Brisbane, Australia. Blizzard.

King, S. and Karaiskos, V. (2009). The Blizzard Challenge 2009. In Proceedings of the Blizzard Challenge, Edinburgh, Scotland. University of Edinburgh.

Kruijff-Korbayová, I., Ericsson, S., Rodríguez, K. J., and Karagjosova, E. (2003). Producing contextually appropriate intonation in an information-state based dialogue system. In Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 227-234, Budapest, Hungary. Association for Computational Linguistics.

Mayo, C., Clark, R. A. J., and King, S. (2011). Listeners' weighting of acoustic cues to synthetic speech naturalness: A multidimensional scaling analysis. Speech Communication, 53(3):311-326.

Moore, J. D., Foster, M. E., Lemon, O., and White, M. (2004). Generating tailored, comparative descriptions in spoken dialogue. In Proceedings of the Florida Artificial Intelligence Research Society Conference (FLAIRS), pages 917-922, Miami Beach, FL. The Florida Artificial Intelligence Research Society.

Nakatsu, C. and White, M. (2006). Learning to say it well: Reranking realizations by predicted synthesis quality. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pages 1113-1120, Sydney, Australia. Association for Computational Linguistics.

Pan, S., McKeown, K., and Hirschberg, J. (2002). Exploring features from natural language generation for prosody modeling. Computer Speech and Language, 16:457-490.

Prevost, S. (1995). A Semantics of Contrast and Information Structure for Specifying Intonation in Spoken Language Generation. PhD thesis, University of Pennsylvania.

Rajkumar, R., White, M., Ito, K., and Speer, S. R. (2010). Evaluating prosody in synthetic speech with online (eye-tracking) and offline (rating) methods. In Proceedings of the ISCA Workshop on Speech Synthesis (SSW), pages 276-281, Kyoto, Japan. International Speech Communication Association.

Speer, S. R. (2011). Eye movements as a measure of spoken language processing. In Cohn, A. C., Fougeron, C., and Huffman, M. K., editors, The Oxford Handbook of Laboratory Phonology, pages 580-592. Oxford University Press, Oxford, UK.

Swift, M. D., Campana, E., Allen, J. F., and Tanenhaus, M. K. (2002). Monitoring eye movements as an evaluation of synthesized speech. In Proceedings of the IEEE Workshop on Speech Synthesis, pages 19-22, Santa Monica, CA. Institute of Electrical and Electronics Engineers.

Theune, M. (2002). Contrast in concept-to-speech generation. Computer Speech and Language, 16(3-4):491-531.

van Hooijdonk, C., Commandeur, E., Cozijn, R., Krahmer, E., and Marsi, E. (2007). The online evaluation of speech synthesis using eye movements. In Proceedings of the ISCA Workshop on Speech Synthesis, pages 385-390, Bonn, Germany. International Speech Communication Association.

van Santen, J., Kain, A., Klabbers, E., and Mishra, T. (2005). Synthesis of prosody using multilevel unit sequences. Speech Communication, 46(3-4):365-375.

White, M., Clark, R. A. J., and Moore, J. (2010). Generating tailored, comparative descriptions with contextually appropriate intonation. Computational Linguistics, 36(2):159-201.

White, M., Rajkumar, R., Ito, K., and Speer, S. R. (2009). Eye tracking for the online evaluation of prosody in speech synthesis: Not so fast! In Proceedings of the International Conference on Spoken Language Processing (INTERSPEECH), pages 2523-2526, Brighton, UK. International Speech Communication Association.

Zen, H., Tokuda, K., and Black, A. W. (2009). Statistical parametric speech synthesis. Speech Communication, 51(11):1039-1064.