Book contents
- Frontmatter
- Contents
- List of contributors
- 1 Introduction
- Part I Joint construction
- Part II Reference
- Part III Handling uncertainty
- Part IV Engagement
- Part V Evaluation and shared tasks
- 12 Eye tracking for the online evaluation of prosody in speech synthesis
- 13 Comparative evaluation and shared tasks for NLG in interactive systems
- Author index
- Subject index
- References
12 - Eye tracking for the online evaluation of prosody in speech synthesis
from Part V - Evaluation and shared tasks
Published online by Cambridge University Press: 05 July 2014
Summary
Introduction
The past decade has witnessed remarkable progress in speech synthesis research, to the point where synthetic voices can be hard to distinguish from natural ones, at least for utterances with neutral, declarative prosody. Neutral intonation often does not suffice, however, in interactive systems: instead it can sound disengaged or “dead,” and can be misleading as to the intended meaning.
For concept-to-speech systems, especially interactive ones, natural language generation researchers have developed a variety of methods for making contextually appropriate prosodic choices, depending on discourse-related factors such as givenness, parallelism, or theme/rheme alternative sets, as well as information-theoretic considerations (Prevost, 1995; Hitzeman et al., 1998; Pan et al., 2002; Bulyko and Ostendorf, 2002; Theune, 2002; Kruijff-Korbayová et al., 2003; Nakatsu and White, 2006; Brenier et al., 2006; White et al., 2010). In this setting, it is possible to adapt limited-domain synthesis techniques to produce utterances with perceptually distinguishable, contextually varied intonation (see Black and Lenzo, 2000; Baker, 2003; van Santen et al., 2005; Clark et al., 2007, for example). To evaluate these utterances, listening tests have typically been employed, sometimes augmented with expert evaluations. For example, an evaluation of the limited-domain voice used in the FLIGHTS concept-to-speech system (Moore et al., 2004; White et al., 2010) demonstrated that the prosodic specifications produced by the system's natural language generation component yielded significantly more natural synthetic speech than two baseline voices, both in listening tests and in an expert evaluation.
- Type: Chapter
- Information: Natural Language Generation in Interactive Systems, pp. 281-301. Publisher: Cambridge University Press. Print publication year: 2014.