15 - Beyond Functional Speech Synthesis

from Section III - Measuring Speech

Published online by Cambridge University Press: 11 November 2021

Rachael-Anne Knight
Affiliation:
City, University of London
Jane Setter
Affiliation:
University of Reading

Summary

As synthetic voices enter the mass market, there is an increasing need for voice personalisation: a text-to-speech voice that not only conveys information but also projects a persona, much as a human voice does. We begin this chapter with a historical overview of the field, moving from model-based approaches to concatenative systems and finally to contemporary implementations of parametric synthesis. We then examine how the confluence of faster computation at lower cost, the availability of large data sets, and advances in machine learning and artificial intelligence enables a whole new approach to speech synthesis, including the creation of high-quality personalised voices. Next, we consider the role of crowdsourcing in developing a scalable method for voice customisation and adaptation, and discuss the benefits and challenges of acquiring recordings from novice voice talent of all ages around the world and using these recordings for voice building. We conclude by discussing the impact of personalised voices and implications for future work.

Type: Chapter
Publisher: Cambridge University Press
Print publication year: 2021
