Low-level articulatory synthesis: A working text-to-speech solution and a linguistic tool¹

David R. Hill; Craig R. Taube-Schock; Leonard Manzara

doi:10.1017/cnj.2017.15

Published online by Cambridge University Press: 21 June 2017

David R. Hill (corresponding author): University of Calgary, Dept. of Computer Science
Craig R. Taube-Schock: Waikato University, Dept. of Computer Science
Leonard Manzara: University of Calgary, Dept. of Computer Science

Abstract

A complete text-to-speech system has been created by the authors, based on a tube resonance model of the vocal tract and a development of Carré’s “Distinctive Region Model”, which is in turn based on the formant-sensitivity findings of Fant and Pauli (1974), to control the tube. In order to achieve this goal, significant long-term linguistic research has been involved, including rhythm and intonation studies, as well as the development of low-level articulatory data and rules to drive the model, together with the necessary tools, parsers, dictionaries and so on. The tools and the current system are available under a General Public License, and are described here, with further references in the paper, including samples of the speech produced, and figures illustrating the system description.

Résumé

Un système de synthèse vocale complet a été créé par les auteurs, basé sur un modèle de résonance tubulaire du système vocal, et, pour contrôler le tube, sur un développement du modèle aux régions distinctes de René Carré, qui est à son tour basé sur les résultats de Fant and Pauli (1974) au sujet de la sensibilité des formants. Pour atteindre cet objectif, des recherches linguistiques à long terme ont été menées, y compris des études de rythme et d'intonation, ainsi que le développement de données articulatoires de bas niveau et de règles pour faire fonctionner le modèle, ainsi que les outils, les analyseurs syntaxiques, les dictionnaires, etc. Les outils et le système actuel sont disponibles sous une Licence Publique Générale; ils sont décrits ici. D'autres références figurent dans l'article, y compris des exemples de la parole synthétisée et des figures illustrant la description du système.

Keywords

articulatory text-to-speech synthesis, rhythm, intonation, history, research tool; synthèse vocale, rythme, intonation, histoire, outil de recherche

Type: Articles
Information: Canadian Journal of Linguistics/Revue canadienne de linguistique, Volume 62, Issue 3, September 2017, pp. 371–410

DOI: https://doi.org/10.1017/cnj.2017.15
Copyright: © Canadian Linguistic Association/Association canadienne de linguistique 2017

Footnotes

Numerous people have contributed support, research, and technical assistance. Individuals directly involved in the synthesizer work are listed at <http://www.gnu.org/software/gnuspeech>. Walter Lawrence, Betsy Uldall and David Abercrombie were early mentors for the first author. René Carré originated the basic DRM idea, based on Fant and Pauli's (1974) research. Dalmazio Brisinda and Steve Nygard ported the synthesis system to the Macintosh. Marcelo Matuda ported it to GNU/Linux GNUStep. The Canadian Natural Sciences and Engineering Research Council supported early work under grant A5261. Suggestions by three anonymous reviewers significantly improved the article.

References

Abercrombie, David. 1964. English phonetic texts. London: Faber and Faber.

Abercrombie, David. 1967. Elements of general phonetics. Edinburgh: Edinburgh University Press.

Allen, George D. 1972a. The location of rhythmic stress beats in English: An experimental study I. Language and Speech 15(1): 72–100.

Allen, George D. 1972b. The location of rhythmic stress beats in English: An experimental study II. Language and Speech 15(2): 179–195.

Allen, Jonathan, Hunnicutt, M. Sharon, and Klatt, Dennis. 1987. From text to speech: The MITalk system. Cambridge: Cambridge University Press.

Alleydog. 2016. Psychology class notes: Sensation and perception. <http://ww.alleydog.com/101notes/s&p.html>. Accessed 2016-09-18.

Birkholz, Peter. 2013. Modeling consonant–vowel coarticulation for articulatory speech synthesis. PLOS ONE 8(4): e60603. <http://www.ncb1.nlm.nih.gov/pmc/articles/PMC3628899/>. April 16, accessed 2015-01-24.

Boersma, Paul. 2001. PRAAT: Doing phonetics by computer. GLOT International 5(9/10): 341–347. <http://www.fon.hum.uva.nl/praat/>. Accessed 2015-01-24.

Boersma, Paul and van Heuven, Vincent. n.d. Speak and unSpeak with PRAAT. <http://www.fon.hum.uva.nl/paul/papers/speakUnspeakPraatglot2001.pdf>. Accessed 2015-01-24.

Carré, René and Mrayati, M. 1992. Distinctive regions in acoustic tubes: Speech production modelling. Journal d'Acoustique 5: 141–159.

Cohen, Antonie and ’t Hart, Johan. 1968. On the anatomy of intonation. Lingua 19(1/2): 177–192.

Cook, Perry Raymond. 1990. Identification of control parameters in an articulatory vocal tract model, with applications to the synthesis of singing. Doctoral dissertation, Center for Computer Research on Music and Acoustics, Stanford University. <https://ccrma.stanford.edu/files/papers/stanm68.pdf>. Accessed 2015-01-25.

Cooper, Frank S., Liberman, Alvin M., Borst, J. M., and Gerstman, Lou J. 1952. Some experiments on the perception of synthetic speech sounds. Journal of the Acoustical Society of America 24(6): 597–606.

Crystal, David. 1972. The intonation system of English. In Intonation, ed. Bolinger, Dwight D., 110–135. London: Penguin Books.

Delattre, Pierre. 1969. Coarticulation and the locus theory. Studia Linguistica 23(1): 1–26.

Delattre, Pierre, Liberman, Alvin M., and Cooper, Frank S. 1955. Acoustic loci and transitional cues for consonants. Journal of the Acoustical Society of America 27(4): 769–773.

van den Doel, Kees, Vogt, Florian, English, R. Elliot, and Fels, Sidney. 2006. Towards articulatory speech synthesis with a dynamic 3D finite element tongue model. In 7th international seminar on speech production, 59–66. Ubatuba, Brazil. <http://www.cs.ubc.ca/_kvdoel/publications/ta.pdf>. Accessed 2015-01-21.

Dudley, Homer. 1939. The vocoder. Bell Laboratories Record 17: 122–126.

Dudley, Homer, Riesz, R. R., and Watkins, S. A. 1939. A synthetic speaker. Journal of the Franklin Institute 227: 739–764.

Dusterhoff, Kurt E. 2000. Synthesizing fundamental frequency using models automatically trained from data. Doctoral dissertation, University of Edinburgh, Edinburgh.

Fant, C. Gunnar M. 1960. Acoustic theory of speech production: With calculations based on x-ray studies of Russian articulations. The Hague: Mouton.

Fant, C. Gunnar M. 1962. OVE II synthesis strategy. 1962 Stockholm Speech Communications Seminar, paper F5.

Fant, C. Gunnar M. and Pauli, S. 1974. Spatial characteristics of vocal tract resonance models. Tech. Rep., KTH, Stockholm. Proceedings of the Stockholm Communication Seminar.

Fels, Sidney, Vogt, Florian, van den Doel, Kees, Lloyd, John E., Stavness, Ian, and Vatikiotis-Bateson, Eric. 2006. Artisynth: A biomechanical simulation platform for the vocal tract and upper airway. Technical Report TR-2006-10, Computer Science Department, University of British Columbia, Vancouver.

Flanagan, James L. 1972. Speech analysis, synthesis, and perception. Berlin: Springer Verlag.

Green, Peter S. 1958. Consonant–vowel transitions: A spectrographic study. Studia Linguistica 12(2): 57–105.

Halliday, Michael A. K. 1970. A course in spoken English: Intonation. Oxford: Oxford University Press.

’t Hart, Johan, Collier, Ren, and Cohen, Antonie. 1990. A perceptual study of intonation. Cambridge University Press.

Haskins. n.d. Haskins laboratory publications. <http://www.haskins.yale.edu/pubs.html>.

Hill, David. 1972. A basis for model building and learning in automatic speech pattern discrimination. Presented at the Machine Perception of Patterns and Pictures Conference No. 13, Institute of Physics, London.

Hill, David. 1978. A program structure for event-based speech synthesis by rules within a flexible segmental framework. International Journal of Man-Machine Studies 10(3): 285–294.

Hill, David, Jassem, Wiktor, and Witten, Ian H. 1979. A statistical approach to the problem of isochrony in spoken British English. In Current issues in linguistic theory, ed. Hollien, Harry and Hollien, Patricia, vol. 9, 285–294. Amsterdam: John Benjamins.

Hill, David R. and Reid, Neal. 1977. An experiment on the perception of intonational features. International Journal of Man-Machine Studies 9(2): 337–347.

Hill, David R., Witten, Ian H., and Jassem, Wiktor. 1977. Some results from a preliminary study of British English speech rhythm. Presented at the 94th meeting of the Acoustical Society of America. <http://pages.cpsc.ucalgary.ca/_hill/papers/>. Accessed 2016-09-26.

Hoffman, Howard S. 1958. Study of some cues in the perception of the voiced stop consonants. Journal of the Acoustical Society of America 30(11): 1035–1041.

Holmes, Jon N., Mattingly, Ignatius G., and Shearme, John N. 1965. Speech synthesis by rules. Language and Speech 7(3): 127–143.

Jassem, Wiktor. 1962. Noise spectra of Swedish, English, and Polish fricatives. Fourth International Congress on Acoustics, Copenhagen, paper G17.

Jassem, Wiktor. 1965. The formants of fricative consonants. Language and Speech 8(1): 1–16.

Jassem, Wiktor, Hill, David R., and Witten, Ian H. 1984. Isochrony in English speech: Its statistical validity and linguistic relevance. In Pattern, process and function in discourse phonology, ed. Gibbon, Davydd, 203–225. Berlin: de Gruyter.

von Kempelen, W. 1791. Le mécanisme de la parole, suivi de la déscription d'une machine parlante. Vienna: J. V. Degen.

Koenig, W., Dunn, H. K., and Lacy, L. Y. 1946. The sound spectrograph. Journal of the Acoustical Society of America 18(1): 19.

Kratzenstein, C. G. 1782. Sur la naissance de la formation des voyelles. Journal of Physics 21: 358–380.

Kuhl, Patricia K. 2000. A new view of language acquisition. Proceedings of the National Academy of Sciences 97(22): 11850–11857.

Kuhl, Patricia K., Conboy, Barbara T., Padden, Denise, Nelson, Tobey, and Pruitt, Jessica. 2005. Early speech perception and later language development: Implications for the “critical period”. Language Learning and Development 1(3/4): 237–264.

Ladefoged, Peter and Broadbent, Donald E. 1957. Information conveyed by vowels. Journal of the Acoustical Society of America 29(1): 98–104.

Lawrence, Walter. 1953. The synthesis of speech from signals which have a low information rate. In Communication theory, ed. Jackson, Willis, chap 34. London: Butterworths.

Liberman, Alvin M., Ingemann, Frances, Lisker, Leigh, Delattre, Pierre, and Cooper, Frank S. 1959. Minimal rules for synthesizing speech. Journal of the Acoustical Society of America 31(11): 1490–1499.

van Lieshout, Pascal. 2003. PRAAT short tutorial: A basic introduction. University of Toronto, Graduate Department of Speech-Language Pathology, Faculty of Medicine, Oral Dynamics Lab. <http://web.stanford.edu/dept/linguistics/corpora/material/PRAATworkshopmanualv421.pdf>. Accessed 2015-01-24.

Lisker, Leigh. 1957. Minimal cues for separating /w, r, l, y/ in intervocalic position. Word 13(2): 256–267.

Manzara, Leonard. 2005. The tube resonance model speech synthesizer. Presented at the 149th Meeting of the Acoustical Society of America/Canadian Acoustical Association, Vancouver. <https://www.researchgate.net/publication/228877073TheTubeResonanceModelSpeechSynthesizer>. Accessed 2016-09-19.

McCullough, Gretchen. 2014. When your eyes hear better than your ears: The McGurk effect. <http://tinyurl.com/lqbwzjb>.

McGurk, Harry and MacDonald, John. 1976. Hearing lips and seeing voices. Nature 264(5588): 746–748.

O'Connor, Joseph D., Gerstman, I. J., Liberman, Alvin M., Delattre, Pierre C., and Cooper, Frank S. 1957. Acoustic cues for the perception of initial /w, j, r, l/ in English. Word 13(1): 24–43.

O'Shaughnessey, D. 1977. Fundamental frequency by rule for a text-to-speech system. In Proceedings of the international conference on acoustics, speech, and signal processing, 571–574. New York: IEEE.

Palmer, Harold E. and Palmer, Dorothée. 1959. English through actions. London: Longmans Green. [1925; reprint ed. Ralph Cook].

de Pijper, Jan R. 1983. Modelling British English intonation. Dordrecht: Foris Publications.

Pike, Kenneth L. 1945. The intonation of American English. Ann Arbor: University of Michigan Press.

Potter, Ralph, Kopp, George A., and Kopp, Harriet Green. 1966. Visible speech. New York: Dover Publications. [1947. Murray Hill, NJ: Bell Telephone Laboratories].

Shearme, John N. and Holmes, John N. 1962. An experimental study of the classification of sounds in continuous speech according to their distribution in the formant 1–formant 2 plane. In Proceedings of the 4th international congress of phonetic sciences, Helsinki 1961. The Hague: Mouton.

Stevens, Ken N. 1968. On the relations between speech movements and speech perception. Zeitschrift für Phonetik, Sprachwissenschaft und Kommunikationsforschung 21: 102–106.

Story, Brad H. 2005. Physiologically-based speech simulation using an enhanced wave-reflection model of the vocal tract. Doctoral dissertation, University of Iowa.

Story, Brad H. 2013. Phrase-level speech simulation with an airway modulation model of speech production. Computer Speech and Language 27(4): 989–1010. Accompanying speech samples at <http://sal-slhs.webhost.uits.arizona.edu/node/30>. Accessed 2016-09-23.

Story, Brad H. and Bunton, K. 2011. Decomposition of vowel and consonant contributions to the time-varying vocal tract shape. Journal of the Acoustical Society of America 129(4): 2456.

Strevens, Peter. 1960. Spectra of fricative noise in human speech. Language and Speech 3(3): 32–49.

Strevens, Peter. 1961. Sibilant sounds of speech. The Dental Practitioner 11(11): 368–378.

Taube-Schock, Craig. 1993. Synthesizing intonation for computer speech output. Master's thesis, Department of Computer Science, University of Calgary, Calgary.

Taylor, Paul. 2009. Text-to-speech synthesis. Cambridge: Cambridge University Press.

Uldall, Elizabeth. 1964. Transitions in fricative noise. Language and Speech 7(1): 13–15.

Wells, John C. 1963. A study of the formants of the pure vowels of British English. Department of phonetics progress report, University College, London, London.

Willems, Nico, Collier, Ren, and ’t Hart, Johan. 1988. A synthesis scheme for British English Intonation. Journal of the Acoustical Society of America 84(4): 1250–1261.

Witten, Ian H. 1977. A flexible scheme for assigning timing and pitch to synthetic speech. Language and Speech 20(3): 240–260.

Yamagishi, Junichi, Richmond, Korin, King, Simon, and many others [sic]. 2007. Hidden Markov model-based speech synthesis. Ms., Centre for Speech Technology Research, University of Edinburgh. Available at <http://homepages.inf.ed.ac.uk/ckiw/rpml/HMMspeechsynthesis.pdf>. Accessed 2015-02-18.
