Automatic Speech Recognition by Machines

doi:10.1017/9781108644198.020

19 - Automatic Speech Recognition by Machines

from Section IV - Audition and Perception

Published online by Cambridge University Press: 11 November 2021

Sabato Marco Siniscalchi and Chin-Hui Lee

Edited by Rachael-Anne Knight and Jane Setter


Rachael-Anne Knight, Affiliation: City, University of London
Jane Setter, Affiliation: University of Reading


Summary

Building machines that converse with human beings through automatic speech recognition (ASR) and understanding (ASU) has long been a topic of great interest for scientists and engineers, and the area has recently seen rapid technological advances. Here, we first cast the ASR problem in a pattern-matching and channel-decoding paradigm. We then discuss the Hidden Markov Model (HMM), the most successful technique for modelling fundamental speech units, such as phones and words, which allows ASR to be solved as a search through a top-down decoding network. Recent advances that use deep neural networks as components of an ASR system are also highlighted. We then compare the conventional top-down decoding approach with the recently proposed automatic speech attribute transcription (ASAT) paradigm, which can better leverage knowledge sources in speech production, auditory perception and language theory through bottom-up integration. Finally, we discuss how the processing-based speech engineering and knowledge-based speech science communities can work collaboratively to improve our understanding of speech and enhance ASR capabilities.

Keywords

Automatic speech recognition; knowledge-based speech recognition; deep neural networks; Hidden Markov Models; word hypotheses rescoring

Type: Chapter
Information: The Cambridge Handbook of Phonetics, pp. 480–500

DOI: https://doi.org/10.1017/9781108644198.020

Publisher: Cambridge University Press

Print publication year: 2021

