
Spoken Arabic dialect recognition using X-vectors

Published online by Cambridge University Press: 04 May 2020

Abualsoud Hanani*
Affiliation:
Electrical and Computer Engineering, Birzeit University, Palestine
Rabee Naser
Affiliation:
Electrical and Computer Engineering, Birzeit University, Palestine
*Corresponding author. E-mail: [email protected]

Abstract

This paper describes our automatic dialect identification system for recognizing four major Arabic dialects, as well as Modern Standard Arabic. We adapted the X-vector framework, which was originally developed for speaker recognition, to the task of Arabic dialect identification (ADI). The training and development sets of the ADI tasks at VarDial 2018 and VarDial 2017 were used to train and test all of our ADI systems. In addition to the X-vector system, we built systems based on traditional i-vectors, bottleneck features, phonetic features, word transcriptions, and GMM-tokens. The X-vector system scored 0.687 on the test set of the ADI 2018 Discriminating between Similar Languages shared task, outperforming the other systems. Fusing the X-vector system with i-vectors, bottleneck features, and word unigram features improved the score slightly, to 0.697.
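The abstract does not say how the subsystems were fused, so the following is only a minimal sketch of one common option: score-level fusion, in which each system's per-class scores are normalized to log-posteriors and combined with a weighted sum. The fusion rule, the weights, the class labels (`D1` to `D4` plus `MSA`), and the toy scores are all assumptions made for illustration; none of them come from the paper.

```python
import numpy as np

# Placeholder labels: four dialects plus Modern Standard Arabic.
# The names are hypothetical; the abstract does not list them.
LABELS = ["D1", "D2", "D3", "D4", "MSA"]


def log_softmax(scores):
    """Turn raw per-class scores (utterances x classes) into log-posteriors."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def fuse_scores(system_scores, weights):
    """Weighted sum of per-system log-posteriors (assumed fusion rule)."""
    return sum(w * log_softmax(s) for s, w in zip(system_scores, weights))


def predict(system_scores, weights):
    """Pick the highest-scoring label for each utterance after fusion."""
    fused = fuse_scores(system_scores, weights)
    return [LABELS[i] for i in fused.argmax(axis=1)]


# Toy scores for two utterances from two hypothetical subsystems.
xvector_scores = np.array([[2.0, 0.5, 0.1, 0.0, 0.3],
                           [0.2, 0.1, 1.0, 1.1, 0.0]])
ivector_scores = np.array([[1.5, 0.7, 0.2, 0.1, 0.1],
                           [0.1, 0.2, 1.8, 0.9, 0.3]])

# Hypothetical weights; in practice they would be tuned on development data.
print(predict([xvector_scores, ivector_scores], [0.6, 0.4]))
```

In this toy case the two subsystems disagree on the second utterance, and the fused decision follows the system whose score margin, after weighting, is larger.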

Type
Article
Copyright
© Cambridge University Press 2020

