Improving Hybrid CTC/Attention Architecture for Agglutinative Language Speech Recognition

Zeyu Ren; Nurmemet Yolwas; Wushour Slamu; Ronghe Cao; Huiru Wang

doi:10.3390/s22197319

Sensors (Basel). 2022 Sep 27;22(19):7319. doi: 10.3390/s22197319.

Authors

Zeyu Ren¹, Nurmemet Yolwas¹, Wushour Slamu¹, Ronghe Cao², Huiru Wang²

Affiliations

¹ Xinjiang Multilingual Information Technology Laboratory, Xinjiang Multilingual Information Technology Research Center, College of Information Science and Engineering, Xinjiang University, Urumqi 830017, China.
² College of Information Science and Engineering, Xinjiang University, Urumqi 830017, China.

Abstract

Unlike traditional models, an end-to-end (E2E) ASR model does not require linguistic resources such as a pronunciation dictionary; the whole system is built as a single neural network and achieves performance comparable to that of traditional methods. However, such a model requires massive amounts of training data. Recently, hybrid CTC/attention ASR systems have become more popular and have achieved good performance even under low-resource conditions, but they are rarely applied to Central Asian languages such as Turkish and Uzbek. We extend the dataset by adding noise to the original audio and applying speed perturbation. To improve the performance of an E2E agglutinative language speech recognition system, we propose a new feature extractor, MSPC, which uses convolution kernels of different sizes to extract and fuse features at different scales. The experimental results show that this structure is superior to VGGnet. In addition, the attention module is improved. By using the CTC objective function in training and the BERT model to initialize the language model in the decoding stage, the proposed method accelerates the convergence of the model and improves the accuracy of speech recognition. Compared with the baseline model, the character error rate (CER) and word error rate (WER) on the LibriSpeech test-other dataset are reduced by 2.42% and 2.96%, respectively. We apply the model structure to the Common Voice-Turkish (35 h) and Uzbek (78 h) datasets, and the WER is reduced by 7.07% and 7.08%, respectively. The results show that our method performs close to advanced E2E systems.
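
The two augmentation steps named in the abstract, noise addition and speed perturbation, can be illustrated with a minimal sketch. The abstract does not state the noise type, the signal-to-noise ratio, or the perturbation factors, so everything specific here is an assumption made for illustration: white Gaussian noise at a chosen SNR, resampling by linear interpolation, and the hypothetical function names `add_noise` and `speed_perturb`. It is not the authors' implementation.

```python
import math
import random


def add_noise(samples, snr_db, rng=None):
    """Add white Gaussian noise at roughly the requested SNR in dB.

    Assumption: the noise type and SNR are illustrative choices,
    not values taken from the paper.
    """
    rng = rng or random.Random(0)  # fixed seed keeps the sketch reproducible
    if not samples:
        return []
    signal_power = sum(s * s for s in samples) / len(samples)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in samples]


def speed_perturb(samples, factor):
    """Change playback speed by resampling with linear interpolation.

    factor > 1 shortens the signal (faster speech); factor < 1 lengthens it.
    Assumption: linear interpolation stands in for a proper resampler.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    last = len(samples) - 1
    n_out = int(len(samples) / factor)
    out = []
    for i in range(n_out):
        pos = i * factor              # fractional position in the input
        lo = min(int(pos), last)      # left neighbour
        hi = min(lo + 1, last)        # right neighbour, clamped at the end
        frac = pos - lo
        out.append((1 - frac) * samples[lo] + frac * samples[hi])
    return out


# One second of a 440 Hz tone at 16 kHz as a stand-in for real speech.
wave = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
augmented = [add_noise(speed_perturb(wave, f), snr_db=10.0) for f in (0.9, 1.0, 1.1)]
```

Each perturbed copy keeps the original transcript, so the amount of training audio grows without any new labelling. Note that resampling this way also shifts pitch.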

Keywords: MSPC; agglutinative language speech recognition; data augmentation; hybrid CTC/attention architecture; low-resource.

MeSH terms

Attention
Language
Speech Perception*
Speech Recognition Software
Speech*