Crossmixed convolutional neural network for digital speech recognition

Quoc Bao Diep; Hong Yen Phan; Thanh-Cong Truong

doi:10.1371/journal.pone.0302394

PLoS One. 2024 Apr 26;19(4):e0302394. doi: 10.1371/journal.pone.0302394. eCollection 2024.

Authors

Quoc Bao Diep¹, Hong Yen Phan¹, Thanh-Cong Truong²

Affiliations

¹ Faculty of Mechanical - Electrical and Computer Engineering, Van Lang University, Ho Chi Minh City, Vietnam.
² Faculty of Information Technology, University of Finance-Marketing, Ho Chi Minh City, Vietnam.

Abstract

Digital speech recognition is a challenging problem that requires learning complex signal characteristics such as frequency, pitch, intensity, timbre, and melody, which traditional methods often struggle to recognize. This article introduces three solutions based on convolutional neural networks (CNNs): 1D-CNN is designed to learn directly from the digital data, while 2DS-CNN and 2DM-CNN have a more complex architecture that converts the raw waveform into images using the Fourier transform in order to learn essential features. Experimental results on four large data sets, each containing 30,000 samples, show that the three proposed models outperform well-known models such as GoogLeNet and AlexNet, with best accuracies of 95.87%, 99.65%, and 99.76%, respectively. With 5-10% higher performance than the other models, the proposed solutions demonstrate the ability to learn features effectively, improve recognition accuracy and speed, and open up the potential for broad applications in virtual assistants, medical recording, and voice commands.
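
The abstract describes turning a raw waveform into an image with the Fourier transform before feeding it to a 2D CNN. As an illustration only, a minimal sketch of that kind of preprocessing is a short-time Fourier transform that yields a log-magnitude spectrogram. The function name `spectrogram_image`, the frame length, the hop size, the Hann window, and the 8 kHz sample rate are all assumptions made for this example; the abstract does not state the settings the authors used.

```python
import numpy as np


def spectrogram_image(waveform, frame_len=256, hop=128):
    """Convert a 1D waveform into a 2D log-magnitude spectrogram in [0, 1].

    Hypothetical helper: frame_len and hop are illustrative defaults,
    not values taken from the paper.
    Returns an array of shape (frame_len // 2 + 1, n_frames).
    """
    waveform = np.asarray(waveform, dtype=np.float64)
    # Pad very short signals so that at least one full frame exists.
    if waveform.size < frame_len:
        waveform = np.pad(waveform, (0, frame_len - waveform.size))

    n_frames = 1 + (waveform.size - frame_len) // hop
    window = np.hanning(frame_len)

    # Slice the signal into overlapping, windowed frames.
    frames = np.stack(
        [waveform[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )

    # Real FFT of each frame gives the magnitude spectrum over time.
    magnitude = np.abs(np.fft.rfft(frames, axis=1))

    # Log compression, then min-max scaling to an image-like [0, 1] range.
    log_mag = np.log1p(magnitude)
    span = log_mag.max() - log_mag.min()
    if span > 0:
        image = (log_mag - log_mag.min()) / span
    else:
        image = np.zeros_like(log_mag)

    # Rows are frequency bins, columns are time frames.
    return image.T


# Usage: a one-second 1 kHz tone sampled at an assumed 8 kHz.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000 * t)

image = spectrogram_image(tone)
peak_bin = int(image.mean(axis=1).argmax())
print(image.shape, peak_bin)
```

For the test tone, the brightest row of the image sits at the frequency bin nearest 1 kHz, which is the kind of time-frequency structure a 2D CNN can learn from. A 1D CNN, by contrast, would consume the `tone` array directly without this step.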

Copyright: © 2024 Diep et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms
Humans
Neural Networks, Computer*
Speech / physiology
Speech Recognition Software*

Grants and funding

The author(s) received no specific funding for this work.