Classification of Voice Disorders Using a One-Dimensional Convolutional Neural Network

Shintaro Fujimura; Tsuyoshi Kojima; Yusuke Okanoue; Kazuhiko Shoji; Masato Inoue; Koichi Omori; Ryusuke Hori

doi:10.1016/j.jvoice.2020.02.009

J Voice. 2022 Jan;36(1):15-20. doi: 10.1016/j.jvoice.2020.02.009. Epub 2020 Mar 13.

Authors

Shintaro Fujimura¹, Tsuyoshi Kojima², Yusuke Okanoue³, Kazuhiko Shoji³, Masato Inoue⁴, Koichi Omori¹, Ryusuke Hori³

Affiliations

¹ Department of Otolaryngology-Head and Neck Surgery, Graduate School of Medicine, Kyoto University, Kyoto, Japan.
² Department of Otolaryngology, Tenri Hospital, Tenri, Nara, Japan. Electronic address: t_kojima@ent.kuhp.kyoto-u.ac.jp.
³ Department of Otolaryngology, Tenri Hospital, Tenri, Nara, Japan.
⁴ Department of Electrical Engineering and Bioscience, School of Advanced Science and Engineering, Waseda University, Shinjuku, Tokyo, Japan.

PMID: 32173149

Abstract

Objectives: Auditory-perceptual voice analysis is a standard method for quantifying pathological voice quality, but perceptual ratings are based on subjective evaluations and therefore may vary among examiners. Although many acoustic metrics have been studied for potential use in the objective evaluation of pathological voices, the interpretation of acoustic metrics in individual cases is difficult and the technique is not widely used by clinicians. The aim of this study was to establish standardized methods to discriminate the grade, roughness, breathiness, asthenia, and strain (GRBAS) scale scores of pathological voices directly using one-dimensional convolutional neural network (1D-CNN) models.

Methods: We constructed an original dataset utilizing 1,377 voice samples of sustained phonation of the vowel /a/. Each voice sample was rated by three experts according to the GRBAS scale, and the median values were used as the correct answer labels. We designed an end-to-end 1D-CNN model with a raw voice waveform input having a frame width of 9,600 samples. The models were trained with our original dataset for each GRBAS category individually, and the model performance was tested by five-fold cross-validation.
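The data preparation steps described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the non-overlapping hop between frames, the dropping of a trailing partial frame, and the unshuffled interleaved fold assignment are all assumptions made for illustration. Only the median-of-three labelling, the 9,600-sample frame width, and the use of five folds come from the abstract.

```python
from statistics import median

FRAME_WIDTH = 9600  # frame width in samples, as stated in the abstract
N_FOLDS = 5         # five-fold cross-validation, as stated in the abstract


def median_label(ratings):
    """Return the median of the three expert ratings for one GRBAS category."""
    # With an odd number of integer ratings, the median is one of the ratings.
    return int(median(ratings))


def split_into_frames(waveform, width=FRAME_WIDTH, hop=FRAME_WIDTH):
    """Cut a raw waveform into fixed-width frames.

    Assumption: frames do not overlap (hop == width) and a trailing
    partial frame is dropped. The abstract does not specify either choice.
    """
    last_start = len(waveform) - width
    return [waveform[s:s + width] for s in range(0, last_start + 1, hop)]


def five_fold_indices(n_samples, n_folds=N_FOLDS):
    """Yield (train_indices, test_indices) so each sample is tested once.

    Assumption: folds are assigned by interleaving sample indices without
    shuffling or stratification; the abstract does not describe the split.
    """
    folds = [list(range(k, n_samples, n_folds)) for k in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


# Hypothetical usage with made-up values.
label = median_label([1, 2, 2])                # ratings from three experts
frames = split_into_frames([0.0] * 20000)      # a silent dummy waveform
splits = list(five_fold_indices(1377))         # 1,377 samples, as in the abstract
```

Taking the median rather than the mean keeps the label on the integer rating scale and limits the influence of a single outlying rater.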

Results: The accuracy, F1 score, and quadratic weighted Cohen's kappa for the testing dataset were determined. The metrics for the G scale showed the most balanced model performance, with high accuracy (0.771) and substantial agreement (kappa = 0.710). The model for the R scale had relatively high accuracy (0.765) and F1 score (0.743) with moderate agreement (kappa = 0.536). The accuracy (0.883) and the F1 score (0.865) for the S scale were the highest among the five categories, whereas the Cohen's kappa was the lowest (0.190).
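To make the reported metrics concrete, the following Python sketch computes accuracy and quadratic weighted Cohen's kappa from their standard definitions. It is an illustration, not the authors' evaluation code. The four-class setting (ratings 0 to 3) reflects the usual GRBAS scoring, and the toy labels are made up. The example shows one way high accuracy can coexist with low kappa, namely when one rating dominates the dataset. The abstract does not report the class distribution for the S scale, so this explanation is an assumption.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def quadratic_weighted_kappa(y_true, y_pred, n_classes=4):
    """Cohen's kappa with quadratic disagreement weights.

    kappa = 1 - sum(w * observed) / sum(w * expected), where
    w[i][j] = (i - j)^2 / (n_classes - 1)^2 and expected counts come
    from the product of the row and column marginals.
    """
    n = len(y_true)
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1

    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(n_classes))
                  for j in range(n_classes)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            weight = (i - j) ** 2 / (n_classes - 1) ** 2
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * row_totals[i] * col_totals[j] / n

    if weighted_expected == 0:
        # Both label sets contain a single identical class: kappa is undefined.
        return float("nan")
    return 1.0 - weighted_observed / weighted_expected


# Made-up, heavily imbalanced toy data: 18 samples rated 0, 2 samples rated 1.
toy_true = [0] * 18 + [1] * 2
toy_pred = [0] * 20  # a model that always predicts the majority class

toy_accuracy = accuracy(toy_true, toy_pred)
toy_kappa = quadratic_weighted_kappa(toy_true, toy_pred)
```

In this toy case the always-majority predictor is right on 18 of 20 samples, yet its kappa is approximately zero because its agreement is no better than chance given the marginals.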

Conclusions: The end-to-end 1D-CNN models can evaluate overall pathological voice quality with a reliability comparable to human evaluations. The efficiency with which the machine learning models can be trained and evaluated is closely related to the dataset quality.

Keywords: Auditory perceptual voice analysis; Deep learning; GRBAS scale; One-dimensional convolutional neural network; Voice disorder.

MeSH terms

Humans
Neural Networks, Computer
Reproducibility of Results
Speech Acoustics
Voice Disorders* / diagnosis
Voice Quality