Application of a Domain-specific BERT for Detection of Speech Recognition Errors in Radiology Reports

Gunvant R Chaudhari; Tengxiao Liu; Timothy L Chen; Gabby B Joseph; Maya Vella; Yoo Jin Lee; Thienkhai H Vu; Youngho Seo; Andreas M Rauschecker; Charles E McCulloch; Jae Ho Sohn

doi:10.1148/ryai.210185

Radiol Artif Intell. 2022 May 25;4(4):e210185. doi: 10.1148/ryai.210185. eCollection 2022 Jul.

Affiliation

¹ Department of Radiology and Biomedical Imaging (G.R.C., T.L., T.L.C., G.B.J., M.V., Y.J.L., T.H.V., Y.S., A.M.R., J.H.S.) and Department of Epidemiology and Statistics (C.E.M.), University of California San Francisco, 505 Parnassus Ave, San Francisco, CA 94143.

Abstract

Purpose: To develop radiology domain-specific bidirectional encoder representations from transformers (BERT) models that can identify speech recognition (SR) errors and suggest corrections in radiology reports.

Materials and methods: A pretrained BERT model, Clinical BioBERT, was further pretrained on a corpus of 114 008 radiology reports between April 2016 and August 2019 that were retrospectively collected from two hospitals. Next, the model was fine-tuned on a training dataset of generated insertion, deletion, and substitution errors, creating Radiology BERT. This model was retrospectively evaluated on an independent dataset of radiology reports with generated errors (n = 18 885) and on unaltered report sentences (n = 2000) and prospectively evaluated on true clinical SR errors (n = 92). Correction Radiology BERT was separately trained to suggest corrections for detected deletion and substitution errors. Area under the receiver operating characteristic curve (AUC) and bootstrapped 95% CIs were calculated for each evaluation dataset.
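The generated insertion, deletion, and substitution errors described above can be illustrated with a minimal sketch. The abstract does not specify how errors were generated, so the function name `inject_error`, the uniform random sampling, and the toy vocabulary below are all assumptions made for illustration, not the authors' procedure.

```python
import random

def inject_error(tokens, kind, vocab, rng):
    """Return (corrupted_tokens, position) with one synthetic error.

    Hypothetical helper: positions and replacement words are drawn
    uniformly at random, which is an assumption of this sketch.
    """
    corrupted = list(tokens)  # copy so the original sentence is untouched
    if kind == "insertion":
        pos = rng.randrange(len(corrupted) + 1)
        corrupted.insert(pos, rng.choice(vocab))
    elif kind == "deletion":
        pos = rng.randrange(len(corrupted))
        del corrupted[pos]
    elif kind == "substitution":
        pos = rng.randrange(len(corrupted))
        # Exclude the current word so the substitution always changes it.
        candidates = [w for w in vocab if w != corrupted[pos]]
        corrupted[pos] = rng.choice(candidates)
    else:
        raise ValueError(f"unknown error kind: {kind}")
    return corrupted, pos

rng = random.Random(0)  # fixed seed so the sketch is reproducible
sentence = "no acute intracranial hemorrhage".split()
vocab = ["effusion", "left", "mass", "normal"]  # toy vocabulary (assumed)

examples = {
    kind: inject_error(sentence, kind, vocab, rng)
    for kind in ("insertion", "deletion", "substitution")
}
for kind, (tokens, pos) in examples.items():
    print(kind, pos, " ".join(tokens))
```

Each corrupted sentence, paired with the error position, could then serve as a labeled training example for an error-detection model.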

Results: Radiology-specific BERT had AUC values of >0.99 (95% CI: >0.99, >0.99), 0.94 (95% CI: 0.93, 0.94), 0.98 (95% CI: 0.98, 0.98), and 0.97 (95% CI: 0.97, 0.97) for detecting insertion, deletion, substitution, and all errors, respectively, on the independently generated test set. Testing on unaltered report impressions revealed a sensitivity of 82% (28 of 34; 95% CI: 70%, 93%) and specificity of 88% (1521 of 1728; 95% CI: 87%, 90%). Testing on prospective SR errors showed an accuracy of 75% (69 of 92; 95% CI: 65%, 83%). Finally, the correct word was the top suggestion for 45.6% (475 of 1041; 95% CI: 42.5%, 49.3%) of errors.
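The reported percentages follow directly from the counts given above, and a percentile bootstrap is one common way to attach a 95% CI to such a proportion. The sketch below recomputes the point estimates and shows a simple percentile bootstrap. The abstract states only that the CIs were bootstrapped, so the resampling scheme, the number of resamples, and the helper name `bootstrap_ci` are assumptions, and the interval it returns need not match the published one.

```python
import random

# (numerator, denominator) pairs taken from the Results paragraph.
counts = {
    "sensitivity": (28, 34),
    "specificity": (1521, 1728),
    "prospective_accuracy": (69, 92),
    "top_suggestion_correct": (475, 1041),
}

# Point estimates as percentages.
percent = {name: 100 * k / n for name, (k, n) in counts.items()}

def bootstrap_ci(k, n, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a proportion k/n (illustrative sketch)."""
    rng = random.Random(seed)
    outcomes = [1] * k + [0] * (n - k)  # one 0/1 outcome per case
    stats = sorted(
        sum(rng.choices(outcomes, k=n)) / n  # resample with replacement
        for _ in range(n_boot)
    )
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

for name, value in percent.items():
    print(f"{name}: {value:.1f}%")

lo, hi = bootstrap_ci(*counts["sensitivity"])
print(f"sensitivity 95% CI (percentile bootstrap): {lo:.2f} to {hi:.2f}")
```

With only 34 positive cases, the sensitivity interval is much wider than the specificity interval, which is consistent with the pattern in the reported CIs.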

Conclusion: Radiology-specific BERT models fine-tuned on generated errors were able to identify SR errors in radiology reports and suggest corrections.

Supplemental material is available for this article.

© RSNA, 2022

See also the commentary by Abajian and Cheung in this issue.

Keywords: Computer Applications; Technology Assessment.