Evaluating ChatGPT's Adherence to Hoarseness Guidelines: A Three-Rater Study Including an Otolaryngologist, an Audiologist, and the Model Itself

J Voice. 2026 Jan 2:S0892-1997(25)00538-7. doi: 10.1016/j.jvoice.2025.12.015. Online ahead of print.

Abstract

Objective: To assess the alignment of Chat Generative Pre-trained Transformer (ChatGPT), based on Generative Pre-trained Transformer 4 (GPT-4), with the 2018 Clinical Practice Guideline on Hoarseness (Dysphonia), using a structured three-rater evaluation involving an otolaryngologist, an audiologist, and ChatGPT.

Methods: Thirteen guideline statements were converted into 15 open-ended clinical questions, each answered independently by ChatGPT. Responses were assessed for consistency with the guideline on a three-point scale (consistent, partially consistent, inconsistent). Evaluations were performed by an otolaryngologist, an audiologist, and ChatGPT itself, with final adjudication by a senior otolaryngologist.

Results: Of the 15 responses, 13 (86.7%) were rated fully consistent by all three raters. The remaining two responses (13.3%) were rated partially consistent by one rater each; no response was rated inconsistent. Overall agreement across raters was 97.8%.
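One plausible reading of the 97.8% figure, offered here as an assumption rather than the study's stated method: with 3 raters scoring 15 responses there are 45 individual ratings, and if a partially consistent rating is credited at 0.5 relative to a fully consistent one, the weighted overall agreement matches the reported value:

\[
\frac{43 \times 1 + 2 \times 0.5}{3 \times 15} = \frac{44}{45} \approx 97.8\%
\]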

Conclusion: ChatGPT's responses showed high concordance with expert recommendations in the evaluation and management of hoarseness. These findings support the potential of large language models as adjunctive tools for patient education and clinical decision-making in voice disorders, when used under expert oversight.

Keywords: Artificial intelligence; ChatGPT; Clinical guideline adherence; Dysphonia; Hoarseness; Voice disorders.