Multi-model Artificial Intelligence Evaluation in Sudden Sensorineural Hearing Loss

Otolaryngol Head Neck Surg. 2026 Apr;174(4):980-988. doi: 10.1002/ohn.70143. Epub 2026 Jan 28.

Abstract

Objective: To compare the diagnostic accuracy, linguistic clarity, and user satisfaction of three large language models (ChatGPT-4.0, Claude 3.7 Sonnet, and OpenAI Mini 3) in managing sudden sensorineural hearing loss.

Study design: Prospective, multi-domain comparative analysis using blinded expert evaluation.

Setting: Online artificial intelligence (AI) platforms accessed under standardized conditions.

Methods: Twenty-seven questions related to sudden sensorineural hearing loss (covering general knowledge, audiometric interpretation, and clinical case scenarios) were submitted to the three AI models. Responses were evaluated by 10 board-certified otolaryngologists using three validated tools: Quality Assessment of Medical Artificial Intelligence (QAMAI), Artificial Intelligence Performance Instrument (AIPI), and Artificial Intelligence Satisfaction and Performance Evaluation Questionnaire (AISPE-Q). Linguistic complexity was assessed using metrics such as word count, sentence length, lexical diversity, and clinical verb use.
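
The abstract does not say how the linguistic metrics were computed. As a rough illustration only, the sketch below derives word count, mean sentence length, and lexical diversity for a single response. It assumes lexical diversity is measured as a type-token ratio and that sentences end at `.`, `!`, or `?`; both are assumptions, and the function name `linguistic_metrics` is hypothetical. Clinical verb use is not covered because the abstract gives no verb list.

```python
import re


def linguistic_metrics(text: str) -> dict:
    """Compute simple surface metrics for one model response (illustrative only)."""
    # Crude sentence split on terminal punctuation; abbreviations such as
    # "e.g." would be miscounted, so this is only a heuristic.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]

    # Lowercased word tokens; internal hyphens and apostrophes stay inside a token.
    words = re.findall(r"[a-z0-9]+(?:['-][a-z0-9]+)*", text.lower())
    word_count = len(words)

    return {
        "word_count": word_count,
        "sentence_count": len(sentences),
        # Mean number of words per sentence.
        "mean_sentence_length": word_count / len(sentences) if sentences else 0.0,
        # Assumed definition of lexical diversity: unique words / total words.
        "type_token_ratio": len(set(words)) / word_count if word_count else 0.0,
    }
```

Because a type-token ratio falls as text gets longer, comparing models this way is only fair when responses are of similar length or a length-corrected measure is used instead.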

Results: ChatGPT-4.0 demonstrated the highest scores in clinical accuracy (QAMAI: 4.57), completeness (4.53), and evaluator satisfaction (AISPE-Q: 94%). Claude 3.7 outperformed the other models in clarity and sentence complexity, while OpenAI Mini 3 exhibited the highest lexical diversity and directive tone but scored lower overall. Inter-rater reliability was strong (intraclass correlation coefficient [ICC] > 0.85). Correlation analysis revealed a significant relationship between objective quality and subjective satisfaction (r > 0.76).
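
The abstract reports r without naming the coefficient. Assuming it is Pearson's product-moment correlation (an assumption, since a rank-based coefficient is also plausible), the relationship between paired quality and satisfaction scores could be computed as in the sketch below. The function name `pearson_r` is hypothetical and no study data are reproduced.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")

    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)

    # Sum of cross-products of deviations from the means.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Square roots of the sums of squared deviations.
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))

    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cross / (dev_x * dev_y)
```

Here `x` would hold one objective quality score per response (for example a QAMAI item mean) and `y` the matching satisfaction score, in the same order.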

Conclusion: ChatGPT-4.0 delivered the most clinically aligned and satisfactory responses, whereas Claude 3.7 provided linguistically refined outputs. Our findings support the context-specific application of hybrid large language model approaches in otolaryngology, particularly for patient education, diagnosis, and AI-driven triage.

Level of evidence: 2 (prospective comparative diagnostic accuracy study).

Keywords: ChatGPT; Claude Sonnet; SSHL; artificial intelligence; large language models; linguistic analysis; otolaryngology.

Publication types

  • Comparative Study

MeSH terms

  • Artificial Intelligence*
  • Generative Artificial Intelligence
  • Hearing Loss, Sensorineural* / diagnosis
  • Hearing Loss, Sudden* / diagnosis
  • Humans
  • Intelligent Systems
  • Large Language Models
  • Prospective Studies