Purpose: The integration of large language models (LLMs) such as ChatGPT into health care has garnered increasing interest. While previous studies have assessed these models using structured multiple-choice questions, limited research has evaluated their performance on open-ended, scenario-based clinical tasks, particularly in dentistry. This study aimed to evaluate and compare the clinical reasoning capabilities of ChatGPT-3.5 and GPT-4 in formulating treatment plans across seven dental specialties using realistic, open-ended clinical scenarios.
Methods: A cross-sectional analytical study, reported in accordance with the STROBE guidelines, was conducted using 70 dental cases spanning endodontics, oral and maxillofacial surgery, oral medicine, orthodontics, paediatric dentistry, periodontology, and radiology. Each case was submitted to both ChatGPT-3.5 and GPT-4 (paid version, November 2024). Responses were evaluated by specialty-specific expert panels using a three-level rubric (poor, average, good). Statistical analyses included chi-square tests and Fisher-Freeman-Halton exact tests (α = 0.05).
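The statistical comparison described above can be illustrated with a minimal sketch. This is not the authors' code: the cell counts below are hypothetical (only the 'good' column is anchored to the reported 44.3% and 67.1% of 70 cases per model), and it uses SciPy's chi2_contingency for the chi-square test of independence. The Fisher-Freeman-Halton exact test (the r × c generalisation of Fisher's exact test, used when expected counts are small) is not provided by SciPy, whose fisher_exact handles only 2 × 2 tables, and is typically run in R (fisher.test) or by simulation.

```python
# Minimal illustrative sketch (hypothetical counts, not the study data):
# compare the distribution of poor/average/good ratings between the two models.
from scipy.stats import chi2_contingency

# Rows: models; columns: poor, average, good.
# Each row sums to 70 cases; only the 'good' counts (31 and 47) reflect the
# reported 44.3% and 67.1%; the remaining cells are invented for illustration.
table = [
    [14, 25, 31],  # ChatGPT-3.5
    [ 6, 17, 47],  # GPT-4
]

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3f}")
```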
Results: GPT-4 significantly outperformed ChatGPT-3.5 in overall response quality (67.1% vs. 44.3% of responses rated 'good'; p = 0.016). Although no significant differences were observed in most specialties, GPT-4 performed significantly better in oral and maxillofacial surgery. Its advantage was more pronounced in complex cases, consistent with the model's enhanced contextual reasoning.
Conclusion: GPT-4 demonstrated superior accuracy and consistency compared with ChatGPT-3.5, particularly in clinically complex and integrative tasks. These findings support the potential of advanced LLMs as adjunct tools in dental education and decision-making, though specialty-specific application and expert oversight remain essential.
Keywords: dental care; dental education; large language models.