ChatGPT's performance in dentistry and allergy-immunology assessments: a comparative study

Alexander Fuchs; Tina Trachsel; Roland Weiger; Florin Eggmann

doi:10.61872/sdj-2024-06-01


Swiss Dent J. 2023 Oct 4;134(2):1-17. doi: 10.61872/sdj-2024-06-01.

Authors

Alexander Fuchs¹, Tina Trachsel², Roland Weiger¹, Florin Eggmann¹

Affiliations

¹ Department of Periodontology, Endodontology and Cariology, University Center for Dental Medicine Basel UZB, University of Basel, Basel, Switzerland. florin.eggmann@unibas.ch.
² Division of Allergy, University Children's Hospital Basel, Basel, Switzerland.

PMID: 38726506

Abstract

Large language models (LLMs) such as ChatGPT have potential applications in healthcare, including dentistry. Priming, the practice of providing an LLM with initial, relevant information, is one approach to improving output quality. This study aimed to evaluate the performance of ChatGPT 3 and ChatGPT 4 on self-assessment questions in dentistry, using the Swiss Federal Licensing Examination in Dental Medicine (SFLEDM), and in allergy and clinical immunology, using the European Examination in Allergy and Clinical Immunology (EEAACI). The second objective was to assess the impact of priming on ChatGPT's performance. The SFLEDM and EEAACI multiple-choice questions from the University of Bern's Institute for Medical Education platform were administered to both ChatGPT versions, with and without priming. Performance was analyzed based on the number of correct responses. The statistical analysis included Wilcoxon rank sum tests (alpha = 0.05). The average accuracy rates in the SFLEDM and EEAACI assessments were 63.3% and 79.3%, respectively. Both ChatGPT versions performed better on the EEAACI than on the SFLEDM, with ChatGPT 4 outperforming ChatGPT 3 across all tests. ChatGPT 3's performance improved significantly with priming in both the EEAACI (p = 0.017) and SFLEDM (p = 0.024) assessments. For ChatGPT 4, the priming effect was significant only in the SFLEDM assessment (p = 0.038). The performance disparity between the SFLEDM and EEAACI assessments underscores ChatGPT's varying proficiency across medical domains, likely tied to the nature and amount of training data available in each field. Priming can be a tool for enhancing output, especially in earlier LLMs. The advances from ChatGPT 3 to ChatGPT 4 highlight the rapid development of LLM technology. Yet LLMs should be used cautiously in critical fields such as healthcare, owing to their inherent limitations and risks.
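The abstract states that comparisons were made with Wilcoxon rank sum tests at alpha = 0.05. The following is a minimal sketch of how that test works, not the authors' analysis code. It assumes each sample is a list of numeric scores, for example hypothetical accuracy rates from repeated runs with and without priming; the study's exact unit of analysis is not given here. The function names are illustrative, and the p-value uses the large-sample normal approximation with a tie correction. Established statistics packages also provide exact p-values for small samples.

```python
import math
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Assign 1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at sorted position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def rank_sum_test(x: Sequence[float], y: Sequence[float]) -> tuple[float, float]:
    """Two-sided Wilcoxon rank-sum test (normal approximation, no continuity correction).

    Returns (W, p), where W is the sum of the ranks of sample x in the pooled data.
    """
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    pooled = list(x) + list(y)
    ranks = average_ranks(pooled)
    w = sum(ranks[:n1])
    n = n1 + n2
    mean_w = n1 * (n + 1) / 2  # expected rank sum of x under the null hypothesis
    # Tie correction: sum of (t^3 - t) over each group of t tied values.
    counts: dict[float, int] = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t**3 - t for t in counts.values())
    var_w = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_w == 0:
        return w, 1.0  # all values tied: no evidence of a difference
    z = (w - mean_w) / math.sqrt(var_w)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value from the standard normal
    return w, p


if __name__ == "__main__":
    # Hypothetical scores for illustration only; not data from the study.
    unprimed = [0.60, 0.62, 0.58, 0.61]
    primed = [0.66, 0.68, 0.64, 0.67]
    w, p = rank_sum_test(primed, unprimed)
    print(f"W = {w}, p = {p:.4f}, significant at alpha = 0.05: {p < 0.05}")
```

Ranking the pooled scores, rather than comparing raw means, makes the test robust to skewed score distributions. With very small samples like the hypothetical ones above, an exact p-value would be preferable to the normal approximation.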

Keywords: Allergology; Artificial intelligence; Clinical immunology; Dental education; Machine learning; Medical informatics applications.

Publication types

Comparative Study

MeSH terms

Allergy and Immunology* / education
Clinical Competence
Education, Dental
Educational Measurement*
Humans
Switzerland