Background: Artificial intelligence, including large language models (LLMs) such as GPT-4, can generate responses to clinical queries using predictive algorithms trained on large online datasets. The current literature lacks a comprehensive assessment of the medical quality and accuracy of GPT-4-generated dermatologic outputs.
Methods: A standardized query was used to prompt two GPT-4-based models (Copilot and ChatGPT-4) to generate summaries and treatment recommendations for 33 dermatologic conditions, which were then compared to the corresponding sections of UpToDate (UTD) excerpts. DISCERN scores were calculated for each source by two authors (AN and PV). Concordance between GPT-4-generated treatments and UTD was evaluated by a certified dermatologist. Word counts and Flesch-Kincaid readability scores were generated in R, and paired t-tests and one-way and weighted ANOVAs were conducted in R.
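The readability metric named above can be reproduced from its published formula. As an illustrative sketch only (the study's actual computation was done in R, and the syllable-counting heuristic below is a simplification assumed for demonstration), the Flesch-Kincaid grade level can be computed as:

```python
import re


def count_syllables(word: str) -> int:
    # Naive heuristic: count vowel groups; drop a silent trailing 'e'.
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and n > 1 and not word.endswith(("le", "ee")):
        n -= 1
    return max(n, 1)


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level (Kincaid et al., 1975):
    0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59"""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)
```

Library implementations (e.g., the R packages used in the study) use more careful syllable dictionaries, so scores from this sketch may differ slightly.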
Results: The DISCERN instrument classified UTD content as being of "fair" medical quality (mean [SD], 3.08 [0.34]), while both ChatGPT-4 and Copilot produced content of "poor" medical quality (mean [SD], 2.28 [0.22] and 2.31 [0.35], respectively). ChatGPT-4's treatment recommendations demonstrated a mean concordance with UTD treatment recommendations (mean [SD], 64.89% [29.29%]) that was 33.5 percentage points higher than Copilot's (mean [SD], 31.38% [31.08%]) (95% CI, 22.3%-44.7%; p < 0.001).
Conclusions: Overall, GPT-4 models produced dermatologic content with few harmful recommendations. However, GPT-4-generated content performed poorly on the DISCERN instrument, and validation of LLM-generated responses remains challenging. The results suggest that LLM parameters and query structures may be optimizable for dermatologic applications. If implemented alongside the professional judgment of certified dermatologists, future LLMs may serve as time-saving dermatologic tools, enhancing patient care.
Keywords: AI; AI in dermatology; ChatGPT; artificial intelligence.
© 2026 The Author(s). Skin Research and Technology published by John Wiley & Sons Ltd.