Comparative analysis of large language models as decision support tools in oral pathology

Sci Rep. 2026 Feb 27;16(1):11272. doi: 10.1038/s41598-026-41533-z.

Abstract

This study evaluated the performance of four large language model (LLM)-based chatbots (ChatGPT-4.0, ChatGPT o1-preview, Gemini, and Meta AI) as decision-support systems for interpreting histopathologic descriptions of oral lesions, assessing agreement between their outputs and the diagnoses of oral pathologists. Each chatbot generated a suggested primary interpretation and three differential diagnoses. Outputs were categorized as Different, Similar, or Correct relative to the consensus reference diagnosis established by two board-certified pathologists. Statistical analyses included the Friedman test to compare model performance, Wilcoxon signed-rank tests for pairwise comparisons, Cohen's κ to assess agreement, and regression analyses to evaluate the influence of age and sex. Differential diagnosis performance was also analyzed. ChatGPT o1-preview demonstrated the highest proportion of outputs concordant with the reference diagnosis (68.6%), followed by Meta AI (65.7%), ChatGPT-4.0 (59.8%), and Gemini (27.5%). In terms of agreement with oral pathologists, ChatGPT o1-preview (κ = 0.66) and Meta AI (κ = 0.63) showed substantial agreement, ChatGPT-4.0 demonstrated moderate agreement (κ = 0.57), and Gemini showed poor agreement (κ = 0.24). Increasing patient age was associated with a mild but statistically significant reduction in model performance for ChatGPT-4.0, Meta AI, and Gemini, while no significant age effect was observed for ChatGPT o1-preview; patient sex had no significant impact. Among the evaluated chatbots, ChatGPT o1-preview showed the highest alignment with oral pathologists' reference diagnoses. These findings support the potential role of LLMs as complementary decision-support tools for interpreting oral histopathology descriptions, while highlighting substantial inter-model variability and the need for cautious implementation with continued human oversight.
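For readers unfamiliar with Cohen's κ, the agreement statistic reported above, the following is a minimal sketch of how it can be computed for two sets of categorical labels. It is not the authors' analysis code: the function name `cohen_kappa` and the toy diagnosis labels are hypothetical, chosen only to illustrate the formula κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e is the agreement expected by chance.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of cases where both raters give the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the two raters' marginal label frequencies,
    # summed over labels.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy data (NOT from the study): reference diagnoses vs. one chatbot.
reference = ["fibroma", "lichen planus", "fibroma",
             "mucocele", "lichen planus", "fibroma"]
model_out = ["fibroma", "lichen planus", "mucocele",
             "mucocele", "fibroma", "fibroma"]

kappa = cohen_kappa(reference, model_out)
print(round(kappa, 3))  # 0.478
```

In this toy example the raters agree on 4 of 6 cases (p_o ≈ 0.667), but about 0.361 agreement would be expected by chance alone, so κ is noticeably lower than the raw agreement rate. This is why the abstract reports κ alongside the percentage of concordant outputs.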

Keywords: Artificial intelligence; ChatGPT; Chatbot; Gemini; Large language models; Meta AI; Oral and maxillofacial pathology.

Publication types

  • Comparative Study

MeSH terms

  • Adolescent
  • Adult
  • Aged
  • Decision Support Systems, Clinical*
  • Decision Support Techniques*
  • Diagnosis, Differential
  • Female
  • Humans
  • Language*
  • Large Language Models
  • Male
  • Middle Aged
  • Pathology, Oral* / methods
  • Young Adult