Introduction
This study evaluated the performance of three artificial intelligence (AI) chatbots (GPT-3.5 (OpenAI, San Francisco, USA), GPT-4o (OpenAI, San Francisco, USA), and DeepSeek V3 0324 (DeepSeek AI, Beijing, China)) against eight gynecology residents in answering questions related to gestational diabetes mellitus (GDM), aiming to assess and compare the accuracy and completeness of responses to standardized patient questions on GDM.

Methods
Twenty-four questions were answered by the three chatbots (GPT-3.5, GPT-4o, and DeepSeek V3 0324) and the eight residents. Two faculty members independently rated the responses for accuracy and completeness using a 5-point scale. Independent-samples t-tests were used for statistical analysis.

Results
The mean accuracy scores were 3.64 for residents, 4.67 for GPT-3.5, 4.69 for GPT-4o, and 4.81 for DeepSeek V3 0324. The mean completeness scores were 2.05 for residents, 2.83 for GPT-3.5, 4.00 for GPT-4o, and 4.75 for DeepSeek V3 0324. T-tests showed that all AI models had significantly higher accuracy than residents (p < 0.001). Completeness scores were significantly higher for GPT-4o and DeepSeek V3 0324 (p < 0.001), while the difference in completeness between GPT-3.5 and residents was not statistically significant (p = 0.058).

Conclusion
AI models, particularly DeepSeek V3 0324 and GPT-4o, outperformed gynecology residents in both accuracy and completeness when answering GDM-related questions. These preliminary findings suggest that AI tools may complement medical education and clinical support, but further research is required before broader implementation.
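For illustration only, the sketch below shows how the independent-samples t-test described in Methods could be run in Python with SciPy. The score arrays are hypothetical placeholders, not the study data, and comparing residents against a single pooled chatbot group is an assumption made for brevity.

```python
# Minimal sketch of the independent-samples t-test analysis described in
# Methods. The scores below are HYPOTHETICAL placeholders, not the study data.
from scipy import stats

# Hypothetical per-question mean accuracy ratings on the 5-point scale
# (24 questions each), assumed already averaged across the two faculty raters.
resident_scores = [3.5, 4.0, 3.0, 4.5, 3.5, 4.0, 3.0, 3.5,
                   4.0, 3.5, 3.0, 4.0, 3.5, 4.5, 3.0, 3.5,
                   4.0, 3.5, 4.0, 3.0, 3.5, 4.0, 3.5, 4.0]
chatbot_scores = [5.0, 4.5, 5.0, 4.5, 5.0, 4.5, 5.0, 4.5,
                  5.0, 4.5, 4.5, 5.0, 4.5, 5.0, 4.5, 5.0,
                  4.5, 5.0, 4.5, 5.0, 4.5, 5.0, 4.5, 5.0]

# Independent-samples t-test comparing the two groups.
t_stat, p_value = stats.ttest_ind(resident_scores, chatbot_scores)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```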
Keywords: artificial intelligence; chatbots; gestational diabetes mellitus; pregnancy; residents.
Copyright © 2025, Faraji et al.