Objectives: To evaluate the performance of chatbots for discrete steps of a systematic review (SR) on artificial intelligence (AI) in pediatric dentistry.
Methods: Two chatbots (ChatGPT-4 and Gemini) and two non-expert reviewers were compared against two experts in a SR on AI in pediatric dentistry. Five tasks were assessed: (1) formulating a PICO question, (2) developing search queries for eight databases, (3) screening studies, (4) extracting data, and (5) assessing the risk of bias (RoB). Chatbots and non-experts received identical prompts, with the experts providing the reference standard. Performance was measured using accuracy, precision, sensitivity, specificity, and F1-score for the search and screening tasks, Cohen's kappa for the RoB assessment, and a modified Global Quality Score (1-5) for PICO question formulation and data extraction quality. Statistical comparisons were performed using Kruskal-Wallis and Dunn's post-hoc tests.
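The evaluation metrics named above follow their standard definitions; the sketch below illustrates, in Python, how they could be computed, assuming binary include/exclude screening decisions and categorical RoB judgements. All function names and example values are illustrative assumptions, not the authors' code.

```python
# Illustrative sketch (not the authors' implementation): computing the
# screening metrics, Cohen's kappa, and the Kruskal-Wallis comparison
# described in the Methods. Data values below are hypothetical.
from scipy.stats import kruskal

def screening_metrics(decisions, reference):
    """Accuracy, precision, sensitivity, specificity and F1 for binary
    include/exclude decisions against the expert reference standard."""
    tp = sum(d and r for d, r in zip(decisions, reference))
    tn = sum((not d) and (not r) for d, r in zip(decisions, reference))
    fp = sum(d and (not r) for d, r in zip(decisions, reference))
    fn = sum((not d) and r for d, r in zip(decisions, reference))
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    prec = tp / (tp + fp)
    f1 = 2 * prec * sens / (prec + sens)
    acc = (tp + tn) / (tp + tn + fp + fn)
    return {"accuracy": acc, "precision": prec, "sensitivity": sens,
            "specificity": spec, "F1": f1}

def cohens_kappa(rater_a, rater_b, categories=("low", "some concerns", "high")):
    """Chance-corrected agreement between two sets of RoB judgements."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    expected = sum((sum(a == c for a in rater_a) / n) *
                   (sum(b == c for b in rater_b) / n) for c in categories)
    return (observed - expected) / (1 - expected)

# Group-wise comparison of, e.g., data-extraction quality scores
# (Kruskal-Wallis; Dunn's post-hoc tests could follow, e.g. via scikit-posthocs).
chatgpt_scores, gemini_scores, nonexpert_scores = [32, 45, 18], [30, 41, 16], [29, 44, 20]
h_stat, p_value = kruskal(chatgpt_scores, gemini_scores, nonexpert_scores)
```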
Results: In PICO formulation, ChatGPT slightly outperformed Gemini, while the non-experts scored lowest. The experts identified 1261 records, compared with 569 (ChatGPT), 285 (Gemini), and 722 (non-experts). In screening, the chatbots achieved 90 % sensitivity, >60 % specificity, <25 % precision, and F1-scores <40 %, versus 84 % sensitivity, 91 % specificity, and a 39 % F1-score for the non-experts. For data extraction, the mean ± standard deviation scores (maximum 45) were 31.6 ± 12.3 for ChatGPT, 29.2 ± 12.3 for Gemini, and 30.4 ± 11.3 for the non-experts. For RoB, agreement with the experts was 49.4 % for ChatGPT, 51.2 % for Gemini, and 48.8 % for the non-experts (p > 0.05).
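For illustration, F1 = 2 × precision × sensitivity / (precision + sensitivity); taking precision ≈ 0.25 and sensitivity 0.90 gives F1 ≈ 0.39, which is why the chatbots' F1-scores remain below 40 % despite their high sensitivity.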
Conclusion: Chatbots could enhance SR efficiency, particularly for the study screening and data extraction steps. Human oversight remains critical for ensuring accuracy and completeness.
Clinical significance: Chatbots can streamline SR tasks such as screening and data extraction, potentially accelerating evidence synthesis, though human oversight remains necessary for reliability. The applicability of chatbots was found to depend on the specific SR step, indicating that reviewers need to make informed choices when employing chatbots for this purpose.
Keywords: Artificial intelligence; ChatGPT; Chatbot; Large language models; Pediatric dentistry.