Chat Generative Pretrained Transformer (ChatGPT) and Bard: Artificial Intelligence Does not yet Provide Clinically Supported Answers for Hip and Knee Osteoarthritis

JaeWon Yang; Kyle S Ardavanis; Katherine E Slack; Navin D Fernando; Craig J Della Valle; Nicholas M Hernandez

doi:10.1016/j.arth.2024.01.029

J Arthroplasty. 2024 May;39(5):1184-1190. doi: 10.1016/j.arth.2024.01.029. Epub 2024 Jan 17.

Authors

JaeWon Yang¹, Kyle S Ardavanis², Katherine E Slack³, Navin D Fernando¹, Craig J Della Valle⁴, Nicholas M Hernandez¹

Affiliations

¹ Department of Orthopaedic Surgery, University of Washington, Seattle, Washington.
² Department of Orthopaedic Surgery, Madigan Medical Center, Tacoma, Washington.
³ Elson S. Floyd College of Medicine, Washington State University, Spokane, Washington.
⁴ Department of Orthopaedic Surgery, Rush University Medical Center, Chicago, Illinois.

PMID: 38237878
DOI: 10.1016/j.arth.2024.01.029

Abstract

Background: Advancements in artificial intelligence (AI) have led to the creation of large language models (LLMs), such as Chat Generative Pretrained Transformer (ChatGPT) and Bard, that analyze online resources to synthesize responses to user queries. Despite their popularity, the accuracy of LLM responses to medical questions remains unknown. This study aimed to compare the responses of ChatGPT and Bard regarding treatments for hip and knee osteoarthritis with the recommendations of the American Academy of Orthopaedic Surgeons (AAOS) Evidence-Based Clinical Practice Guidelines (CPGs).

Methods: Both ChatGPT (OpenAI) and Bard (Google) were queried regarding 20 treatments (10 for hip and 10 for knee osteoarthritis) from the AAOS CPGs. Responses were classified by 2 reviewers as being in "Concordance," "Discordance," or "No Concordance" with AAOS CPGs. Cohen's kappa coefficient was used to assess inter-rater reliability, and chi-squared analyses were used to compare responses between LLMs.

Results: Overall, ChatGPT and Bard provided responses that were concordant with the AAOS CPGs for 16 (80%) and 12 (60%) treatments, respectively. Notably, ChatGPT and Bard encouraged the use of non-recommended treatments in 30% and 60% of queries, respectively. There were no differences in performance when evaluating by joint or by recommended versus non-recommended treatments. Studies were referenced in 6 (30%) of the Bard responses and none (0%) of the ChatGPT responses. Of the 6 Bard responses, the cited studies could be identified for only 1 (16.7%). Of the remaining 5 responses, 2 (33.3%) cited studies in journals that did not exist, 2 (33.3%) cited studies that could not be found with the information given, and 1 (16.7%) provided links to unrelated studies.
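
Illustrative sketch (not the authors' code): the two statistics named in the Methods can be computed as below. The abstract does not report the reviewers' item-level ratings, so the two rating lists are hypothetical toy data used only to show the kappa calculation. The 2x2 table uses the concordance counts reported in the Results (16 of 20 for ChatGPT, 12 of 20 for Bard). Whether the authors tested this exact table, or applied a continuity correction, is not stated, so the resulting statistic is an assumption-laden illustration rather than a reproduction of the published analysis.

```python
from collections import Counter
from math import erfc, sqrt


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def chi_squared_2x2(table):
    """Pearson chi-squared without continuity correction for a 2x2 table.

    Returns (statistic, p_value) with 1 degree of freedom.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # Survival function of the chi-squared distribution with 1 df.
    p_value = erfc(sqrt(stat / 2))
    return stat, p_value


# HYPOTHETICAL ratings (C = Concordance, D = Discordance, N = No Concordance);
# the real reviewer ratings are not given in the abstract.
reviewer_1 = ["C", "C", "D", "N", "C"]
reviewer_2 = ["C", "C", "D", "C", "C"]
kappa = cohens_kappa(reviewer_1, reviewer_2)

# Rows: ChatGPT, Bard. Columns: concordant, not concordant (counts from the Results).
stat, p_value = chi_squared_2x2([[16, 4], [12, 8]])

print(f"kappa = {kappa:.3f}")
print(f"chi-squared = {stat:.3f}, p = {p_value:.3f}")
```

Under these assumptions, the uncorrected test on the 16/20 versus 12/20 table does not reach the conventional 0.05 threshold.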

Conclusions: Neither ChatGPT nor Bard consistently provides responses that align with the AAOS CPGs. Consequently, physicians and patients should temper expectations on the guidance AI platforms can currently provide.

Keywords: ChatGPT; artificial intelligence; Bard; large language models; machine learning.

MeSH terms

Artificial Intelligence
Humans
Language
Osteoarthritis, Hip* / therapy
Osteoarthritis, Knee* / therapy
Reproducibility of Results