Purpose: To assess whether large language models (LLMs) with advanced reasoning and live web search (LWS) provide recommendations concordant with evidence-based clinical practice guidelines (CPGs) developed by the American Academy of Orthopaedic Surgeons (AAOS) for anterior cruciate ligament (ACL) and rotator cuff (RC) injury management.
Methods: Recommendations were extracted from the CPGs and converted into 46 questions (ACL, n = 15; RC, n = 31). Four configurations were evaluated: GPT-5 Thinking, GPT-5 Thinking Deep Research, Gemini 2.5 Pro, and Gemini 2.5 Pro Deep Research. Concordance with CPGs, the primary endpoint, was independently evaluated by two orthopaedic surgeons. Citation integrity, the secondary endpoint, was evaluated against four criteria: (1) relevance, ensuring the citation was congruent with the response; (2) accuracy, confirming the citation metadata were correct; (3) existence, to rule out hallucinations; and (4) source quality, ensuring the cited source was from a peer-reviewed journal. Evaluators were blinded by a third investigator, who anonymised the LLM-generated responses and randomised their order for each CPG recommendation.
Results: All four configurations answered the ACL questions concordantly (100% [15/15]; 95% confidence interval [CI]: 78.2%-100%). For RC questions, GPT-5 Thinking and Gemini 2.5 Pro Deep Research each had one discordant answer (96.8% [30/31]; 95% CI: 83.3%-99.9%), whereas the other two configurations were fully concordant (100% [31/31]; 95% CI: 88.7%-100%). For citation integrity, GPT-5 Thinking achieved 96.8% (231/239; 95% CI: 93.6%-98.6%), improving to 100% (176/176; 95% CI: 97.9%-100%) with Deep Research. Gemini 2.5 Pro showed substantially lower baseline performance (64.6% [173/268]; 95% CI: 58.5%-70.3%) but improved to 98.6% (274/278; 95% CI: 96.4%-99.6%) with Deep Research. Inter-rater agreement was perfect (κ = 1.0) across all domains except citation relevance, for which agreement was strong (κ = 0.88).
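Note on the intervals: the abstract does not state how the 95% CIs were computed. As a minimal sketch, assuming exact (Clopper-Pearson) binomial intervals, the bounds can be checked from the raw counts using only the Python standard library. The function names below are illustrative and are not taken from the study.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_increasing(f, target: float, iters: int = 60) -> float:
    """Bisection on [0, 1] for an increasing function f with f(p) = target."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(x: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Exact two-sided (1 - alpha) interval for x successes in n trials.

    Assumption: the Clopper-Pearson method; the source does not name its method.
    """
    # Lower bound: p at which P(X >= x) equals alpha / 2 (0 when x == 0).
    lower = 0.0 if x == 0 else _solve_increasing(
        lambda p: 1 - binom_cdf(x - 1, n, p), alpha / 2
    )
    # Upper bound: p at which P(X <= x) equals alpha / 2 (1 when x == n).
    upper = 1.0 if x == n else _solve_increasing(
        lambda p: 1 - binom_cdf(x, n, p), 1 - alpha / 2
    )
    return lower, upper


if __name__ == "__main__":
    for label, x, n in [("ACL, all configurations", 15, 15),
                        ("RC, one discordant answer", 30, 31)]:
        lo, hi = clopper_pearson(x, n)
        print(f"{label}: {x}/{n} -> {100 * lo:.1f}%-{100 * hi:.1f}%")
```

Under this assumption, 15/15 gives a lower bound of about 78.2%, in line with the interval reported for the ACL questions. Other rows may differ in the last decimal place depending on the method and rounding the authors used.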
Conclusions: Contemporary LLMs with agentic capabilities can deliver clinically aligned answers concordant with CPGs on ACL and RC injuries, largely overcoming the hallucinations seen with earlier models. Built-in LWS functions are particularly helpful in ensuring citation reliability. Although expert oversight remains imperative, Deep Research allows LLMs to be considered as first-pass clinical reasoning companions.
Level of evidence: NA.
Keywords: deep research; guideline concordance; large language models; literature citation integrity; live web search.
© 2026 European Society of Sports Traumatology, Knee Surgery and Arthroscopy.