neuroGPT-X: toward a clinic-ready large language model

J Neurosurg. 2023 Oct 6;140(4):1041-1053. doi: 10.3171/2023.7.JNS23573. Print 2024 Apr 1.

Abstract

Objective: The objective was to assess the performance of a context-enriched large language model (LLM) compared with international neurosurgical experts on questions related to the management of vestibular schwannoma. Furthermore, another objective was to develop a chat-based platform incorporating in-text citations, references, and memory to enable accurate, relevant, and reliable information in real time.

Methods: The analysis involved 1) creating a data set through web scraping, 2) developing a chat-based platform called neuroGPT-X, 3) enlisting 8 expert neurosurgeons across international centers to independently create questions (n = 1) and to answer (n = 4) and evaluate responses (n = 3) while blinded, and 4) analyzing the evaluation results on the management of vestibular schwannoma. In the blinded phase, all answers were assessed for accuracy, coherence, relevance, thoroughness, speed, and overall rating. All experts were unblinded and provided their thoughts on the utility and limitations of the tool. In the unblinded phase, all neurosurgeons provided answers to a Likert scale survey and long-answer questions regarding the clinical utility, likelihood of use, and limitations of the tool. The tool was then evaluated on the basis of a set of 103 consensus statements on vestibular schwannoma care from the 8th Quadrennial International Conference on Vestibular Schwannoma.

Results: Responses from the naive and context-enriched Generative Pretrained Transformer (GPT) models were consistently rated not significantly different in terms of accuracy, coherence, relevance, thoroughness, and overall performance, and they were often rated significantly higher than expert responses. Both the naive and content-enriched GPT models provided faster responses to the standardized question set than expert neurosurgeon respondents (p < 0.01). The context-enriched GPT model agreed with 98 of the 103 (95%) consensus statements. Of interest, all expert surgeons expressed concerns about the reliability of GPT in accurately addressing the nuances and controversies surrounding the management of vestibular schwannoma. Furthermore, the authors developed neuroGPT-X, a chat-based platform designed to provide point-of-care clinical support and mitigate the limitations of human memory. neuroGPT-X incorporates features such as in-text citations and references to enable accurate, relevant, and reliable information in real time.

Conclusions: The present study, with its subspecialist-level performance in generating written responses to complex neurosurgical problems for which evidence-based consensus for management is lacking, suggests that context-enriched LLMs show promise as a point-of-care medical resource. The authors anticipate that this work will be a springboard for expansion into more medical specialties, incorporating evidence-based clinical information and developing expert-level dialogue surrounding LLMs in healthcare.

Keywords: GPT; acoustic schwannoma; large language models; neuroGPT-X; vestibular schwannoma.

MeSH terms

  • Humans
  • Language
  • Medicine*
  • Neuroma, Acoustic* / surgery
  • Neurosurgeons
  • Reproducibility of Results