Background: Large language models (LLMs) such as GPT-4 can interpret free text, but unreliable answers, opaque reasoning, and privacy risks limit their use in healthcare. In contrast, rule-based artificial intelligence (AI) provides transparent and reproducible results but struggles with free text. We aimed to combine the strengths of both approaches to test whether such a hybrid system can autonomously and reliably extract clinical data from diagnostic imaging reports.
Methods: We developed a neuro-symbolic AI that connects GPT-4 with a rule-based expert system through a semantic integration platform. GPT-4 extracted candidate facts from free-text reports, while the expert system verified them against medical rules, producing traceable, deterministic labels. We evaluated the system on 206 consecutive prostate cancer PET/CT scan reports, requiring extraction of 26 clinical parameters per report, generating 5356 data points, and answering three study questions: study inclusion, recurrent cancer identification, and prostate-specific antigen (PSA) level retrieval. Outputs were compared against physician-derived references, and discrepancies were reviewed by a blinded adjudicator.
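The extract-then-verify pipeline described above can be sketched in a few lines of Python. This is a minimal illustration of the pattern only: the function names, the candidate facts, and the verification rules are all hypothetical placeholders, not the authors' implementation or their actual medical rule base.

```python
# Hypothetical sketch of the neuro-symbolic pattern: an LLM proposes
# candidate facts, a deterministic rule base accepts or rejects each one.
# All names, fields, and thresholds below are illustrative assumptions.

def llm_extract(report_text):
    """Stand-in for the LLM step: return candidate facts from free text."""
    # In the real system GPT-4 parses the report; here we return a fixed example.
    return {"psa_ng_ml": 4.2, "recurrence": True, "tracer": "PSMA"}

def verify(facts):
    """Rule-based step: keep a fact only if it passes a deterministic check."""
    rules = {
        "psa_ng_ml": lambda v: isinstance(v, (int, float)) and 0 <= v < 10000,
        "recurrence": lambda v: isinstance(v, bool),
        "tracer": lambda v: v in {"PSMA", "FDG", "Choline"},
    }
    verified, rejected = {}, {}
    for key, value in facts.items():
        ok = rules.get(key, lambda v: False)(value)  # unknown keys are rejected
        (verified if ok else rejected)[key] = value
    return verified, rejected

verified, rejected = verify(llm_extract("…free-text PET/CT report…"))
```

Because every accepted label must pass an explicit rule, each output carries a traceable justification, which is the property the abstract refers to as an auditable chain of reasoning.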
Results: Here we show that neuro-symbolic AI outperforms GPT-4 alone and matches physicians in structuring and analysing reports. GPT-4 alone achieves F1 scores of 0.63 for study inclusion and 0.95 for recurrence detection, with 96.6% correct PSA values. Physicians reach F1 scores of 1.00 and 0.99, with 98.1% PSA accuracy. The neuro-symbolic AI achieves F1 scores of 1.00 on both questions with 100% PSA accuracy and always delivers an auditable chain of reasoning. It intercepts two intentionally introduced reports containing residual identifiers, preventing unintended transfer of sensitive data.
Conclusions: Unlike standalone LLMs, neuro-symbolic AI can safely automate data extraction for clinical research and may provide a path toward trustworthy AI in healthcare practice.
Medical doctors often write reports as free text, which is hard to reuse for research or care. A large language model is software that reads and writes text by imitating large networks of brain cells. This type of artificial intelligence can extract and organize important information from medical reports, but its reasoning is opaque, its answers can be wrong, and it raises privacy concerns. Rule-based artificial intelligence is transparent, reproducible, and privacy-preserving, but struggles with free text. We combined both types of artificial intelligence so that each offsets the other's weaknesses. We tested the system on 206 prostate cancer imaging reports, where it extracted information correctly, showed how it reached its answers, and protected sensitive data. Pairing large language models with rule-based systems could make artificial intelligence safer, more trustworthy, and more useful in healthcare.
© 2025. The Author(s).