Large language models accurately extract aortic information from abdominal imaging reports in a large, real-world database

J Vasc Surg. 2025 Nov 5:S0741-5214(25)01885-3. doi: 10.1016/j.jvs.2025.10.044. Online ahead of print.

Abstract

Objective: Maintaining robust surveillance programs for abdominal aortic aneurysms (AAAs) is important, but these programs are expensive and labor-intensive, typically requiring manual data review by trained health care professionals. Studies have shown that natural language processing software can assist in these functions, but each task-specific algorithm requires human-directed training before use. Our objective was to evaluate the use of a large language model (LLM) to extract AAA-related data using generalized artificial intelligence, negating the need for task-based training.

Methods: This study examined ultrasound and cross-sectional (computed tomography [CT], magnetic resonance imaging) abdominal imaging reports randomly selected for human review from a prospectively maintained AAA surveillance registry within an integrated health system from 2008 to 2024. Llama 3.3 70B (Meta) was run on a local Ollama server behind health system firewalls without external telemetry. The model extracted the maximal abdominal aortic diameter from each radiology report or, if no diameter was stated, interpreted descriptive terms to classify the report as positive for AAA, negative for AAA, or unknown (aneurysmal status could not be determined). The model output was compared with results extracted by independent expert reviewers using standard machine learning metrics: accuracy, sensitivity (recall), positive predictive value (precision), and F1-score (harmonic mean of precision and recall).
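The four metrics named above can be computed from paired reference and model labels. The following Python is a minimal sketch, not the study's code: the function name `score_labels`, the label strings, and the choice to treat "positive" as the positive class while counting "negative" and "unknown" together as non-positive are assumptions made for illustration. The abstract does not state how the three-way labels were reduced to these metrics.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    accuracy: float
    sensitivity: float  # recall
    ppv: float          # precision
    f1: float


def score_labels(reference, predicted, positive="positive"):
    """Compare model labels with expert labels (hypothetical helper).

    Accuracy counts exact label matches across all classes.
    Sensitivity, PPV, and F1 treat `positive` as the positive class
    and every other label as non-positive (an assumption).
    """
    if len(reference) != len(predicted) or not reference:
        raise ValueError("label lists must be non-empty and of equal length")

    pairs = list(zip(reference, predicted))
    correct = sum(1 for r, p in pairs if r == p)
    tp = sum(1 for r, p in pairs if r == positive and p == positive)
    fp = sum(1 for r, p in pairs if r != positive and p == positive)
    fn = sum(1 for r, p in pairs if r == positive and p != positive)

    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    # F1 is the harmonic mean of precision (PPV) and recall (sensitivity).
    f1 = (2 * ppv * sensitivity / (ppv + sensitivity)) if (ppv + sensitivity) else 0.0
    return Metrics(correct / len(pairs), sensitivity, ppv, f1)


# Toy labels for illustration only; these are not study data.
expert = ["positive", "positive", "negative", "unknown", "positive"]
model = ["positive", "negative", "negative", "unknown", "positive"]
toy = score_labels(expert, model)
```

In this toy example the model misses one of three expert-positive reports, so sensitivity falls below 1 while the positive predictive value stays at 1.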

Results: The LLM analysis included 16,331 human expert-reviewed abdominal imaging reports for 11,799 patients. Of these studies, 6102 were ultrasounds (37.4%), and 10,229 were cross-sectional studies (62.6%). The cross-sectional imaging studies included CT (81.9%), magnetic resonance imaging (12.5%), and positron emission tomography-CT (5.6%). The overall accuracy, sensitivity (recall), positive predictive value (precision), and F1-score for the assigned task were 0.93, 0.96, 0.96, and 0.96, respectively. A discrete abdominal aortic diameter was present in 8478 imaging reports (51.9%). For maximal AAA diameters between 3 and 7 cm, the model F1-score was 0.97 to 0.99.
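The reported proportions and the overall F1-score can be checked for internal consistency from the counts stated in the Results. The short Python sketch below uses only those numbers; the helper names `pct` and `f1` are illustrative.

```python
def pct(part, whole):
    """Percentage of `whole` represented by `part`, to one decimal place."""
    return round(100 * part / whole, 1)


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Counts taken from the Results paragraph.
total_reports = 16331
ultrasound = 6102
cross_sectional = 10229
with_diameter = 8478

# The two modality groups should account for every report.
groups_sum_to_total = ultrasound + cross_sectional == total_reports

ultrasound_pct = pct(ultrasound, total_reports)
cross_sectional_pct = pct(cross_sectional, total_reports)
with_diameter_pct = pct(with_diameter, total_reports)

# Reported precision and recall are both 0.96.
overall_f1 = round(f1(0.96, 0.96), 2)
```

Each derived value agrees with the corresponding figure in the Results (37.4%, 62.6%, 51.9%, and an F1-score of 0.96).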

Conclusions: LLM extraction of aortic information from abdominal imaging reports is exceptionally reliable without the need for additional human-directed training. In general, LLMs allow for flexible and efficient data mining with minimal human effort. Many LLMs are publicly available and incur no processing costs, making them easily accessible and cost-effective tools to decrease the administrative burden of running complex AAA surveillance registries, with the added opportunity to improve the quality and efficiency of clinical research in the field.

Keywords: Abdominal aortic aneurysm; Aneurysm surveillance; Large language model.