Large Language Models for Cardiac MRI Diagnosis Based on Standardized Text Descriptions

J Magn Reson Imaging. 2026 Jul;64(1):153-165. doi: 10.1002/jmri.70327. Epub 2026 Apr 23.

Abstract

Background: MRI is important for cardiac disease evaluation, but accurate diagnosis remains challenging in less experienced centers. Although large language models (LLMs) have shown promise in medical imaging diagnosis, their application in cardiac MRI is limited.

Hypothesis: LLMs may diagnose cardiac disease effectively from standardized descriptions of cardiac MRI findings.

Study type: Retrospective.

Population: A total of 951 subjects: 203 with hypertrophic cardiomyopathy, 186 with dilated cardiomyopathy, 46 with hypertensive heart disease, 198 with ischemic cardiomyopathy, 38 with constrictive pericarditis, 45 with cardiac amyloidosis, 91 with myocarditis, and 144 normal controls.

Field strength/sequences: Balanced steady-state free-precession, short tau inversion recovery, and breath-hold inversion-recovery segmented gradient-echo sequences at 3.0 T.

Assessment: Clinical and cardiac MRI information from each subject was converted into standardized descriptions and input into four LLMs: Generative Pre-trained Transformer-4.5 (GPT-4.5), GPT-4 Omni (GPT-4o), Deepseek-V3, and Deepseek-R1. Cardiac MRI information included left ventricular (LV) function, wall thickness and motion, and abnormalities on T2-weighted imaging (T2WI), perfusion, and late gadolinium enhancement sequences. Each model was asked to generate an imaging diagnosis. In addition, a medical student (8 months' experience) and three radiologists (junior, mid-level, and senior, with 3, 6, and 10 years' experience, respectively) provided diagnoses based on cardiac MRI images and clinical information.
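
As an illustration of the conversion step described above, the following is a minimal sketch of how structured findings might be rendered into a standardized text description for an LLM. The abstract does not give the template or the prompt wording used in the study, so the class `CardiacMriFindings`, its field names, the function `build_prompt`, and the instruction text are all hypothetical. Only the finding categories and the candidate diagnoses are taken from the abstract.

```python
from dataclasses import dataclass

# Candidate diagnoses, taken from the study population listed in the abstract.
DIAGNOSES = [
    "hypertrophic cardiomyopathy",
    "dilated cardiomyopathy",
    "hypertensive heart disease",
    "ischemic cardiomyopathy",
    "constrictive pericarditis",
    "cardiac amyloidosis",
    "myocarditis",
    "normal",
]


@dataclass
class CardiacMriFindings:
    """Hypothetical container; field names are assumptions, not the study's schema."""
    clinical: str
    lv_function: str
    wall_thickness: str
    wall_motion: str
    t2wi: str
    perfusion: str
    lge: str


def build_prompt(f: CardiacMriFindings) -> str:
    """Render findings in a fixed order so every case reads the same way."""
    lines = [
        f"Clinical information: {f.clinical}",
        f"LV function: {f.lv_function}",
        f"Wall thickness: {f.wall_thickness}",
        f"Wall motion: {f.wall_motion}",
        f"T2WI: {f.t2wi}",
        f"Perfusion: {f.perfusion}",
        f"Late gadolinium enhancement: {f.lge}",
        "",
        # Hypothetical instruction; the study's actual wording is not given.
        "Choose the single most likely diagnosis from: " + "; ".join(DIAGNOSES) + ".",
    ]
    return "\n".join(lines)
```

Fixing the field order and the list of allowed answers is one way to keep the model's output comparable across cases; whether the study constrained the answer set in this way is not stated in the abstract.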

Statistical tests: Frequency-weighted sensitivity and specificity were calculated. The diagnostic performances of the LLMs and human readers were compared using the McNemar test with Bonferroni correction. A p value < 0.05 was considered significant.
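
The frequency-weighted metrics named above can be sketched as follows. This assumes the common multi-class definition, in which sensitivity and specificity are computed one-vs-rest for each diagnosis and then averaged with weights proportional to each diagnosis's share of the cases; the abstract does not spell out the exact formula, so this is an assumption. The function name `frequency_weighted_metrics` is hypothetical.

```python
from collections import Counter


def frequency_weighted_metrics(y_true, y_pred):
    """Return (sensitivity, specificity), each weighted by true-class frequency.

    Assumed definition: one-vs-rest metrics per class, weighted by the
    number of true cases in that class divided by the total case count.
    """
    n = len(y_true)
    support = Counter(y_true)  # number of true cases per diagnosis
    pairs = list(zip(y_true, y_pred))
    sens = 0.0
    spec = 0.0
    for label, n_k in support.items():
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        negatives = n - n_k  # cases whose true diagnosis is not `label`
        tn = negatives - fp
        weight = n_k / n
        sens += weight * (tp / n_k)
        # With a single class there are no negatives; treat specificity as 1.
        spec += weight * (tn / negatives if negatives else 1.0)
    return sens, spec
```

Under this definition the frequency-weighted sensitivity equals overall accuracy, because each class's true positives are divided by its support and then multiplied by that same support over the total.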

Results: All LLMs showed excellent frequency-weighted specificity (0.973-0.983). The frequency-weighted sensitivities of all LLMs (GPT-4.5: 0.863, GPT-4o: 0.821, Deepseek-V3: 0.843, and Deepseek-R1: 0.851) were not significantly different from that of the junior radiologist (0.850, all adjusted p = 1.000), were significantly higher than that of the medical student (0.731, all adjusted p < 0.001), and were significantly lower than that of the senior radiologist (0.942, all adjusted p < 0.001). Additionally, the mid-level radiologist achieved a frequency-weighted sensitivity of 0.895, outperforming all LLMs except GPT-4.5.
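
For readers unfamiliar with the paired comparison behind these adjusted p values, here is a minimal sketch of a McNemar test followed by a Bonferroni correction. It assumes the exact (binomial) form of the test; the abstract does not say whether the exact or the chi-square version was used, nor how many comparisons entered the correction. The function names `mcnemar_exact` and `bonferroni` are hypothetical.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p value from the discordant pair counts.

    b: cases reader A diagnosed correctly and reader B did not.
    c: cases reader B diagnosed correctly and reader A did not.
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def bonferroni(p_values):
    """Multiply each p value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

The cap at 1 in `bonferroni` is why a corrected p value can be reported as exactly 1.000 when the unadjusted value is already large.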

Data conclusion: LLMs may generate accurate diagnoses from standardized cardiac MRI descriptions, potentially benefiting less experienced physicians.

Technical efficacy: Stage 5.

Keywords: cardiac MRI; ischemic cardiomyopathy; large language models; myocarditis; non‐ischemic cardiomyopathy.

Plain language summary

Reading cardiac MRI scans can be difficult, especially for doctors who have less clinical experience. New artificial intelligence tools called large language models (LLMs) may help support this process. In our study, we found that several LLMs were able to make diagnostic judgments at a level similar to junior radiologists. This suggests that LLMs could be used as supportive tools to provide an initial interpretation of cardiac MRI examinations. Such assistance may help improve diagnostic efficiency, reduce workload, and promote more consistent early evaluation, particularly in settings where experienced cardiac imaging specialists are not readily available.

MeSH terms

  • Adult
  • Aged
  • Cardiomyopathy, Hypertrophic / diagnostic imaging
  • Female
  • Heart Diseases* / diagnostic imaging
  • Heart* / diagnostic imaging
  • Humans
  • Image Interpretation, Computer-Assisted* / methods
  • Large Language Models*
  • Magnetic Resonance Imaging* / methods
  • Male
  • Middle Aged
  • Reproducibility of Results
  • Retrospective Studies
  • Sensitivity and Specificity