Holistic evaluation of large language models for medical tasks with MedHELM.
Bedi S, Cui H, Fuentes M, Unell A, Wornow M, Banda JM, Kotecha N, Keyes T, Mai Y, Oez M, Qiu H, Jain S, Schettini L, Kashyap M, Fries JA, Swaminathan A, Chung P, Haredasht FN, Lopez I, Aali A, Tse G, Nayak A, Vedak S, Jain SS, Patel B, Fayanju O, Shah S, Goh E, Yao DH, Soetikno B, Reis E, Gatidis S, Divi V, Capasso R, Saralkar R, Chiang CC, Jindal J, Pham T, Ghoddusi F, Lin S, Chiou AS, Hong HJ, Roy M, Gensheimer MF, Patel H, Schulman K, Dash D, Char D, Downing L, Grolleau F, Black K, Mieso B, Zahedivash A, Yim WW, Sharma H, Lee T, Kirsch H, Lee J, Ambers N, Lugtu C, Sharma A, Mawji B, Alekseyev A, Zhou V, Kakkar V, Helzer J, Revri A, Bannett Y, Daneshjou R, Chen J, Alsentzer E, Morse K, Ravi N, Aghaeepour N, Kennedy V, Chaudhari A, Wang T, Koyejo S, Lungren MP, Horvitz E, Liang P, Pfeffer MA, Shah NH.
Nat Med. 2026 Mar;32(3):943-951. doi: 10.1038/s41591-025-04151-2. Epub 2026 Jan 20.
PMID: 41559415