Holistic evaluation of large language models for medical tasks with MedHELM.
Bedi S, Cui H, Fuentes M, Unell A, Wornow M, Banda JM, Kotecha N, Keyes T, Mai Y, Oez M, Qiu H, Jain S, Schettini L, Kashyap M, Fries JA, Swaminathan A, Chung P, Haredasht FN, Lopez I, Aali A, Tse G, Nayak A, Vedak S, Jain SS, Patel B, Fayanju O, Shah S, Goh E, Yao DH, Soetikno B, Reis E, Gatidis S, Divi V, Capasso R, Saralkar R, Chiang CC, Jindal J, Pham T, Ghoddusi F, Lin S, Chiou AS, Hong HJ, Roy M, Gensheimer MF, Patel H, Schulman K, Dash D, Char D, Downing L, Grolleau F, Black K, Mieso B, Zahedivash A, Yim WW, Sharma H, Lee T, Kirsch H, Lee J, Ambers N, Lugtu C, Sharma A, Mawji B, Alekseyev A, Zhou V, Kakkar V, Helzer J, Revri A, Bannett Y, Daneshjou R, Chen J, Alsentzer E, Morse K, Ravi N, Aghaeepour N, Kennedy V, Chaudhari A, Wang T, Koyejo S, Lungren MP, Horvitz E, Liang P, Pfeffer MA, Shah NH.
Nat Med. 2026 Jan 20. doi: 10.1038/s41591-025-04151-2. Online ahead of print.
PMID: 41559415