Introduction: Large language models perform well on a range of academic tasks, including medical examinations. The performance of this class of models in psychopharmacology has not been explored.
Method: ChatGPT Plus, implementing the GPT-4 large language model, was presented with each of 10 previously studied antidepressant prescribing vignettes in randomized order; responses were regenerated 5 times per vignette to evaluate their stability. Results were compared to expert consensus.
Results: At least one of the optimal medication choices was included among the best choices in 38/50 (76%) vignette presentations: 5/5 times for 7 vignettes, 3/5 for 1, and 0/5 for 2. At least one of the poor-choice or contraindicated medications was included among the choices rated optimal or good in 24/50 (48%) of vignette presentations. As rationale for treatment selection, the model cited multiple heuristics, including avoiding medications that had previously been unsuccessful, avoiding adverse effects in light of comorbidities, and generalizing within a medication class.
Conclusion: The model appeared to identify and apply a number of heuristics commonly used in psychopharmacologic clinical practice. However, its inclusion of suboptimal recommendations indicates that large language models may pose a substantial risk if applied routinely to guide psychopharmacologic treatment without further monitoring.