A Comparative Study on the Use of DeepSeek-R1 and ChatGPT-4.5 in Different Aspects of Plastic Surgery

Aesthetic Plast Surg. 2025 Aug 11. doi: 10.1007/s00266-025-05108-z. Online ahead of print.

Abstract

Background: Artificial intelligence (AI) has the potential to enhance medical practice, but its application in plastic surgery remains underexplored. DeepSeek-R1 and ChatGPT-4.5 are AI models that can assist with clinical tasks, but their performance on plastic surgery-related queries needs evaluation. This study compares the two models' ability to provide clinically relevant, detailed, and accurate responses.

Objective: To evaluate and compare the performance of DeepSeek-R1 and ChatGPT-4.5 across 10 plastic surgery-related tasks, focusing on accuracy, detail, and clinical relevance.

Methods: Two senior plastic surgeons reviewed the AI-generated responses for each task and rated them on a 1-10 scale for accuracy, completeness, and clinical relevance. The tasks included both general knowledge questions and more complex, clinically oriented tasks such as drafting medical history notes and hospital admission/discharge slips. After scoring, the mean and standard deviation (SD) were calculated for each model to assess overall performance and consistency.
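For illustration only, the sketch below shows how the per-model mean and SD described above could be computed from two evaluators' 1-10 ratings. The model names match the study, but the task labels and scores are hypothetical placeholders, and the use of the sample SD is an assumption, since the abstract does not state which form was used.

```python
# Illustrative sketch only: hypothetical ratings, not data from the study.
# Computes a per-model mean and standard deviation across two evaluators' 1-10 scores.
import statistics

# scores[model][task] = (evaluator_1_score, evaluator_2_score)  -- placeholder values
scores = {
    "DeepSeek-R1": {"botulinum_toxin": (9, 9), "discharge_slip": (9, 8)},
    "ChatGPT-4.5": {"botulinum_toxin": (8, 7), "discharge_slip": (7, 7)},
}

for model, tasks in scores.items():
    all_ratings = [r for pair in tasks.values() for r in pair]
    mean = statistics.mean(all_ratings)
    sd = statistics.stdev(all_ratings)  # sample SD; an assumption about the paper's method
    print(f"{model}: mean={mean:.2f}, SD={sd:.2f}")
```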

Results: DeepSeek-R1 outperformed ChatGPT-4.5 overall, receiving higher average scores from both evaluators. It excelled in tasks requiring high clinical detail, comprehensive explanations, and professional-level accuracy, particularly those involving botulinum toxin, medical documentation, and novel research topics. In contrast, ChatGPT-4.5 was rated higher on tasks requiring concise responses, providing accurate but less detailed overviews. DeepSeek-R1's mean scores were significantly higher and its standard deviations lower, indicating greater consistency in its responses. ChatGPT-4.5 performed well on general inquiries but showed more variability and scored lower on complex clinical tasks.

Conclusion: DeepSeek-R1 is better suited to tasks requiring clinical detail and professional-level accuracy, while ChatGPT-4.5 excels at providing quick, concise responses. Both models show promise in supporting plastic surgery practice and education, but they should complement, not replace, human expertise.

Level of Evidence V: This journal requires that authors assign a level of evidence to each article. For a full description of these Evidence-Based Medicine ratings, please refer to the Table of Contents or the online Instructions to Authors at www.springer.com/00266.

Keywords: AI in healthcare; Artificial intelligence; Botulinum toxin; ChatGPT-4.5; Clinical decision support; DeepSeek-R1; Medical documentation; Plastic surgery.