Leveraging AI Large Language Models for Writing Clinical Trial Proposals in Dermatology: Instrument Validation Study

JMIR Dermatol. 2026 Jan 12;9:e76674. doi: 10.2196/76674.

Abstract

Background: Large language models (LLMs) are becoming increasingly popular in clinical trial design but have been underused in research proposal development.

Objective: This study compared the performance of commonly used open-access LLMs with that of humans in composing and reviewing research proposals.

Methods: A total of 10 LLMs were prompted to write a research proposal. Six physicians and each of the LLMs then scored 11 blinded proposals for accuracy and comprehensiveness to characterize the models' capabilities and limitations.

Results: Human scorers rated ChatGPT-o1 as the most accurate and Llama 3.1 as the least accurate. LLM scorers rated ChatGPT-o1 and DeepSeek R1 as the most accurate. Both human and LLM scorers rated ChatGPT-o1 as the most comprehensive and Llama 3.1 as the least comprehensive. LLMs performed poorly as scorers, rating proposals, on average, 1.9 points higher than humans on both accuracy and comprehensiveness.

Conclusions: Paid versions of ChatGPT remain the highest-quality and most versatile of the available LLMs. These tools cannot replace expert input, but they can serve as powerful assistants, streamlining proposal development and enhancing productivity.

Keywords: AI; artificial intelligence; clinical research; clinical trials; deep learning; large language model; machine learning; research design; research proposal.

Publication types

  • Validation Study

MeSH terms

  • Artificial Intelligence*
  • Clinical Trials as Topic* / methods
  • Dermatology*
  • Humans
  • Language
  • Large Language Models
  • Research Design*
  • Writing*