One-shot evaluation of Mistral7B on the new EuropeMedQA benchmark

Recenti Prog Med. 2025 Oct;116(10):619-620. doi: 10.1701/4573.45804.
[Article in Italian]

Abstract

Adoption of artificial intelligence (AI) in healthcare is rising, and unbiased evaluation of AI models requires uncontaminated benchmarks. We evaluated Mistral-7B-Instruct-v0.1 on 1120 human-validated Italian medical multiple-choice questions (SSM). The model achieved 40.2% accuracy and a 38.8% F1 score on this dataset. Likely causes of this result include English-centric instruction tuning, lack of medical domain knowledge, and prompt misalignment with the task format. These findings suggest that large language models (LLMs) need further improvement before deployment in healthcare.
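
For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how accuracy and an F1 score can be computed for multiple-choice answers. It is not the authors' evaluation code. The abstract does not state how F1 was averaged, so macro-averaging over answer options is an assumption here, and the function names and the label set are hypothetical.

```python
from typing import Sequence


def accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of questions where the predicted option equals the gold option."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold: Sequence[str], pred: Sequence[str], labels: Sequence[str]) -> float:
    """Unweighted mean of per-option F1 scores (macro-averaging is an assumption)."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    per_label = []
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the option never occurs.
        per_label.append(2 * tp / denom if denom else 0.0)
    return sum(per_label) / len(per_label)


if __name__ == "__main__":
    # Toy data for illustration only; these are not questions from the benchmark.
    gold = ["A", "B", "C", "D", "A"]
    pred = ["A", "B", "D", "D", "C"]
    print(f"accuracy = {accuracy(gold, pred):.3f}")
    print(f"macro F1 = {macro_f1(gold, pred, labels='ABCD'):.3f}")
```

Accuracy and F1 can differ, as they do in the reported results, when errors are spread unevenly across the answer options, since macro-averaged F1 weights every option equally regardless of how often it is the correct answer.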

Publication types

  • English Abstract

MeSH terms

  • Artificial Intelligence*
  • Benchmarking*
  • Delivery of Health Care*
  • Humans
  • Italy