Artificial intelligence (AI) adoption in healthcare is rising, and unbiased evaluation of AI models requires uncontaminated benchmarks. We evaluated Mistral-7B-Instruct-v0.1 on 1120 human-validated Italian medical multiple-choice questions (SSM). The model achieved 40.2% accuracy and a 38.8% F1 score on this dataset. Likely causes of the low performance include English-centric instruction tuning, limited medical domain knowledge, and prompt misalignment with the task format. These findings suggest that LLMs such as Mistral-7B-Instruct-v0.1 need further improvement before deployment in Italian-language medical settings.