Distinguishing infiltrative basal cell carcinoma (BCC) from poorly differentiated cutaneous squamous cell carcinoma (cSCC) remains a significant histopathological challenge. Automated deep learning approaches hold promise for improving diagnostic reliability, yet robust external validation is essential. In this study, we developed a weakly supervised deep learning model to classify these diagnostically challenging subtypes and evaluated its generalizability across internal and external cohorts, as well as in comparison to a dermatopathology foundation model (HistoGPT). The model employed a multiple-instance learning framework (CLAM) using the histopathology-specific transformer Phikon for feature extraction from whole-slide images. Slide-level ground-truth diagnoses from the collected images (n = 335, University Hospital Erlangen) were derived from routine clinical practice and re-evaluated by two board-certified dermatopathologists. Performance was assessed on an internal test set of 84 whole-slide images (27 cSCC and 57 BCC) and two external datasets: the Queensland cohort (n = 10, curated in-distribution cases) and the COBRA cohort (n = 200, broad, partly out-of-distribution cases). Model discrimination was quantified using receiver operating characteristic (ROC) curves, while accuracy, sensitivity, and specificity were reported alongside 95% Wilson confidence intervals (CIs). On the internal test set, the model achieved perfect classification [area under the ROC curve (AUC) = 1.0; 100% accuracy, sensitivity, and specificity]. Similarly strong performance was observed in the Queensland cohort (AUC = 1.0), although this result is limited by sample size. In the more heterogeneous COBRA cohort, discrimination remained high (AUC = 0.923, 95% CI 0.885-0.961), but threshold adjustment was required to correct for marked calibration shift (balanced accuracy 86.5% at Youden's J). Attention heatmaps highlighted histologically meaningful regions.
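The threshold adjustment at Youden's J mentioned for the COBRA cohort can be illustrated with a minimal sketch. It assumes a binary setup in which a slide is called positive when its score is at or above the threshold; the abstract does not state which class is treated as positive, and the scores and labels below are invented toy values, not study data. The function name `youden_threshold` is hypothetical.

```python
def youden_threshold(scores, labels):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1.

    Assumption: label 1 is the positive class and a slide is predicted
    positive when score >= threshold. Ties keep the lowest threshold.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present")
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):  # every observed score is a candidate cut-off
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < t)
        j = tp / pos + tn / neg - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


# Toy scores that rank the classes fairly well but sit low on the
# probability scale, mimicking a calibration shift (invented values).
toy_scores = [0.02, 0.05, 0.08, 0.12, 0.15, 0.40, 0.70, 0.90]
toy_labels = [0, 0, 0, 1, 1, 0, 1, 1]

threshold, j = youden_threshold(toy_scores, toy_labels)
balanced_accuracy = (j + 1) / 2  # mean of sensitivity and specificity
print(threshold, j, balanced_accuracy)
```

On this toy example the default cut-off of 0.5 would miss half of the positive slides, whereas the Youden-optimal cut-off recovers them at the cost of one false positive. This is the kind of shift a fixed threshold cannot absorb even when the AUC stays high.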
In zero-shot evaluation on the internal test set, HistoGPT achieved an overall accuracy of 77%, with high class-wise sensitivity for BCC (98%, 95% CI 91-100) but markedly reduced sensitivity for cSCC (33%, 95% CI 19-52). Fine-tuning a task-specific classifier on the HistoGPT backbone substantially improved performance, achieving near-perfect discrimination and 98% balanced accuracy. These findings demonstrate that weakly supervised deep learning enables highly accurate classification of diagnostically challenging BCC and cSCC subtypes. However, reliable deployment across institutions necessitates careful calibration and domain adaptation, and even powerful foundation models such as HistoGPT benefit from targeted fine-tuning to ensure robust performance in dermatopathology.
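The 95% Wilson confidence intervals quoted above can be reproduced with a short sketch of the standard Wilson score formula. The counts used here (9 of 27 cSCC and 56 of 57 BCC slides correctly classified) are not stated in the abstract; they are inferred from the reported percentages and class sizes and should be read as assumptions. The function name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.959963984540054):
    """Wilson score interval for a binomial proportion (default: 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Assumed counts, inferred from the reported sensitivities (not given in the text).
cscc_low, cscc_high = wilson_interval(9, 27)   # cSCC sensitivity, about 33%
bcc_low, bcc_high = wilson_interval(56, 57)    # BCC sensitivity, about 98%

print(round(cscc_low * 100), round(cscc_high * 100))  # 19 52
print(round(bcc_low * 100), round(bcc_high * 100))    # 91 100
```

Under these assumed counts the rounded bounds agree with the intervals reported for HistoGPT. Unlike the simpler Wald interval, the Wilson interval stays within 0 to 100% for proportions close to the boundary, which matters for values such as 98% with n = 57.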
Keywords: artificial intelligence; basal cell carcinoma; clinical pathology; computer‐assisted image interpretation; deep learning; skin neoplasms; squamous cell carcinoma.
© 2026 The Author(s). The Journal of Pathology: Clinical Research published by The Pathological Society of Great Britain and Ireland and John Wiley & Sons Ltd.