Early diagnosis of skin cancer remains a pressing challenge in dermatological and oncological practice. AI-driven deep learning models have emerged as powerful tools for automating the classification of skin lesions from dermoscopic images. This study introduces a novel hybrid deep learning model, an Enhanced Vision Transformer (EViT) combined with DenseNet169 (EViT-Dens169), for the accurate classification of dermoscopic skin lesion images. The proposed architecture integrates EViT with DenseNet169 to leverage both global context and fine-grained local features. The EViT encoder comprises six attention-based encoder blocks built on a multi-head self-attention (MHSA) mechanism with layer normalization, enabling efficient global spatial understanding. To preserve the local spatial continuity lost during patch segmentation, we introduce a Spatial Detail Enhancement Block (SDEB) comprising three parallel convolutional layers followed by a fusion layer; these layers reconstruct the edge, boundary, and texture details that are critical for lesion detection. The DenseNet169 backbone, modified to suit dermoscopic data, extracts local features that complement the global attention features. The outputs from EViT and DenseNet169 are flattened and fused via element-wise addition, followed by a multilayer perceptron (MLP) and a softmax layer for final classification across seven skin lesion categories. On the ISIC 2018 dataset, the proposed hybrid model achieves superior performance, with an accuracy of 97.1%, a sensitivity of 90.8%, a specificity of 99.29%, and an AUC of 95.17%, outperforming existing state-of-the-art models. By efficiently fusing global and local features, the hybrid EViT-Dens169 model provides a robust solution for early skin cancer detection.
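The overall architecture can be illustrated with a minimal PyTorch sketch of the described two-branch fusion: a patch-embedded transformer branch with six MHSA encoder blocks and an SDEB (three parallel convolutions plus a fusion layer), a DenseNet169 feature branch, element-wise addition, and an MLP head over seven classes. The 224x224 input size, 16x16 patches, 512-dimensional fusion width, kernel sizes, and the pooling/projection choices are illustrative assumptions, not the authors' exact configuration.

```python
# Hedged sketch of an EViT-Dens169-style hybrid; layer sizes are assumptions.
import torch
import torch.nn as nn
from torchvision.models import densenet169


class SpatialDetailEnhancementBlock(nn.Module):
    """Three parallel convolutions plus a 1x1 fusion layer, intended to
    restore edge/boundary/texture detail lost by patch segmentation."""
    def __init__(self, channels: int):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, k, padding=k // 2) for k in (1, 3, 5)
        ])
        self.fuse = nn.Conv2d(3 * channels, channels, kernel_size=1)

    def forward(self, x):                         # x: (B, C, H, W)
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))


class EViTDens169(nn.Module):
    def __init__(self, num_classes: int = 7, dim: int = 512, patch: int = 16):
        super().__init__()
        # Global branch: patch embedding + SDEB + 6 MHSA encoder blocks.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.sdeb = SpatialDetailEnhancementBlock(dim)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=8, batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)
        self.global_pool = nn.AdaptiveAvgPool1d(1)
        # Local branch: DenseNet169 backbone with its classifier removed.
        backbone = densenet169(weights=None)
        self.cnn = backbone.features              # outputs (B, 1664, h, w)
        self.cnn_proj = nn.Linear(1664, dim)      # match the fusion width
        # Fusion head: element-wise addition -> MLP -> class scores.
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Dropout(0.2),
            nn.Linear(dim, num_classes))

    def forward(self, x):                         # x: (B, 3, 224, 224)
        p = self.sdeb(self.patch_embed(x))        # (B, dim, 14, 14)
        tokens = p.flatten(2).transpose(1, 2)     # (B, 196, dim)
        g = self.encoder(tokens).transpose(1, 2)  # (B, dim, 196)
        g = self.global_pool(g).squeeze(-1)       # pooled global features
        c = self.cnn_proj(self.cnn(x).mean(dim=(2, 3)))  # local CNN features
        return self.mlp(g + c)                    # logits; softmax at inference


if __name__ == "__main__":
    logits = EViTDens169()(torch.randn(2, 3, 224, 224))
    probs = torch.softmax(logits, dim=1)          # seven-class probabilities
    print(probs.shape)                            # torch.Size([2, 7])
```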
Keywords: Deep learning; Enhanced DenseNet169; Enhanced ViT-Encoder; Hybrid model; Skin cancer.