Objective: To predict cartilage tumor malignancy from radiographs combined with readily available non-imaging information, using a vision-language foundation model.
Materials and methods: This single-institution study assembled a dataset of 3336 radiographs from 455 patients with enchondroma or chondrosarcoma, drawn from two sources: (1) patients with histopathology-confirmed diagnoses of enchondroma or chondrosarcoma, and (2) patients with enchondroma that was not biopsied but remained stable on long-term imaging follow-up. A vision-language foundation model based on the pre-trained CLIP (Contrastive Language-Image Pretraining) architecture was fine-tuned with our proposed Medical Knowledge Adapters and evaluated using 10-fold patient-level cross-validation to predict cartilage tumor malignancy from plain radiographs and demographic information.
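As an illustration of what patient-level cross-validation involves, the following is a minimal sketch and not the authors' code. It assumes each radiograph carries a patient identifier; the function name `patient_level_folds`, the toy identifiers, and the round-robin assignment are all hypothetical choices, and stratification by diagnosis is omitted for brevity. The point it shows is that patients, rather than individual images, are assigned to folds, so no patient contributes radiographs to both the training and the test side of a split.

```python
import random
from collections import defaultdict


def patient_level_folds(patient_ids, n_folds=10, seed=0):
    """Yield (train_idx, test_idx) index lists, one pair per fold.

    All radiographs of a given patient fall in exactly one test fold,
    so a patient never appears on both sides of a split.
    """
    # Group radiograph indices by the patient they belong to.
    by_patient = defaultdict(list)
    for idx, pid in enumerate(patient_ids):
        by_patient[pid].append(idx)

    # Shuffle patients (not images) reproducibly, then deal them round-robin.
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)
    folds = [patients[i::n_folds] for i in range(n_folds)]

    for held_out in folds:
        held = set(held_out)
        test_idx = [i for p in held_out for i in by_patient[p]]
        train_idx = [i for p in patients if p not in held for i in by_patient[p]]
        yield train_idx, test_idx


# Toy data (hypothetical): 25 patients with 1 to 4 radiographs each.
rng = random.Random(1)
patient_ids = [f"P{p:02d}" for p in range(25) for _ in range(rng.randint(1, 4))]
splits = list(patient_level_folds(patient_ids, n_folds=10))
```

In such a setup, a metric like the AUC would be computed on each held-out fold and then summarized across folds as mean ± standard deviation.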
Results and conclusion: Using radiographs alone, the model achieved an area under the receiver operating characteristic curve (AUC) of 0.91 ± 0.04. Incorporating demographics improved the AUC to 0.94 ± 0.02. Subgroup analysis demonstrated robust generalizability across tumor grades, with an AUC of 0.91 ± 0.07 in distinguishing atypical cartilaginous tumors (ACT), previously known as low-grade chondrosarcomas (LGCS), from enchondromas, and 0.95 ± 0.02 in differentiating high-grade chondrosarcomas from enchondromas. Within the clinically challenging extremity subgroup (enchondroma vs ACT/LGCS), the model achieved an AUC of 0.79 ± 0.14, reflecting the diagnostic difficulty observed in clinical practice. This foundation model-based approach demonstrates strong performance using accessible data sources, offering a non-invasive, cost-efficient, and scalable solution for cartilage tumor assessment in musculoskeletal oncology.
Keywords: Cartilage tumor; Multi-modal deep learning; Musculoskeletal oncology; Vision-language foundation model.
© 2026. The Author(s), under exclusive licence to International Skeletal Society (ISS).