Background: Radiomics shows promising results in supporting clinical decision-making, and much effort has gone into its standardization, leading to the Image Biomarker Standardisation Initiative (IBSI), which established how radiomics features should be computed. However, radiomics is still not fully standardized, and many factors, such as the segmentation method, limit study reproducibility and robustness.
Aim: We investigated the impact of three segmentation methods (manual, thresholding and region growing) on radiomics features extracted from 18F-PSMA-1007 Positron Emission Tomography (PET) images of 78 patients (43 low-risk, 35 high-risk). Each patient was segmented with all three methods, yielding three segmentation datasets. Feature extraction was then performed on each dataset, producing 1781 features (107 original, 930 Laplacian of Gaussian (LoG) and 744 wavelet features). Feature robustness and reproducibility were assessed with the intraclass correlation coefficient (ICC), which measured agreement between the three segmentation methods. To assess the impact of the three methods on machine learning models, feature selection was performed with a hybrid descriptive-inferential method, and the selected features were given as input to six classifiers: K-Nearest Neighbors (KNN), Support Vector Machines (SVM), Linear Discriminant Analysis (LDA), Random Forest (RF), AdaBoost and Neural Networks (NN). Their performance in discriminating between low-risk and high-risk patients was validated through five-fold cross-validation repeated 30 times.
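The agreement analysis described above can be illustrated with a minimal sketch. The abstract does not state which ICC form was used; this example assumes the two-way random-effects, absolute-agreement, single-measurement form, often written ICC(2,1), with patients as subjects and the three segmentation methods as raters. The function name `icc2_1` and the toy values are hypothetical and are not study data.

```python
def icc2_1(ratings):
    """ICC(2,1) from a two-way ANOVA decomposition (assumed form, see lead-in).

    ratings: list of rows, one row per patient, one column per
    segmentation method, each entry the value of a single feature.
    """
    n = len(ratings)       # number of patients (subjects)
    k = len(ratings[0])    # number of segmentation methods (raters)
    grand = sum(sum(row) for row in ratings) / (n * k)

    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares: between patients, between methods, residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)              # mean square, patients
    msc = ss_cols / (k - 1)              # mean square, methods
    mse = ss_err / ((n - 1) * (k - 1))   # mean square, residual

    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Hypothetical values of one feature for four patients under
# manual, thresholding and region-growing segmentation.
toy_feature = [
    [0.82, 0.80, 0.85],
    [0.40, 0.45, 0.38],
    [0.61, 0.59, 0.66],
    [0.95, 0.90, 0.97],
]
print(f"ICC(2,1) = {icc2_1(toy_feature):.3f}")
```

In a study of this kind, a function like this would be applied once per feature, and the resulting ICC values would then be averaged within each feature family or filter.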
Conclusions: Our study showed that segmentation methods influence radiomics features: shape features were the least reproducible (average ICC: 0.27), while GLCM features were the most reproducible. Moreover, feature reproducibility changed depending on segmentation type: 51.18% of LoG features exhibited excellent reproducibility (range of average ICC: 0.68-0.87), whereas 47.85% of wavelet features exhibited poor reproducibility, which varied between wavelet sub-bands (range of average ICC: 0.34-0.80), with the LLL band showing the highest average ICC (0.80). Finally, region growing yielded the highest accuracy (74.49%) and improved sensitivity (84.38%) and AUC (79.20%) compared with manual segmentation.
Keywords: 18F-PSMA-1007 PET; machine learning; prostate; radiomics; reproducibility; robustness; segmentation.