Background: Ovarian cancer is a lethal gynecological malignancy, with ~70% of patients diagnosed at advanced stages (5-year survival <30%) due to suboptimal early detection tools. Current modalities [e.g., carbohydrate antigen 125 (CA125), imaging] lack adequate sensitivity and specificity, while existing machine learning (ML)-based diagnostic models are constrained by small sample sizes and insufficient validation. This study aims to develop and validate a robust ML-based diagnostic model for ovarian cancer using large-scale gene expression data, identify core diagnostic biomarkers, and explore their functional and immune-related mechanisms.
Methods: We analyzed five Gene Expression Omnibus (GEO) datasets, with GSE26712, GSE29156, and GSE40595 assigned as the training cohort, and GSE66957 and GSE119054 as independent validation cohorts. Following batch effect correction, 113 ML algorithms were systematically compared. The optimal model was further validated, and its core genes, termed model genes (Mgenes), were subjected to functional enrichment and immune cell correlation analyses.
Results: The least absolute shrinkage and selection operator (Lasso) + NaiveBayes model demonstrated the strongest diagnostic performance, with area under the receiver operating characteristic curve (AUC) values of 0.991 in the training set, 0.889 in GSE119054, and 0.936 in GSE66957, and achieved 100% recall in both validation cohorts. Twelve Mgenes were identified, among which CP exhibited the highest diagnostic value (AUC = 0.966). Functional enrichment revealed that Mgenes were predominantly involved in cell cycle regulation and DNA replication pathways, and correlation analyses showed associations with key immune subsets (e.g., MAOB with regulatory T cells, STAR with CD8+ T cells).
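The two metrics reported above, AUC and recall, answer different questions, and a model can reach 100% recall while its AUC stays below 1. The following is a minimal sketch in plain Python, not the study's code: the function names, the 0.5 decision threshold, and the toy labels and scores are all hypothetical and chosen only to illustrate the definitions.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive sample receives the higher score; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def recall(labels, scores, threshold=0.5):
    """Recall (sensitivity): share of true cases scored at or above the
    threshold. The 0.5 default is an assumption for illustration."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    if tp + fn == 0:
        raise ValueError("need at least one positive sample")
    return tp / (tp + fn)


# Hypothetical toy data: 1 = tumor, 0 = normal; scores are predicted
# probabilities of tumor. These values are invented, not study results.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.7, 0.3, 0.1]

print(roc_auc(labels, scores))  # 8 of 9 pairs are ranked correctly
print(recall(labels, scores))   # every tumor sample clears the threshold
```

In this toy example every tumor sample is detected, yet one normal sample scores above the threshold, so recall is perfect while ranking is not. Recall therefore needs to be read alongside AUC and specificity.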
Conclusions: The Lasso + NaiveBayes model supports robust ovarian cancer diagnosis, and its high recall prioritizes the identification of all potential cases. The identified Mgenes are high-performing diagnostic biomarkers and candidate functional mediators of tumorigenesis, laying a foundation for early detection strategies and mechanistic research into ovarian cancer.
Keywords: Ovarian cancer; diagnostic model; machine learning (ML); tumor immune microenvironment.
© AME Publishing Company.