Missing data imputation using statistical and machine learning methods in a real breast cancer problem
- PMID: 20638252
- DOI: 10.1016/j.artmed.2010.05.002
Abstract
Objectives: Missing data imputation is an important task when it is crucial to use all available data rather than discard records with missing values. This work evaluates several statistical and machine learning imputation methods by using the imputed data to predict recurrence in patients from an extensive real breast cancer data set.
Materials and methods: Imputation methods based on statistical techniques (e.g., mean, hot-deck and multiple imputation) and on machine learning techniques (e.g., multi-layer perceptron (MLP), self-organising maps (SOM) and k-nearest neighbour (KNN)) were applied to data collected through the "El Álamo-I" project. The results were then compared to those obtained with listwise deletion (LD), which discards incomplete records instead of imputing them. The database includes demographic, therapeutic and recurrence-survival information from 3679 women with operable invasive breast cancer diagnosed in 32 different hospitals belonging to the Spanish Breast Cancer Research Group (GEICAM). Prediction accuracy for early cancer relapse was measured using artificial neural networks (ANNs), with a separate ANN fitted to each data set with imputed missing values.
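As a rough illustration of how three of the approaches named above differ, the sketch below applies listwise deletion, mean imputation and KNN imputation to a tiny numeric table. It is not the authors' implementation: the function names, the toy values, the choice of k and the rescaled Euclidean distance over shared columns are all assumptions made for this example, and it assumes every column has at least one observed value.

```python
import math


def listwise_delete(rows):
    """Drop every record that contains at least one missing value (None)."""
    return [r for r in rows if None not in r]


def mean_impute(rows):
    """Replace each None with the mean of the observed values in its column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]


def _distance(a, b):
    """Euclidean distance over columns observed in both rows.

    The squared sum is rescaled by (total columns / shared columns) so rows
    with few shared columns are not favoured. This weighting is an assumption.
    """
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) * len(a) / len(shared))


def knn_impute(rows, k=2):
    """Replace each None with the column mean over the k nearest donor rows."""
    result = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, v in enumerate(row):
            if v is not None:
                continue
            # Donors are the other rows that have an observed value in column j.
            donors = sorted(
                (_distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )
            nearest = [val for _, val in donors[:k]]
            result[i][j] = sum(nearest) / len(nearest)
    return result


# Hypothetical toy table: the second record is missing its second feature.
toy = [
    [1.0, 2.0],
    [2.0, None],
    [3.0, 6.0],
    [10.0, 20.0],
]
complete_cases = listwise_delete(toy)
mean_filled = mean_impute(toy)
knn_filled = knn_impute(toy, k=2)
```

On this toy table, mean imputation is pulled upward by the outlying fourth record, while KNN imputation borrows only from the two records closest to the incomplete one.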
Results: The imputation methods based on machine learning algorithms outperformed the statistical imputation methods in the prediction of patient outcome. Friedman's test revealed a significant difference (p=0.0091) in the observed area under the ROC curve (AUC) values, and the pairwise comparison test showed that the AUCs for MLP, KNN and SOM were significantly higher (p=0.0053, p=0.0048 and p=0.0071, respectively) than the AUC from the LD-based prognosis model.
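For readers unfamiliar with Friedman's test, the following minimal sketch computes its statistic from a table of AUC values, with one row per evaluation block (for example, a cross-validation fold) and one column per method. The AUC values are made-up numbers, not results from the study, and the blocking scheme is an assumption. The sketch uses average ranks for ties but omits the tie correction to the statistic. Under the null hypothesis the statistic is approximately chi-square distributed with (number of methods - 1) degrees of freedom.

```python
def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions in the run
        for idx in order[pos:end + 1]:
            ranks[idx] = shared
        pos = end + 1
    return ranks


def friedman_statistic(blocks):
    """Friedman chi-square statistic (no tie correction).

    blocks: list of rows, each holding one measurement per method.
    """
    n = len(blocks)      # number of blocks
    k = len(blocks[0])   # number of methods being compared
    rank_sums = [0.0] * k
    for block in blocks:
        for j, r in enumerate(average_ranks(block)):
            rank_sums[j] += r
    return 12.0 / (n * k * (k + 1)) * sum(s * s for s in rank_sums) - 3.0 * n * (k + 1)


# Hypothetical AUCs for three methods over four blocks (illustrative only).
toy_aucs = [
    [0.70, 0.75, 0.80],
    [0.68, 0.74, 0.79],
    [0.71, 0.73, 0.78],
    [0.69, 0.76, 0.77],
]
q = friedman_statistic(toy_aucs)
```

Because the three hypothetical methods are ranked identically in every block, the statistic here reaches its maximum possible value for four blocks and three methods.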
Conclusion: The methods based on machine learning techniques were the best suited for the imputation of missing values and led to a significant improvement in prognosis accuracy compared to imputation methods based on statistical procedures.
Copyright © 2010 Elsevier B.V. All rights reserved.
