A hybrid sampling algorithm combining M-SMOTE and ENN based on Random forest for medical imbalanced data
- PMID: 32512209
- DOI: 10.1016/j.jbi.2020.103465
Abstract
Imbalanced data classification is a common problem in medical diagnosis. Traditional classification algorithms usually assume that each class contains a similar number of samples and that every misclassification carries the same cost during training. In practice, misclassifying a patient costs more than misclassifying a healthy individual. How to improve the identification of patients without degrading the classification of healthy individuals is therefore an urgent problem. To address imbalanced data classification in medical diagnosis, we propose a hybrid sampling algorithm called RFMSE, which combines the Misclassification-oriented Synthetic minority over-sampling technique (M-SMOTE) and Edited nearest neighbor (ENN) based on Random forest (RF). The algorithm consists of three main parts. First, M-SMOTE increases the number of samples in the minority class, with the over-sampling rate set to the misclassification rate of RF. Then, ENN removes noisy samples from the majority class. Finally, RF performs classification prediction on the samples obtained after hybrid sampling, and the stopping criterion for the iterations is determined by the change in a classification index, the Matthews Correlation Coefficient (MCC): when the value of MCC drops continuously, the iterations stop. Extensive experiments on ten UCI datasets demonstrate that RFMSE can effectively address imbalanced data classification. Compared with traditional algorithms, our method improves F-value and MCC more effectively.
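The building blocks named in the abstract can be illustrated with a minimal, standard-library-only Python sketch. It is not the authors' implementation: the abstract does not describe how M-SMOTE selects samples, so plain SMOTE interpolation stands in for it, and the RF classifier is left out entirely. All function names and the `patience` parameter are hypothetical, chosen here as one possible reading of "continuously drops".

```python
import math
import random


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention assumed here: return 0.0 when the MCC is undefined.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def nearest(point, pool, k):
    """Indices of the (up to) k points in `pool` closest to `point`."""
    order = sorted(range(len(pool)), key=lambda i: math.dist(point, pool[i]))
    return order[:k]


def smote_like(minority, n_new, k=3, rng=None):
    """Plain SMOTE interpolation (a stand-in for M-SMOTE): each synthetic
    point lies on the segment between a minority sample and one of its
    k nearest minority neighbors."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = minority[:i] + minority[i + 1:]
        neighbor = others[rng.choice(nearest(base, others, k))]
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


def enn_filter_majority(X, y, majority_label=0, k=3):
    """ENN applied to the majority class only: drop a majority sample when
    fewer than half of its k nearest neighbors share its label."""
    keep_X, keep_y = [], []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi == majority_label:
            others = X[:i] + X[i + 1:]
            labels = y[:i] + y[i + 1:]
            votes = [labels[j] for j in nearest(xi, others, k)]
            if sum(v == yi for v in votes) * 2 < len(votes):
                continue  # treated as noise and removed
        keep_X.append(xi)
        keep_y.append(yi)
    return keep_X, keep_y


def should_stop(mcc_history, patience=2):
    """True once MCC has dropped for `patience` consecutive iterations
    (`patience` is an assumption, not a value from the paper)."""
    if len(mcc_history) <= patience:
        return False
    recent = mcc_history[-(patience + 1):]
    return all(later < earlier for earlier, later in zip(recent, recent[1:]))
```

In a full loop, one would presumably train a classifier on the resampled data, compute `mcc` on held-out predictions, append it to a history list, and end the iterations once `should_stop` returns true.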
Keywords: Data resampling; Imbalanced data classification; Medical diagnosis; Random forest.
Copyright © 2020 Elsevier Inc. All rights reserved.
Conflict of interest statement
Declaration of Competing Interest: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.