Optimal classifier for imbalanced data using Matthews Correlation Coefficient metric
- PMID: 28574989
- PMCID: PMC5456046
- DOI: 10.1371/journal.pone.0177678
Abstract
Data imbalance is frequently encountered in biomedical applications. Resampling techniques can be used in binary classification to tackle this issue. However, such solutions are undesirable when the number of samples in the small class is limited. Moreover, the use of inadequate performance metrics, such as accuracy, leads to poor generalization because classifiers tend to predict the largest class. One good approach to this issue is to optimize performance metrics that are designed to handle data imbalance. The Matthews Correlation Coefficient (MCC) is widely used in bioinformatics as a performance metric. We are interested in developing a new classifier based on the MCC metric to handle imbalanced data. We derive an optimal Bayes classifier for the MCC metric using an approach based on the Fréchet derivative. We show that the proposed algorithm has the desirable theoretical property of consistency. Using simulated data, we verify the correctness of our optimality result by searching the space of all possible binary classifiers. The proposed classifier is evaluated on 64 datasets spanning a wide range of data imbalance. We compare both classification performance and CPU efficiency for three classifiers: 1) the proposed algorithm (MCC-classifier), 2) the Bayes classifier with a default threshold (MCC-base), and 3) imbalanced SVM (SVM-imba). The experimental evaluation shows that MCC-classifier performs close to SVM-imba while being simpler and more efficient.
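To make the role of the metric concrete, the following is a minimal sketch, not the authors' algorithm. It computes MCC from confusion-matrix counts using the standard definition MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)), and then picks the decision threshold on estimated class probabilities that maximizes MCC on a sample. The plain grid search stands in for the Fréchet-derivative-based derivation described in the abstract, and the function names (`mcc`, `confusion`, `best_threshold`) and the toy data are assumptions made for illustration only.

```python
import math


def mcc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention: MCC is set to 0 when a row or column is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


def confusion(y_true, scores, threshold):
    """Count TP, FP, TN, FN when predicting class 1 for scores >= threshold."""
    tp = fp = tn = fn = 0
    for y, s in zip(y_true, scores):
        pred = 1 if s >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def best_threshold(y_true, scores):
    """Search the observed scores for the threshold with the highest MCC.

    Starts from the default threshold 0.5 as the baseline (hypothetical
    stand-in for a default-threshold classifier).
    """
    best_t = 0.5
    best_m = mcc(*confusion(y_true, scores, best_t))
    for t in sorted(set(scores)):
        m = mcc(*confusion(y_true, scores, t))
        if m > best_m:
            best_t, best_m = t, m
    return best_t, best_m


# Toy imbalanced sample (illustrative values): 2 positives, 8 negatives.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.4, 0.3, 0.1, 0.1, 0.2, 0.05, 0.15, 0.1, 0.2, 0.25]

default_mcc = mcc(*confusion(y_true, scores, 0.5))
tuned_t, tuned_mcc = best_threshold(y_true, scores)
print(f"default threshold 0.5: MCC = {default_mcc:.3f}")
print(f"tuned threshold {tuned_t}: MCC = {tuned_mcc:.3f}")
```

On this toy sample the default threshold of 0.5 predicts the majority class for every point, which gives 80% accuracy but an MCC of 0, while a lower threshold separates the two classes. This illustrates why optimizing MCC rather than accuracy can matter when the small class is rare.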
Similar articles
- Class-imbalanced classifiers for high-dimensional data. Brief Bioinform. 2013 Jan;14(1):13-26. doi: 10.1093/bib/bbs006. Epub 2012 Mar 9. PMID: 22408190. Review.
- Efficient Selection of Gaussian Kernel SVM Parameters for Imbalanced Data. Genes (Basel). 2023 Feb 25;14(3):583. doi: 10.3390/genes14030583. PMID: 36980852. Free PMC article.
- Improving classification of mature microRNA by solving class imbalance problem. Sci Rep. 2016 May 16;6:25941. doi: 10.1038/srep25941. PMID: 27181057. Free PMC article.
- Automatic feed phase identification in multivariate bioprocess profiles by sequential binary classification. Anal Chim Acta. 2017 Aug 22;982:48-61. doi: 10.1016/j.aca.2017.05.034. Epub 2017 Jun 22. PMID: 28734365.
- Predicting membrane protein types using various decision tree classifiers based on various modes of general PseAAC for imbalanced datasets. J Theor Biol. 2017 Dec 21;435:208-217. doi: 10.1016/j.jtbi.2017.09.018. Epub 2017 Sep 20. PMID: 28941868. Review.
Cited by
- Modeling High Energy Molecules and Screening to Find Novel High Energy Candidates. ACS Omega. 2024 Oct 11;9(42):42709-42720. doi: 10.1021/acsomega.4c01070. eCollection 2024 Oct 22. PMID: 39464471. Free PMC article.
- Deep learning estimation of northern hemisphere soil freeze-thaw dynamics using satellite multi-frequency microwave brightness temperature observations. Front Big Data. 2023 Nov 17;6:1243559. doi: 10.3389/fdata.2023.1243559. eCollection 2023. PMID: 38045095. Free PMC article.
- A Framework for Enhancing Stock Investment Performance by Predicting Important Trading Points with Return-Adaptive Piecewise Linear Representation and Batch Attention Multi-Scale Convolutional Recurrent Neural Network. Entropy (Basel). 2023 Oct 30;25(11):1500. doi: 10.3390/e25111500. PMID: 37998192. Free PMC article.
- Associating brain imaging phenotypes and genetic risk factors via a hypergraph based netNMF method. Front Aging Neurosci. 2023 Mar 2;15:1052783. doi: 10.3389/fnagi.2023.1052783. eCollection 2023. PMID: 36936501. Free PMC article.
- Multi-cancer samples clustering via graph regularized low-rank representation method under sparse and symmetric constraints. BMC Bioinformatics. 2019 Dec 30;20(Suppl 22):718. doi: 10.1186/s12859-019-3231-5. PMID: 31888442. Free PMC article.
