Random forest: a classification and regression tool for compound classification and QSAR modeling
- PMID: 14632445
- DOI: 10.1021/ci034160g
Abstract
A new classification and regression tool, Random Forest, is introduced and investigated for predicting a compound's quantitative or categorical biological activity based on a quantitative description of the compound's molecular structure. Random Forest is an ensemble of unpruned classification or regression trees created by using bootstrap samples of the training data and random feature selection in tree induction. Prediction is made by aggregating the predictions of the ensemble (majority vote for classification, averaging for regression). We built predictive models for six cheminformatics data sets. Our analysis demonstrates that Random Forest is a powerful tool whose performance is among the most accurate of the methods available to date. We also present three additional features of Random Forest: built-in performance assessment, a measure of relative importance of descriptors, and a measure of compound similarity that is weighted by the relative importance of descriptors. It is the combination of relatively high prediction accuracy and its collection of desirable features that makes Random Forest uniquely suited for modeling in cheminformatics.
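The procedure summarized in the abstract (bootstrap samples, random feature selection during tree induction, unpruned trees, majority vote, and a built-in performance estimate) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The function names (`fit_forest`, `oob_error`, and so on), the Gini split criterion, the square-root default for `mtry`, and the synthetic three-descriptor data set are all assumptions made for illustration. The sketch covers classification only and leaves out descriptor importance and compound similarity.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(X, y, rows, features):
    """Best (feature, threshold) among the candidate features, or None."""
    best, best_score = None, gini([y[i] for i in rows])
    for f in features:
        values = sorted({X[i][f] for i in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0  # midpoint between neighbouring values
            left = [y[i] for i in rows if X[i][f] <= t]
            right = [y[i] for i in rows if X[i][f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if score < best_score:  # keep only strict improvements
                best, best_score = (f, t), score
    return best


def grow_tree(X, y, rows, mtry, rng):
    """Grow an unpruned tree; leaves are labels, inner nodes are tuples."""
    labels = [y[i] for i in rows]
    if len(set(labels)) == 1:
        return labels[0]
    # Random feature selection: only mtry descriptors are tried at this node.
    split = best_split(X, y, rows, rng.sample(range(len(X[0])), mtry))
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    f, t = split
    left = [i for i in rows if X[i][f] <= t]
    right = [i for i in rows if X[i][f] > t]
    return (f, t, grow_tree(X, y, left, mtry, rng), grow_tree(X, y, right, mtry, rng))


def predict_tree(node, x):
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if x[f] <= t else right
    return node


def fit_forest(X, y, n_trees=25, mtry=None, seed=0):
    """Fit n_trees trees, each on its own bootstrap sample of the rows."""
    rng = random.Random(seed)
    n = len(X)
    if mtry is None:
        mtry = max(1, int(len(X[0]) ** 0.5))  # assumed default, not from the abstract
    forest = []
    for _ in range(n_trees):
        bag = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        forest.append((grow_tree(X, y, bag, mtry, rng), set(bag)))
    return forest


def predict(forest, x):
    """Aggregate the ensemble by majority vote."""
    return Counter(predict_tree(tree, x) for tree, _ in forest).most_common(1)[0][0]


def oob_error(forest, X, y):
    """Error rate when each row is voted on only by trees that never saw it."""
    wrong = counted = 0
    for i, x in enumerate(X):
        votes = Counter(predict_tree(t, x) for t, bag in forest if i not in bag)
        if votes:
            counted += 1
            wrong += votes.most_common(1)[0][0] != y[i]
    return wrong / counted if counted else float("nan")


# Synthetic data (hypothetical): two informative descriptors, one noise descriptor.
data_rng = random.Random(1)
X, y = [], []
for label, (lo, hi) in enumerate([(0.0, 4.0), (6.0, 10.0)]):
    for _ in range(30):
        X.append([data_rng.uniform(lo, hi), data_rng.uniform(lo, hi),
                  data_rng.uniform(0.0, 10.0)])
        y.append(label)

forest = fit_forest(X, y, n_trees=25, seed=0)
print(predict(forest, [1.0, 1.0, 5.0]), predict(forest, [9.0, 9.0, 5.0]))
print("OOB error:", oob_error(forest, X, y))
```

The out-of-bag estimate uses the rows left out of each bootstrap sample as a test set for that tree, which is one way to obtain a performance assessment without a separate validation split. For regression, the leaves would hold mean activities and `predict` would average the tree outputs instead of taking a vote.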
Similar articles
- Boosting: an ensemble learning tool for compound classification and QSAR modeling. J Chem Inf Model. 2005 May-Jun;45(3):786-99. doi: 10.1021/ci0500379. PMID: 15921468
- Application of the random forest method in studies of local lymph node assay based skin sensitization data. J Chem Inf Model. 2005 Jul-Aug;45(4):952-64. doi: 10.1021/ci050049u. PMID: 16045289
- Contemporary QSAR classifiers compared. J Chem Inf Model. 2007 Jan-Feb;47(1):219-27. doi: 10.1021/ci600332j. PMID: 17238267
- Three useful dimensions for domain applicability in QSAR models using random forest. J Chem Inf Model. 2012 Mar 26;52(3):814-23. doi: 10.1021/ci300004n. Epub 2012 Mar 9. PMID: 22385389
- Development of linear, ensemble, and nonlinear models for the prediction and interpretation of the biological activity of a set of PDGFR inhibitors. J Chem Inf Comput Sci. 2004 Nov-Dec;44(6):2179-89. doi: 10.1021/ci049849f. PMID: 15554688
Cited by
- Benchmarking ligand-based virtual High-Throughput Screening with the PubChem database. Molecules. 2013 Jan 8;18(1):735-56. doi: 10.3390/molecules18010735. PMID: 23299552 Free PMC article.
- Integrative analyses of immune-related biomarkers and associated mechanisms in coronary heart disease. BMC Med Genomics. 2022 Oct 20;15(1):219. doi: 10.1186/s12920-022-01375-w. PMID: 36266609 Free PMC article.
- Development and validation of smartwatch-based activity recognition models for rigging crew workers on cable logging operations. PLoS One. 2021 May 12;16(5):e0250624. doi: 10.1371/journal.pone.0250624. eCollection 2021. PMID: 33979355 Free PMC article.
- The parameter sensitivity of random forests. BMC Bioinformatics. 2016 Sep 1;17(1):331. doi: 10.1186/s12859-016-1228-x. PMID: 27586051 Free PMC article.
- Predicting total, abdominal, visceral and hepatic adiposity with circulating biomarkers in Caucasian and Japanese American women. PLoS One. 2012;7(8):e43502. doi: 10.1371/journal.pone.0043502. Epub 2012 Aug 17. PMID: 22912885 Free PMC article.
