Building Quantitative Structure-Activity Relationship Models Using Bayesian Additive Regression Trees

J Chem Inf Model. 2019 Jun 24;59(6):2642-2655. doi: 10.1021/acs.jcim.9b00094. Epub 2019 May 6.

Abstract

Quantitative structure-activity relationship (QSAR) modeling is a widely used technique for predicting the biological activity of a molecule from the information contained in its molecular descriptors. The large number of compounds and descriptors, together with the sparseness of the descriptors, poses important challenges to the traditional statistical methods and machine learning (ML) algorithms, such as random forest (RF), used in this field. Recently, Bayesian Additive Regression Trees (BART), a flexible Bayesian nonparametric regression approach, has been demonstrated to be competitive with widely used ML approaches. Rather than focusing only on accurate point estimation, BART is formulated entirely in a hierarchical Bayesian modeling framework, which allows one to quantify uncertainty and hence to provide both point and interval estimates for a variety of quantities of interest. We studied BART as a model builder for QSAR and demonstrated that the approach tends to have predictive performance comparable to RF. More importantly, we investigated BART's natural capability to analyze truncated (or qualified) data, to generate interval estimates for molecular activities as well as descriptor importance, and to conduct model diagnosis, tasks that are not easily handled by other approaches.
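
To illustrate what "both point and interval estimation" can mean in practice, the following minimal sketch summarizes posterior draws of a predicted activity into a posterior mean and an equal-tailed 95% interval. It is not the authors' implementation: the helper name summarize_draws, the compound names, and the simulated draws are all hypothetical stand-ins for the output of a fitted Bayesian sampler.

```python
import random
import statistics


def summarize_draws(draws, level=0.95):
    """Return (posterior mean, lower, upper) for one compound's posterior draws.

    The interval is an equal-tailed credible interval at the given level.
    """
    ordered = sorted(draws)
    n = len(ordered)

    def quantile(q):
        # Linear interpolation between the two nearest order statistics.
        pos = q * (n - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])

    tail = (1.0 - level) / 2.0
    return statistics.fmean(ordered), quantile(tail), quantile(1.0 - tail)


# Hypothetical stand-in for sampler output: 1000 posterior draws of the
# predicted activity for each of three made-up compounds (not real data).
rng = random.Random(0)
posterior_draws = {
    "compound_A": [rng.gauss(6.2, 0.3) for _ in range(1000)],
    "compound_B": [rng.gauss(5.1, 0.8) for _ in range(1000)],
    "compound_C": [rng.gauss(7.4, 0.1) for _ in range(1000)],
}

summaries = {name: summarize_draws(d) for name, d in posterior_draws.items()}
for name, (mean, lower, upper) in summaries.items():
    print(f"{name}: mean={mean:.2f}, 95% interval=({lower:.2f}, {upper:.2f})")
```

A wider interval (as for the hypothetical compound_B) signals a less certain prediction, which is information a point estimate alone does not convey.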

MeSH terms

  • Algorithms
  • Bayes Theorem
  • Drug Discovery / methods*
  • Machine Learning
  • Models, Chemical
  • Pharmaceutical Preparations / chemistry
  • Quantitative Structure-Activity Relationship*
  • Regression Analysis
  • Small Molecule Libraries / chemistry

Substances

  • Pharmaceutical Preparations
  • Small Molecule Libraries