Open Source Bayesian Models. 2. Mining a "Big Dataset" To Create and Validate Models with ChEMBL

J Chem Inf Model. 2015 Jun 22;55(6):1246-60. doi: 10.1021/acs.jcim.5b00144. Epub 2015 Jun 3.


In an associated paper, we have described a reference implementation of Laplacian-corrected naïve Bayesian model building using extended connectivity (ECFP)- and molecular function class fingerprints of maximum diameter 6 (FCFP)-type fingerprints. As a follow-up, we have now undertaken a large-scale validation study in order to ensure that the technique generalizes to a broad variety of drug discovery datasets. To achieve this, we have used the ChEMBL (version 20) database and split it into more than 2000 separate datasets, each of which consists of compounds and measurements with the same target and activity measurement. In order to test these datasets with the two-state Bayesian classification, we developed an automated algorithm for detecting a suitable threshold for active/inactive designation, which we applied to all collections. With these datasets, we were able to establish that our Bayesian model implementation is effective for the large majority of cases, and we were able to quantify the impact of fingerprint folding on the receiver operator curve cross-validation metrics. We were also able to study the impact that the choice of training/testing set partitioning has on the resulting recall rates. The datasets have been made publicly available to be downloaded, along with the corresponding model data files, which can be used in conjunction with the CDK and several mobile apps. We have also explored some novel visualization methods which leverage the structural origins of the ECFP/FCFP fingerprints to attribute regions of a molecule responsible for positive and negative contributions to activity. The ability to score molecules across thousands of relevant datasets across organisms also may help to access desirable and undesirable off-target effects as well as suggest potential targets for compounds derived from phenotypic screens.

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Bayes Theorem
  • Data Mining*
  • Databases, Pharmaceutical*
  • Drug Discovery
  • Models, Statistical*
  • Mycobacterium tuberculosis / drug effects
  • Software