General Approach to Estimate Error Bars for Quantitative Structure-Activity Relationship Predictions of Molecular Activity

Ruifeng Liu; Kyle P Glover; Michael G Feasel; Anders Wallqvist

doi:10.1021/acs.jcim.8b00114

J Chem Inf Model. 2018 Aug 27;58(8):1561-1575. doi: 10.1021/acs.jcim.8b00114. Epub 2018 Jul 17.

Authors

Ruifeng Liu¹, Kyle P Glover², Michael G Feasel³, Anders Wallqvist¹

Affiliations

¹ Department of Defense Biotechnology High Performance Computing Software Applications Institute, Telemedicine and Advanced Technology Research Center, U.S. Army Medical Research and Materiel Command, Fort Detrick, Maryland 21702, United States.
² Defense Threat Reduction Agency, Aberdeen Proving Ground, Maryland 21010, United States.
³ U.S. Army-Edgewood Chemical Biological Center, Operational Toxicology, Aberdeen Proving Ground, Maryland 21010, United States.

PMID: 29949366
DOI: 10.1021/acs.jcim.8b00114

Abstract

Key requirements for quantitative structure-activity relationship (QSAR) models to gain acceptance by regulatory authorities include a defined domain of applicability (DA) and appropriate measures of goodness-of-fit, robustness, and predictivity. Hence, many DA metrics have been developed over the past two decades. The most intuitive are perhaps distance-to-model metrics, which are most commonly defined in terms of the mean distance between a molecule and its k nearest training samples. Detailed evaluations have shown that the variance of predictions by an ensemble of QSAR models may serve as a DA metric and can outperform distance-to-model metrics. Intriguingly, the performance of the ensemble variance metric has led researchers to conclude that the error of predicting a new molecule does not depend on the input descriptors or machine-learning methods but on its distance to the training molecules. This implies that the distance to training samples may serve as the basis for developing a high-performance DA metric. In this article, we introduce a new Tanimoto distance-based DA metric called the sum of distance-weighted contributions (SDC), which takes into account contributions from all molecules in a training set. Using four acute chemical toxicity data sets of varying sizes and four other molecular property data sets, we demonstrate that SDC correlates well with the prediction error for all data sets regardless of the machine-learning methods and molecular descriptors used to build the QSAR models. Using the acute toxicity data sets, we compared the distribution of prediction errors with respect to SDC, the mean distance to k-nearest training samples, and the variance of random forest predictions. The results showed that the correlation with the prediction error was highest for SDC.
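To make the two distance-based metrics concrete, the following is a minimal sketch rather than the authors' implementation. The abstract does not state the SDC weighting function, so the exponential weight exp(-a·d/(1-d)) and the value a = 3 used here are assumptions chosen for illustration. The function names, the set-of-on-bits fingerprint representation, and the toy fingerprints are also hypothetical.

```python
import math


def tanimoto_distance(fp_a, fp_b):
    """Tanimoto distance, 1 - |A & B| / |A | B|, for fingerprints
    represented as sets of on-bit indices."""
    union = len(fp_a | fp_b)
    if union == 0:
        return 0.0  # convention for two empty fingerprints (assumption)
    return 1.0 - len(fp_a & fp_b) / union


def sdc(query_fp, training_fps, a=3.0):
    """Sum of distance-weighted contributions over ALL training molecules.

    The weight exp(-a * d / (1 - d)) is an assumed form: it equals 1 for an
    identical molecule and decays to 0 as the Tanimoto distance d approaches 1.
    """
    total = 0.0
    for fp in training_fps:
        d = tanimoto_distance(query_fp, fp)
        if d < 1.0:  # molecules sharing no bits contribute nothing
            total += math.exp(-a * d / (1.0 - d))
    return total


def mean_knn_distance(query_fp, training_fps, k=3):
    """Mean Tanimoto distance to the k nearest training molecules,
    the conventional distance-to-model metric."""
    distances = sorted(tanimoto_distance(query_fp, fp) for fp in training_fps)
    nearest = distances[:k]
    return sum(nearest) / len(nearest)


# Toy fingerprints (made up for illustration only).
training = [{1, 2, 3, 4}, {1, 2, 3, 5}, {10, 11, 12}]
query_near = {1, 2, 3, 4}   # identical to one training molecule
query_far = {20, 21}        # shares no bits with any training molecule

print(sdc(query_near, training), sdc(query_far, training))
print(mean_knn_distance(query_near, training, k=2))
```

In this sketch a high SDC means the query has many close neighbors in the training set, whereas the k-nearest-neighbor metric looks only at the k closest molecules and ignores the rest.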
We also demonstrate that SDC allows for the development of robust root mean squared error (RMSE) models and makes it possible to not only give a QSAR prediction but also provide an individual RMSE estimate for each molecule. Because SDC does not depend on a specific machine-learning method, it represents a canonical measure that can be widely used to estimate individual molecule prediction errors for any machine-learning method.
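One simple way to turn SDC into a per-molecule RMSE estimate is to bin validation molecules by SDC and compute the RMSE of the prediction errors in each bin. This is a hedged sketch only: the abstract does not describe the form of the RMSE models, so the binning approach, the function names, the bin edges, and the numbers below are all assumptions made for illustration.

```python
import math
from bisect import bisect_right


def rmse_by_sdc_bin(sdc_values, errors, bin_edges):
    """Return one RMSE per SDC bin (None for empty bins).

    bin_edges must be sorted; they define len(bin_edges) + 1 bins:
    (-inf, e0), [e0, e1), ..., [e_last, +inf).
    """
    n_bins = len(bin_edges) + 1
    squared_sums = [0.0] * n_bins
    counts = [0] * n_bins
    for s, e in zip(sdc_values, errors):
        i = bisect_right(bin_edges, s)  # index of the bin containing s
        squared_sums[i] += e * e
        counts[i] += 1
    return [math.sqrt(t / n) if n else None
            for t, n in zip(squared_sums, counts)]


def estimate_rmse(sdc_value, bin_edges, bin_rmse):
    """Look up the RMSE estimate for a new molecule from its SDC."""
    return bin_rmse[bisect_right(bin_edges, sdc_value)]


# Made-up validation data: SDC values and signed prediction errors.
sdc_values = [0.1, 0.2, 1.5, 2.0, 6.0, 8.0]
errors = [2.0, -2.0, 1.0, -1.0, 0.5, -0.5]
bin_edges = [1.0, 5.0]  # hypothetical bin boundaries

bin_rmse = rmse_by_sdc_bin(sdc_values, errors, bin_edges)
print(bin_rmse)
print(estimate_rmse(3.0, bin_edges, bin_rmse))
```

Because the lookup depends only on SDC and not on how the prediction was made, the same table structure could be rebuilt for any machine-learning method, which is the property the abstract emphasizes.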

Publication types

Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Algorithms
Drug Discovery* / methods
Humans
Machine Learning
Models, Statistical
Quantitative Structure-Activity Relationship*
Small Molecule Libraries / chemistry
Small Molecule Libraries / pharmacology
Small Molecule Libraries / toxicity
Uncertainty

Substances

Small Molecule Libraries