Dissecting Machine-Learning Prediction of Molecular Activity: Is an Applicability Domain Needed for Quantitative Structure-Activity Relationship Models Based on Deep Neural Networks?

J Chem Inf Model. 2019 Jan 28;59(1):117-126. doi: 10.1021/acs.jcim.8b00348. Epub 2018 Nov 21.


Deep neural networks (DNNs) are the major drivers of recent progress in artificial intelligence. They have emerged as the machine-learning method of choice in solving image and speech recognition problems, and their potential has raised the expectation of similar breakthroughs in other fields of study. In this work, we compared three machine-learning methods in their ability to predict the molecular activities of 21 in vivo and in vitro data sets: DNN, random forest (a popular conventional method), and variable nearest neighbor (arguably the simplest method). Surprisingly, the overall performance of the three methods was similar. For molecules with structurally close near neighbors in the training sets, all methods gave reliable predictions, whereas for molecules increasingly dissimilar to the training molecules, all three methods gave progressively poorer predictions. For molecules sharing little to no structural similarity with the training molecules, all three methods gave a nearly constant value, approximately the average activity of all training molecules, as their predictions. The results confirm conclusions deduced from analyzing molecular applicability domains for accurate predictions, i.e., the most important determinant of the accuracy of predicting a molecule is its similarity to the training samples. This highlights the fact that even in the age of deep learning, developing a truly high-quality model relies less on the choice of machine-learning approach and more on experimental efforts to generate sufficient training data for structurally diverse compounds. The results also indicate that the distance to training molecules offers a natural and intuitive basis for defining applicability domains to flag reliable and unreliable quantitative structure-activity relationship predictions.
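The abstract's closing point, that distance to the training molecules can define an applicability domain, can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes each molecule is represented as a set of "on" fingerprint bits, uses Tanimoto distance as the dissimilarity measure, and applies a hypothetical cutoff of 0.6; the abstract specifies none of these. The function names and toy fingerprints are likewise invented for illustration.

```python
from typing import FrozenSet, Sequence

# Assumption: a molecule is a set of "on" fingerprint bit indices.
Fingerprint = FrozenSet[int]


def tanimoto_distance(a: Fingerprint, b: Fingerprint) -> float:
    """Return 1 - |a & b| / |a | b| (0.0 = identical bits, 1.0 = no shared bits)."""
    if not a and not b:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - len(a & b) / len(a | b)


def nearest_training_distance(query: Fingerprint,
                              training: Sequence[Fingerprint]) -> float:
    """Distance from the query to its closest training molecule."""
    return min(tanimoto_distance(query, fp) for fp in training)


def in_domain(query: Fingerprint,
              training: Sequence[Fingerprint],
              threshold: float = 0.6) -> bool:
    """Flag a prediction as reliable if a training molecule lies within the cutoff.

    The 0.6 default is a hypothetical value, not one taken from the paper.
    """
    return nearest_training_distance(query, training) <= threshold


# Toy fingerprints, invented for illustration only.
training = [frozenset({1, 2, 3, 4}), frozenset({2, 3, 4, 5}), frozenset({10, 11, 12})]
close_query = frozenset({1, 2, 3, 5})   # shares 3 of 5 bits with a training molecule
far_query = frozenset({20, 21, 22})     # shares no bits with any training molecule

close_ok = in_domain(close_query, training)
far_ok = in_domain(far_query, training)
```

Under these assumptions, `close_query` falls inside the domain and `far_query` outside it. In practice the cutoff would have to be calibrated against held-out prediction error for each data set.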

Publication types

  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Databases, Chemical
  • Drug Evaluation, Preclinical*
  • Machine Learning
  • Models, Molecular*
  • Molecular Structure*
  • Neural Networks, Computer
  • Quantitative Structure-Activity Relationship
  • Workflow