Controlled exploration of chemical space by machine learning of coarse-grained representations

Phys Rev E. 2019 Sep;100(3-1):033302. doi: 10.1103/PhysRevE.100.033302.

Abstract

The size of chemical compound space is too large to be probed exhaustively. This leads high-throughput protocols to drastically subsample and results in sparse and nonuniform datasets. Rather than arbitrarily selecting compounds, we systematically explore chemical space according to the target property of interest. We first perform importance sampling by introducing a Markov chain Monte Carlo scheme across compounds. We then train a machine learning (ML) model on the sampled data to expand the region of chemical space probed. Our boosting procedure enhances the number of compounds by a factor 2 to 10, enabled by the ML model's coarse-grained representation, which both simplifies the structure-property relationship and reduces the size of chemical space. The ML model correctly recovers linear relationships between transfer free energies. These linear relationships correspond to features that are global to the dataset, marking the region of chemical space up to which predictions are reliable; this is a more robust alternative to the predictive variance. Bridging coarse-grained simulations with ML gives rise to an unprecedented database of drug-membrane insertion free energies for 1.3 million compounds.