Sci Adv. 2018 Jul 25;4(7):eaap7885. doi: 10.1126/sciadv.aap7885. eCollection 2018 Jul.

Deep Reinforcement Learning for De Novo Drug Design

Mariya Popova et al. Sci Adv. 2018.

Free PMC article


We have devised and implemented a novel computational strategy for de novo design of molecules with desired properties, termed ReLeaSE (Reinforcement Learning for Structural Evolution). Built on deep learning and reinforcement learning (RL) approaches, ReLeaSE integrates two deep neural networks, generative and predictive, that are trained separately but used jointly to generate novel targeted chemical libraries. ReLeaSE represents molecules solely by their simplified molecular-input line-entry system (SMILES) strings. Generative models are trained with a stack-augmented memory network to produce chemically feasible SMILES strings, and predictive models are derived to forecast the desired properties of the de novo-generated compounds. In the first phase of the method, the generative and predictive models are trained separately with a supervised learning algorithm. In the second phase, both models are trained jointly with the RL approach to bias the generation of new chemical structures toward those with the desired physical and/or biological properties. In a proof-of-concept study, we used the ReLeaSE method to design chemical libraries biased toward structural complexity, toward compounds with maximal, minimal, or a specific range of physical properties, such as melting point or hydrophobicity, or toward compounds with inhibitory activity against Janus protein kinase 2 (JAK2). The approach proposed herein is generally applicable for generating targeted chemical libraries of novel compounds optimized for either a single desired property or multiple properties.
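The two-phase scheme described above can be sketched in heavily simplified form. In the toy below, a unigram character policy stands in for the paper's stack-augmented RNN generator, and a hand-coded reward function stands in for the predictive model P; the vocabulary, reward, and all names are illustrative assumptions, not the authors' implementation. The point is only to show the phase-2 mechanism: sample a SMILES-like string from the policy, score it, and apply a REINFORCE-style update that biases future generation toward high-reward structures.

```python
import math
import random

random.seed(0)

# Toy vocabulary; "$" is an end-of-sequence token (illustrative only).
VOCAB = ["C", "O", "N", "=", "(", ")", "$"]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class ToyGenerator:
    """Unigram policy standing in for the stack-augmented RNN generator."""

    def __init__(self):
        self.logits = [0.0] * len(VOCAB)

    def sample(self, max_len=20):
        out = []
        for _ in range(max_len):
            ch = random.choices(VOCAB, weights=softmax(self.logits))[0]
            if ch == "$":
                break
            out.append(ch)
        return "".join(out)

    def reinforce_update(self, smiles, reward, lr=0.1):
        # REINFORCE: push token log-probabilities along the policy
        # gradient, scaled by the reward of the sampled string.
        for ch in smiles:
            probs = softmax(self.logits)
            i = VOCAB.index(ch)
            for j in range(len(VOCAB)):
                grad = (1.0 if j == i else 0.0) - probs[j]
                self.logits[j] += lr * reward * grad


def reward_fn(smiles):
    # Stand-in for the predictive model P: favor carbon-rich strings.
    return smiles.count("C") / max(len(smiles), 1)


gen = ToyGenerator()
before = sum(reward_fn(gen.sample()) for _ in range(200)) / 200

# Phase 2 (RL): bias generation toward high-reward structures.
for _ in range(1000):
    s = gen.sample()
    gen.reinforce_update(s, reward_fn(s))

after = sum(reward_fn(gen.sample()) for _ in range(200)) / 200
```

After training, the mean reward of sampled strings rises well above its pre-training baseline, which is exactly the biasing effect the second phase of ReLeaSE is designed to produce (with the RNN generator and learned predictor in place of these stand-ins).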


Fig. 1
Fig. 1. The workflow of deep RL algorithm for generating new SMILES strings of compounds with the desired properties.
(A) Training step of the generative Stack-RNN. (B) Generator step of the generative Stack-RNN. During training, the input token is a character in the currently processed SMILES string from the training set. The model outputs the probability vector pΘ(at|st−1) of the next character given a prefix. The parameter vector Θ is optimized by minimizing the cross-entropy loss. In the generator regime, the input token is a previously generated character, and the next character at is sampled randomly from the distribution pΘ(at|st−1). (C) General pipeline of the RL system for novel compound generation. (D) Scheme of the predictive model. This model takes a SMILES string as input and outputs a single real number, the estimated property value. Model parameters are trained by minimizing the squared (l2) loss.
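The supervised step in panels (A) and (B) amounts to fitting pΘ(at|st−1) by cross-entropy minimization and then sampling from it one character at a time. A minimal count-based bigram sketch (not the paper's Stack-RNN; the tiny training set and start/end markers "^"/"$" are illustrative assumptions) shows the same train-then-sample loop, since count-based maximum likelihood is the cross-entropy minimizer for this model class:

```python
import random
from collections import defaultdict

random.seed(1)

# Toy SMILES-like training set (illustrative only).
TRAIN = ["CCO", "CCN", "CC=O", "c1ccccc1"]

# Count next-character occurrences; "^" marks start, "$" marks end.
counts = defaultdict(lambda: defaultdict(int))
for s in TRAIN:
    seq = "^" + s + "$"
    for prev, nxt in zip(seq, seq[1:]):
        counts[prev][nxt] += 1


def p_next(prev_char):
    """p(at | st-1) under the bigram model: count-based MLE, i.e. the
    cross-entropy minimizer for this (very limited) model class."""
    total = sum(counts[prev_char].values())
    return {ch: n / total for ch, n in counts[prev_char].items()}


def generate(max_len=30):
    """Generator regime: feed each sampled character back as the next input."""
    ch, out = "^", []
    for _ in range(max_len):
        dist = p_next(ch)
        ch = random.choices(list(dist), weights=list(dist.values()))[0]
        if ch == "$":
            break
        out.append(ch)
    return "".join(out)


sample = generate()
```

The Stack-RNN plays the same role as `p_next` here, but conditions on the full prefix via its recurrent state and external stack memory rather than on a single previous character.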
Fig. 2
Fig. 2. A sample of molecules produced by the generative model.
Fig. 3
Fig. 3. Performance of the generative model G, with and without stack-augmented memory.
(A) Internal diversity of generated libraries. (B) Similarity of the generated libraries to the training data set from the ChEMBL database.
Fig. 4
Fig. 4. Property distributions for RL-optimized versus baseline generator model.
(A) Melting temperature. (B) JAK2 inhibition. (C) Partition coefficient. (D) Number of benzene rings. (E) Number of substituents.
Fig. 5
Fig. 5. Evolution of generated structures as chemical substructure reward increases.
(A) Reward proportional to the total number of small group substituents. (B) Reward proportional to the number of benzene rings.
Fig. 6
Fig. 6. Examples of Stack-RNN cells with interpretable gate activations.
Color coding shows GRU cell activations under the hyperbolic tangent (tanh) activation function: dark blue corresponds to an activation value of −1, red to a value of 1, and values between −1 and 1 are colored using a cool-warm color map.
Fig. 7
Fig. 7. Clustering of generated molecules by t-SNE.
Molecules are colored on the basis of the properties predicted by the predictive model P, with values shown by the color bar on the right. (A and C) Examples of generated molecules randomly picked from matches with the ZINC database, together with property values predicted by the predictive model P. (A) Partition coefficient, logP. (B) Melting temperature, Tm (°C); examples show generated molecules with the lowest and highest predicted Tm. (C) JAK2 inhibition, predicted pIC50.



