Review
Nat Rev Drug Discov. 2019 Jun;18(6):463-477.
doi: 10.1038/s41573-019-0024-5.

Applications of machine learning in drug discovery and development

Jessica Vamathevan et al. Nat Rev Drug Discov. 2019 Jun.

Abstract

Drug discovery and development pipelines are long, complex and depend on numerous factors. Machine learning (ML) approaches provide a set of tools that can improve discovery and decision making for well-specified questions with abundant, high-quality data. Opportunities to apply ML occur in all stages of drug discovery. Examples include target validation, identification of prognostic biomarkers and analysis of digital pathology data in clinical trials. Applications have ranged in context and methodology, with some approaches yielding accurate predictions and insights. The challenges of applying ML lie primarily with the lack of interpretability and repeatability of ML-generated results, which may limit their application. In all areas, systematic and comprehensive high-dimensional data still need to be generated. With ongoing efforts to tackle these issues, as well as increasing awareness of the factors needed to validate ML approaches, the application of ML can promote data-driven decision making and has the potential to speed up the process and reduce failure rates in drug discovery and development.

Figures

Fig. 1 | Machine learning applications in the drug discovery pipeline and their required data characteristics.
Several successful applications of machine learning in various stages of the drug development pipeline in pharmaceutical companies have been published. However, within each data domain, there are still challenges related to the standard of data quality and data quantity needed to capitalize on the full potential of these methods for discovery. ADME, absorption, distribution, metabolism and excretion.
Fig. 2 | Machine learning tools and their drug discovery applications.
This figure gives an overview of the machine learning techniques that have been used to answer the drug discovery questions covered in this Review. A range of supervised learning techniques (regression and classifier methods) are used to answer questions that require prediction of data categories or continuous variables, whereas unsupervised techniques are used to develop models that enable clustering of the data. ADME, absorption, distribution, metabolism and excretion; CNN, convolutional neural network; CT, computed tomography; DAEN, deep autoencoder neural network; DNN, deep neural network; GAN, generative adversarial network; MRI, magnetic resonance imaging; NLP, natural language processing; PK, pharmacokinetic; RNAi, RNA interference; RNN, recurrent neural network; SVM, support vector machine; SVR, support vector regression.
Fig. 3 | The challenges of compound structure representation in machine learning models.
Chemical structures and their features can be represented in many ways, depending on the required application. Extended-connectivity fingerprints (ECFPs) contain information about topological characteristics of the molecule, which enables this information to be applied to tasks such as similarity searching and activity prediction. A Coulomb matrix encodes information about the nuclear charges of a molecule and their coordinates. The grid featurizer method incorporates structural features of both the ligand and the target protein as well as the intermolecular forces that contribute to binding affinity. Symmetry functions are another common encoding of atomic coordinate information, focusing on the distances between atom pairs and on the angles formed within triplets of atoms. The graph convolution method computes an initial feature vector and a neighbour list for each atom that summarizes the local chemical environment of an atom, including atom types, hybridization types and valence structures. Weave featurization calculates a feature vector for each pair of atoms in the molecule, including bond properties (if directly connected), graph distance and ring information, forming a feature matrix. Reproduced by permission of the Royal Society of Chemistry, Wu, Z. et al. MoleculeNet: a benchmark for molecular machine learning. Chem. Sci. 9, 513–530 (2018).
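As a rough illustration of one of these encodings, the sketch below builds a Coulomb matrix from nuclear charges and atomic coordinates. It assumes the commonly used definition (diagonal entries 0.5·Z_i^2.4, off-diagonal entries Z_i·Z_j/|R_i − R_j|) with coordinates in atomic units; the caption itself does not spell out the formula, and the function name `coulomb_matrix` and the example geometry are illustrative choices rather than anything taken from the Review.

```python
import math


def coulomb_matrix(charges, coords):
    """Return the Coulomb matrix as a list of lists.

    charges: nuclear charges Z_i, one per atom.
    coords:  (x, y, z) positions, one per atom, assumed to be in bohr.
    Assumed definition: C_ii = 0.5 * Z_i**2.4, C_ij = Z_i * Z_j / |R_i - R_j|.
    """
    n = len(charges)
    if len(coords) != n:
        raise ValueError("charges and coords must have the same length")

    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                # Diagonal term depends on the nuclear charge only.
                matrix[i][j] = 0.5 * charges[i] ** 2.4
            else:
                # Off-diagonal term: Coulomb repulsion between nuclei i and j.
                distance = math.dist(coords[i], coords[j])
                matrix[i][j] = charges[i] * charges[j] / distance
    return matrix


# Illustrative input: two hydrogen nuclei 1.4 bohr apart (assumed geometry).
h2 = coulomb_matrix([1, 1], [(0.0, 0.0, 0.0), (0.0, 0.0, 1.4)])
for row in h2:
    print(row)
```

The matrix is symmetric and depends only on interatomic distances, so it does not change when the molecule is translated or rotated. It does change when the atoms are reordered, which is why practical pipelines typically sort or otherwise canonicalize the rows before feeding it to a model.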
Fig. 4 | Utilizing predictive biomarkers to support drug discovery and development.
A drug sensitivity predictive model (yellow box) can be generated using machine learning approaches on preclinical data. The model could then be tested using data from early-stage clinical patient samples. Once validated, the model could be used for patient stratification and/or disease indication selection to support the clinical development of a drug, as well as to infer its mechanism of action. EN, elastic net; IHC, immunohistochemistry; MOA, mechanism of action; RF, random forest; SVM, support vector machine.
Fig. 5 | Computational pathology tasks for machine learning applications.
Deep learning frameworks can replace traditional handcrafted features in several basic pathology image-recognition tasks (such as segmentation of nuclei, epithelia or tubules, lymphocyte detection, mitosis detection or classification of tumours) using image segmentation (yellow background), detection of specific features (blue background) or detection of a set of features used for classification (green background). Recognition is based on the task-specific features shown in the pink regions and can lead to more accurate prognosis or prediction of disease.
