J Cheminform, 3 (1), 11

Interpreting Linear Support Vector Machine Models With Heat Map Molecule Coloring

Lars Rosenbaum et al. J Cheminform.

Abstract

Background: Model-based virtual screening plays an important role in the early stages of drug discovery. The outcomes of high-throughput screens are a valuable source of data from which machine learning algorithms can infer such models. Besides strong performance, interpretability is a desirable property of a machine learning model, because it can guide the optimization of a compound in later drug discovery stages. Linear support vector machines have been shown to perform convincingly on large-scale data sets. The goal of this study is to present a heat map molecule coloring technique for interpreting linear support vector machine models. Based on the weights of a linear model, the visualization approach colors each atom and bond of a compound according to its importance for activity.

Results: We evaluated our approach on a toxicity data set, a chromosome aberration data set, and the maximum unbiased validation (MUV) data sets. The experiments show that our method sensibly visualizes structure-property and structure-activity relationships of a linear support vector machine model. Coloring the ligands in the binding pocket of several crystal structures of one MUV target indicates that our approach helps to determine the correct ligand orientation in the binding pocket. Additionally, the heat map coloring enables the identification of substructures that are important for the binding of an inhibitor.

Conclusions: In combination with heat map coloring, linear support vector machine models can help guide the modification of a compound in later stages of drug discovery. In particular, substructures identified as important by our method might be a starting point for the optimization of a lead compound. The heat map coloring should be considered complementary to structure-based modeling approaches. As such, it helps to give a better understanding of the binding mode of an inhibitor.
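
The rule stated in the caption of Figure 2 (a pattern's weight is added to the score of every bond the pattern contains) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `bond_scores`, the dictionary layout, the pattern names, and the toy weights are all assumptions made for illustration.

```python
from collections import defaultdict

def bond_scores(patterns, weights):
    """Sum pattern weights onto the bonds each pattern contains.

    patterns: dict mapping a pattern id to the bond ids it covers
              (hypothetical layout, assumed for this sketch)
    weights:  dict mapping a pattern id to its linear model weight w_j
    """
    scores = defaultdict(float)
    for pattern_id, bonds in patterns.items():
        w = weights.get(pattern_id, 0.0)  # patterns without a weight contribute nothing
        for bond in set(bonds):           # count each bond once per pattern
            scores[bond] += w
    return dict(scores)

# Toy molecule with three bonds (ids 0, 1, 2) and three overlapping patterns.
patterns = {"p1": [0, 1], "p2": [1, 2], "p3": [2]}
weights = {"p1": 0.5, "p2": -0.25, "p3": 1.0}
scores = bond_scores(patterns, weights)
```

In this toy case, bond 1 is covered by both `p1` and `p2`, so its score is the sum of a positive and a negative weight.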

Figures

Figure 1
Illustration of the ECFP. Each circular substructure around a center atom represents an ECFP feature. The circular substructure is grown in each iteration at the attachment points A. Any atom can be matched on an attachment point. Aromatic bonds are marked by a dashed line.
Figure 2
Illustration of the pattern-to-bond weight mapping. The weight w_j of a pattern p_j is added to the score s_b of a bond b if the bond is contained in the pattern p_j. Attachment points A can be mapped on any atom. Aromatic bonds are marked by a dashed line.
Figure 3
Kazius data set example compounds. A heat map coloring of the non-toxic compound 1028-11-1 (CA) and the toxic compound 146795-38-2 (CB). Both compounds were predicted correctly. The color gradient ranges from green (toxic) to red (non-toxic). Both the single molecule normalization (A, C) and the full data set normalization (B, D) were applied. Compound CA contains a correctly identified aromatic nitro toxicophore. However, the compound has a detoxifying sulfonamide as well, rendering the compound non-toxic. The sulfonamide and parts of the aromatic ring were identified as non-toxic. In compound CB, the aromatic nitro toxicophore was also identified as a toxicophore. Compound CB is toxic because the red chlorobenzene substructure is not a detoxifying substructure.
Figure 4
Orientation of different protein kinase A ligands. Binding orientation of the ligands of PDB entries 3MVJ (LA), 3DNE (LB), and 3DND (LC). Compounds within the binding pocket were colored with the single molecule normalization, the compounds above with the full data set normalization. The color gradient ranges from green (important for activity) to red (unimportant for activity or even decreasing it). The binding pocket is indicated as an exclusion surface. Substructures located at similar positions in the binding pocket were colored similarly by the heat map coloring approach.
Figure 5
Aligned binding pockets of LA and LB. The binding pockets of LA and LB were aligned and the ligands were colored with the single molecule normalization. The color gradient ranges from green (important for activity) to red (unimportant for activity or even decreasing it). The green protein residues belong to LA and the orange ones to LB. The binding pocket is indicated as an exclusion surface. H-bonds detected by Schrödinger are indicated by a dashed line. Two similar basic aromatic rings located deep in the binding pocket are identified as important for activity.
Figure 6
Clavatadine A. Clavatadine A colored according to a model trained on MUV846. The color gradient ranges from green (important for activity) to red (unimportant for activity). The current molecule normalization (A) and the full data set normalization (B) were both applied. The carbamate substructure is marked as important for activity.
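
The captions above distinguish a single molecule normalization from a full data set normalization but do not spell out the formulas. The sketch below shows one plausible reading, assumed here purely for illustration: scores are divided by the largest absolute score, taken either within the molecule or across the whole data set, and then mapped linearly onto a red-to-green gradient. The helper names `normalize` and `to_rgb`, the scaling rule, and the example numbers are hypothetical and are not taken from the paper.

```python
def normalize(scores, scale=None):
    """Scale scores into [-1, 1] by a reference magnitude.

    With scale=None the largest absolute score of this molecule is used
    (assumed reading of "single molecule normalization"); passing a
    data-set-wide maximum mimics the "full data set normalization".
    """
    if scale is None:
        scale = max((abs(s) for s in scores.values()), default=0.0)
    if scale == 0:
        return {key: 0.0 for key in scores}
    # Clamp so that scores beyond the reference still map to a valid color.
    return {key: max(-1.0, min(1.0, s / scale)) for key, s in scores.items()}

def to_rgb(value):
    """Map -1 to red and +1 to green, blending linearly in between."""
    t = (value + 1.0) / 2.0
    return (round(255 * (1 - t)), round(255 * t), 0)

# Hypothetical bond scores for one molecule and an assumed data set maximum.
molecule_scores = {0: 0.5, 1: -0.25}
single = normalize(molecule_scores)           # scaled by this molecule's maximum
full = normalize(molecule_scores, scale=2.0)  # scaled by the data set maximum
colors = {bond: to_rgb(v) for bond, v in single.items()}
```

Under this assumption, the single molecule variant always uses the full color range within a compound, while the full data set variant keeps colors comparable between compounds at the cost of paler colorings for molecules with small scores.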
