Cell Syst. 2016 Sep 28;3(3):278-286.e4.
doi: 10.1016/j.cels.2016.07.001. Epub 2016 Aug 18.

DNA Shape Features Improve Transcription Factor Binding Site Predictions In Vivo

Free PMC article

Anthony Mathelier et al. Cell Syst. 2016.

Abstract

Interactions of transcription factors (TFs) with DNA comprise a complex interplay between base-specific amino acid contacts and readout of DNA structure. Recent studies have highlighted the complementarity of DNA sequence and shape in modeling TF binding in vitro. Here, we have provided a comprehensive evaluation of in vivo datasets to assess the predictive power obtained by augmenting various DNA sequence-based models of TF binding sites (TFBSs) with DNA shape features (helix twist, minor groove width, propeller twist, and roll). Results from 400 human ChIP-seq datasets for 76 TFs show that combining DNA shape features with position-specific scoring matrix (PSSM) scores improves TFBS predictions. Improvement has also been observed using TF flexible models and a machine-learning approach using a binary encoding of nucleotides in lieu of PSSMs. Incorporating DNA shape information is most beneficial for E2F and MADS-domain TF families. Our findings indicate that incorporating DNA sequence and shape information benefits the modeling of TF binding under complex in vivo conditions.

Figures

Figure 1. Feature Vectors of PSSM + DNA Shape and TFFM + DNA Shape Classifiers
Feature vectors combine sequence scores with respect to the TF binding profile (PSSM or TFFM), the normalized values of four DNA shape features (MGW, ProT, Roll, and HelT), and their normalized product terms at adjacent positions as second-order shape features (Zhou et al., 2015). In 4-bits + DNA shape classifiers, TF binding profile score is replaced by binary 4-bits encoding of the corresponding sequence (Zhou et al., 2015).
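The layout of such a feature vector can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the min-max normalization bounds, and the A, C, G, T bit order are assumptions made here, and the second-order terms are computed as products of already-normalized values, which simplifies the "normalized product terms" named in the caption.

```python
from typing import Dict, List, Sequence

# Assumed one-hot ("4-bits") encoding; the bit order A, C, G, T is illustrative.
FOUR_BITS: Dict[str, List[int]] = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
}


def four_bits(sequence: str) -> List[int]:
    """Flatten a DNA sequence into 4 bits per nucleotide."""
    return [bit for base in sequence.upper() for bit in FOUR_BITS[base]]


def min_max(values: Sequence[float], lo: float, hi: float) -> List[float]:
    """Scale values to [0, 1] using an assumed feature-wide minimum and maximum."""
    return [(v - lo) / (hi - lo) for v in values]


def second_order(values: Sequence[float]) -> List[float]:
    """Product terms of one shape feature at adjacent positions."""
    return [a * b for a, b in zip(values, values[1:])]


def feature_vector(
    sequence_part: Sequence[float],
    shape_tracks: Sequence[Sequence[float]],
) -> List[float]:
    """Concatenate the sequence part with first- and second-order shape terms.

    sequence_part is either a single PSSM/TFFM score wrapped in a list
    or the 4-bits encoding of the site; shape_tracks holds one normalized
    track per shape feature (for example MGW, ProT, Roll, HelT).
    """
    vector = list(sequence_part)
    for track in shape_tracks:
        vector.extend(track)
        vector.extend(second_order(track))
    return vector
```

Passing `[pssm_score]` as `sequence_part` gives the PSSM + DNA shape layout, while passing `four_bits(site)` gives the 4-bits + DNA shape layout.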
Figure 2. Effect of DNA Shape Features on TFBS Predictions in ChIP-seq Data
(A) AUPRC values obtained for 400 ENCODE human ChIP-seq datasets using PSSM scores (x-axis) or classifiers combining PSSM scores and DNA shape features (y-axis). Dashed line represents equal AUPRC values obtained with both methods. (B) Median AUPRC values over all ChIP-seq datasets associated with each TF (one point per TF), obtained using PSSM scores (x-axis) or PSSM + DNA shape classifiers (y-axis). Dashed line represents equal AUPRC values obtained with both methods. (C) Predictive power improvement obtained when considering DNA shape features (y-axis) as the difference between AUPRC values obtained with PSSM + DNA shape classifiers and PSSM scores. Larger difference corresponds to stronger improvement. Datasets (x-axis) are ranked by increasing difference values. (D) For each TF family (y-axis), an associated dataset is represented at the corresponding x-coordinate where the dataset appears in B. Names of TF families are given on y-axis, with significant Mann–Whitney U test p-values in parentheses (not corrected for multiple hypothesis testing). See also Data S2–S4 and S7.
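The comparison in panels A and C comes down to one AUPRC value per dataset and method, followed by a per-dataset difference. The sketch below uses average precision as a step-wise estimate of the area under the precision-recall curve. That estimator, the function names, and the dictionary-based inputs are assumptions for illustration, since the caption does not state how AUPRC was computed.

```python
from typing import Dict, List, Sequence, Tuple


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Step-wise area under the precision-recall curve.

    labels are 1 for bound (foreground) and 0 for background sequences.
    Tied scores are not treated specially, which is a simplification.
    """
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("at least one positive label is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_positives = 0
    area = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_positives += 1
            # Precision at the rank where this positive is recovered.
            area += true_positives / rank
    return area / total_positives


def rank_by_improvement(
    baseline: Dict[str, float], augmented: Dict[str, float]
) -> List[Tuple[str, float]]:
    """Return (dataset, AUPRC difference) pairs sorted by increasing difference."""
    deltas = [(name, augmented[name] - baseline[name]) for name in baseline]
    return sorted(deltas, key=lambda item: item[1])
```

A positive difference means the shape-augmented classifier scored higher than the PSSM alone on that dataset, which corresponds to a point above the dashed diagonal in panel A.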
Figure 3. Predictive Power of PSSM and 4-bit Approaches for TFBSs in ChIP-seq Regions
AUPRC values obtained for 400 human ENCODE ChIP-seq datasets, obtained by considering (A) PSSM scores (x-axis) and 4-bits classifiers (y-axis), or (B) PSSM + DNA shape (x-axis) and 4-bits + DNA shape (y-axis) classifiers. See also Data S5.
Figure 4. Predictive Power of DNA Shape Features at TFBS Flanking Regions
(A) AUPRC values obtained for 400 human ENCODE ChIP-seq datasets when using classifiers combining PSSM scores and DNA shape features at core TFBSs (x-axis), or classifiers combining PSSM scores and DNA shape features at both core TFBSs and surrounding 15 bp on each side (y-axis). Dashed line represents equal AUPRC values for both methods. (B) AUPRC value differences (y-axis) between flank-augmented classifiers and PSSM + DNA shape classifiers (x-axis). Datasets (x-axis) are ranked by increasing difference values. See also Data S6.
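Extending a core site by 15 bp on each side is a coordinate calculation, sketched below. The 0-based half-open coordinates, the function name, and the choice to skip sites whose flanks run past the region boundary are assumptions; the caption does not say how sites near region edges were handled.

```python
from typing import Optional, Tuple


def window_with_flanks(
    start: int, end: int, region_length: int, flank: int = 15
) -> Optional[Tuple[int, int]]:
    """Extend a core site [start, end) by `flank` bp on each side.

    Coordinates are 0-based and half-open within a region of
    `region_length` bp. Returns None when the extended window does not
    fit inside the region (an assumed policy, not one from the source).
    """
    extended_start = start - flank
    extended_end = end + flank
    if extended_start < 0 or extended_end > region_length:
        return None
    return extended_start, extended_end
```

For a 10 bp core site, the extended window spans 40 bp, so shape features would be collected over four times as many positions as for the core alone.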
Figure 5. Use of a Single DNA Shape Feature Category for E2F and MADS-box TFBS Recognition in ChIP-seq
AUPRC values (y-axis) for E2F (A) and MADS-domain (B) TF datasets (x-axis), obtained by using all four first-order DNA shape features or a single feature category along with sequence features in the PSSM + DNA shape classifiers. See also Figure S3.
Figure 6. Structural Analysis of E2F and MADS-domain TFs in Complex with DNA
(A) Co-crystal structure (PDB ID 1CF7) of E2F4 (blue) and DP2 (magenta) forming a heterodimer that binds to core motif GCGC (red). (B) Detailed view of hydrogen bonds between arginines and guanines in major groove. (C) Co-crystal structure (PDB ID 1SRS) of MADS-domain SRF homodimer in complex with core motif CCTAATTAGG. (D) Detailed view of hydrogen bonds between lysine and guanines in major groove. (E) ProT in bound (blue; calculated from X-ray structure) and unbound (red; predicted by DNAshape) target site.
Figure 7. Feature Importance Measures for MADS-box Recognition in ChIP-seq Datasets
(A) AUPRC improvements (y-axis) for human and plant MADS-domain TF ChIP-seq datasets (x-axis provides TF names) when using PSSM + DNA shape features vs. PSSM scores. (B–C) Weblogos derived from JASPAR TF binding profile associated with (B) SRF (MA0083.2) and (C) MEF2C (MA0497.1) TFs are provided in top panels. Heat maps illustrating average feature importance values (y-axis) at each position (x-axis) of TFBSs in the classifiers trained for 10-fold CV analysis of ChIP-seq datasets are provided in bottom panels. Only feature importance measures associated with first-order DNA shape features are considered. Color scale for heat map is given on the right of the heat map. Red boxes highlight core MADS-box motif (CCW6GG). Blue boxes highlight edges of motifs. See also Figures S1, S2, and S4–S6.
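The heat maps in panels B and C show feature importance values averaged over the classifiers trained in 10-fold CV. A minimal sketch of that averaging step follows. The nested-list layout (folds, then shape features, then TFBS positions) and the function name are assumptions, and the code does not show how the importance values themselves were obtained from the classifiers.

```python
from typing import List, Sequence


def mean_importance(
    fold_importances: Sequence[Sequence[Sequence[float]]],
) -> List[List[float]]:
    """Average importance values across cross-validation folds.

    fold_importances[k][f][p] is the importance of shape feature f at
    TFBS position p in the classifier trained on fold k. The result has
    one row per shape feature and one column per position.
    """
    n_folds = len(fold_importances)
    n_features = len(fold_importances[0])
    n_positions = len(fold_importances[0][0])
    return [
        [
            sum(fold[f][p] for fold in fold_importances) / n_folds
            for p in range(n_positions)
        ]
        for f in range(n_features)
    ]
```

Each row of the result corresponds to one row of a heat map, with columns aligned to the positions of the motif logo shown above it.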
