BMC Genomics. 2014 Jun 13;15(1):472.
doi: 10.1186/1471-2164-15-472.

Improving Analysis of Transcription Factor Binding Sites Within ChIP-Seq Data Based on Topological Motif Enrichment

Free PMC article

Rebecca Worsley Hunt et al. BMC Genomics.

Abstract

Background: Chromatin immunoprecipitation (ChIP) coupled to high-throughput sequencing (ChIP-Seq) techniques can reveal DNA regions bound by transcription factors (TF). Analysis of the ChIP-Seq regions is now a central component in gene regulation studies. The need remains strong for methods to improve the interpretation of ChIP-Seq data and the study of specific TF binding sites (TFBS).

Results: We introduce a set of methods to improve the interpretation of ChIP-Seq data, including the inference of mediating TFs based on TFBS motif over-representation analysis and the subsequent study of spatial distribution of TFBSs. TFBS over-representation analysis applied to ChIP-Seq data is used to detect which TFBSs arise more frequently than expected by chance. Visualization of over-representation analysis results with new composition-bias plots reveals systematic bias in over-representation scores. We introduce the BiasAway background generating software to resolve the problem. A heuristic procedure based on topological motif enrichment relative to the ChIP-Seq peaks' local maxima highlights peaks likely to be directly bound by a TF of interest. The results suggest that on average two-thirds of a ChIP-Seq dataset's peaks are bound by the ChIP'd TF; the origin of the remaining peaks is undetermined. Additional visualization methods allow for the study of both inter-TFBS spatial relationships and motif-flanking sequence properties, as demonstrated in case studies for TBP and ZNF143/THAP11.

Conclusions: Topological properties of TFBS within ChIP-Seq datasets can be harnessed to better interpret regulatory sequences. Using GC-content-corrected TFBS over-representation analysis, combined with visualization techniques and analysis of the topological distribution of TFBS, we can distinguish peaks likely to be directly bound by a TF. The new methods will empower researchers to explore gene regulation and TF binding.

Figures

Figure 1
Composition-bias plots reveal systematic TF PFM nucleotide content bias in motif over-representation analysis. Foreground data was obtained from an E2F1 ChIP-Seq study and processed using the oPOSSUM TFBS over-representation analysis software. Each plot presents a motif over-representation score (y-axis) relative to the GC content of the PFMs (x-axis). The over-representation scores reflect the difference between the frequency of motifs in the foreground compared to a background (the background differs between panels). The names of the 5 top ranked PWMs are displayed on the plot. The dotted line at over-representation score 100 is an arbitrarily placed visual point of reference. The sequence logos represent the binding models for E2F1 and E2F4 respectively. (a) Background composed of randomly selected mappable genome sequences. (b) Background generated, using BiasAway, with a GC composition matched to the ChIP-Seq sequences and drawn from the set of mappable sequences.
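The GC-matched background in panel (b) can be illustrated with a minimal sketch. This is not the BiasAway implementation: the function names, the 5% GC bin width, and the one-to-one sampling without replacement are assumptions made here for illustration only.

```python
import random
from collections import defaultdict


def gc_fraction(seq):
    """Fraction of G/C bases in a DNA sequence (case-insensitive)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def gc_matched_background(foreground, pool, bin_width=0.05, seed=0):
    """For each foreground sequence, draw one unused pool sequence from the same GC bin.

    Hypothetical helper: bin_width and the sampling scheme are assumptions,
    not parameters taken from the paper.
    """
    rng = random.Random(seed)
    bins = defaultdict(list)
    for seq in pool:
        bins[int(gc_fraction(seq) / bin_width)].append(seq)
    for candidates in bins.values():
        rng.shuffle(candidates)

    background = []
    for seq in foreground:
        candidates = bins.get(int(gc_fraction(seq) / bin_width), [])
        if candidates:  # foreground sequences with no GC-matched candidate are skipped
            background.append(candidates.pop())
    return background
```

In practice `pool` would be drawn from the mappable sequences mentioned in the caption; a foreground sequence with no matching candidate contributes nothing to the background in this sketch.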
Figure 2
Background sequence selection impacts motif over-representation analyses. (a) For each background, the fraction of the 43 analyses that reported the ChIP’d TF in the top 5 enriched PWMs from a particular background (x-axis) is plotted against the average skew of the over-representation results for each background’s 43 analyses. Skew is the negative slope of the line fitted to the over-representation scores versus PFM GC content (i.e. values as displayed in Figure 1). The ideal is to have a large x-axis value (vertical dashed line) and an average skew of zero (horizontal dashed line). (b) and (c) summarize the standard deviation (y-axis) and mean (x-axis) of the ‘non-outlier’ oPOSSUM over-representation scores for 10 backgrounds against each of 43 ChIP-Seq datasets, where panel (b) displays the average value for each background across the 43 datasets and panel (c) displays the individual value of 430 analyses. The ideal result would be situated at the origin (the intersection of the dashed lines). For all panels each of the 10 backgrounds tested is denoted as a single colour: light green circle – randomly chosen background from the dataset of mappable sequences, dark green cross – randomly chosen background from the dataset of DNase accessible sequences, orange circle – mononucleotide shuffled background, brown cross – mononucleotide shuffled background within a sliding window, black circle – dinucleotide shuffled background, gray cross – dinucleotide shuffled background within a sliding window, magenta triangle – 3rd order Markov model generated background sequences, blue circle – background selected from the mappable sequences dataset to match the GC composition of the target sequences, light blue cross – background selected from the mappable sequences dataset to match the distribution of GC composition in sliding windows of the target sequences, and red triangle – GC background from HOMER 2.
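The skew statistic defined in panel (a) can be sketched as follows. The caption states only that a line is fitted; ordinary least squares is assumed here, and the function name is hypothetical.

```python
def composition_skew(pfm_gc_contents, scores):
    """Negative slope of a least-squares line of over-representation score vs PFM GC content.

    Assumption: the fit is ordinary least squares; the caption does not name the method.
    """
    n = len(pfm_gc_contents)
    if n < 2 or n != len(scores):
        raise ValueError("need two or more paired values")
    mean_x = sum(pfm_gc_contents) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in pfm_gc_contents)
    if sxx == 0:
        raise ValueError("GC content values are all identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(pfm_gc_contents, scores))
    return -sxy / sxx
```

Under this definition, scores that rise with PFM GC content give a negative skew, and scores unrelated to GC content give a skew near zero, which is the ideal described in the caption.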
Figure 3
TFBS-landscape views reveal motif enrichment and are diverse in shape and density. A TFBS-landscape view consists of a plot (left) showing the location of the top scoring motif relative to the peakMax (x = 0) on the x-axis and the motif score on the y-axis; the right plot presents a 2 bp resolution density of motif distances to the peakMax (black) and 5 bp resolution for motifs with a motif score equal to or greater than 85 (green). All plots display some degree of enrichment at the peakMax and a lower limit on motif score enrichment. (a) C/EBPB motifs in a C/EBPB ChIP-Seq dataset. (b) C-MYC motifs are enriched at the peakMax of the C-MYC ChIP-Seq dataset, but many top-scoring motifs are randomly dispersed. (c) NFYA motifs in a NFYA ChIP-Seq dataset exhibit enrichment around the peakMax, with high scoring motifs distinct from the majority of scores in the background regions. (d) ZNF143 motifs in a ZNF143 ChIP-Seq dataset have low enrichment but some high scoring motifs are distinct from the background. (e) and (f) present JUN motif enrichment in two JUN datasets from different cell types with distinct background motif densities. (g) The REST motif is strongly enriched in a REST ChIP-Seq dataset across a large motif score range at the peakMax with a low density of background motifs. (h) MYOD motif enrichment at the peakMax of MYOD ChIP-Seq data. (i) HNF4A motif enriched proximal to the peakMax of HNF4A ChIP-Seq data. (j), (k) and (l) present motif enrichment for a TF that was not the ChIP’d target: (j) CTCF motifs are slightly enriched offset from the peakMax of H3K4me3 ChIP-Seq. (k) CTCF motifs are enriched offset from the peakMax in RAD21::cohesin ChIP-Seq data. (l) ELK4 motifs in a NELFE ChIP-Seq dataset show an enrichment offset from the peakMax.
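The two density curves of a TFBS-landscape view can be approximated with a short sketch. The input layout (one tuple of motif position, peakMax position and motif score per peak) and the function names are assumptions for illustration; the motif scanning step itself is not shown.

```python
from collections import Counter


def signed_distances(hits, min_score=None):
    """Distance of each peak's top-scoring motif to its peakMax (negative = upstream).

    hits: iterable of (motif_position, peak_max_position, motif_score); hypothetical layout.
    """
    return [motif - peak for motif, peak, score in hits
            if min_score is None or score >= min_score]


def binned_density(distances, bin_width):
    """Proportion of distances per bin; each bin is keyed by its lower edge."""
    if not distances:
        return {}
    counts = Counter((d // bin_width) * bin_width for d in distances)
    return {edge: count / len(distances) for edge, count in sorted(counts.items())}


hits = [(500, 500, 90.0), (501, 500, 80.0), (499, 500, 86.0),
        (504, 500, 70.0), (505, 500, 95.0)]
all_motifs = binned_density(signed_distances(hits), bin_width=2)             # black curve
high_scoring = binned_density(signed_distances(hits, min_score=85), bin_width=5)  # green curve
```

Plotting `all_motifs` and `high_scoring` against their bin edges gives curves analogous to the black and green lines described in the caption.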
Figure 4
Defining the TFBS zone of enrichment around the peakMax. (a) A visual depiction of the enrichment zone determined with a heuristic procedure, as described in methods. The x-axis presents the upper limit of each 5 bp bin, and the bins are the absolute distance of a motif from the peakMax. The y-axis shows the proportion of peaks from the dataset with a motif in a 5 bp bin. The horizontal green line is fit to the distal background bins, and the horizontal line in blue is the allowance line (see methods). The blue vertical dashed line indicates where the proportion of peaks in a non-background bin approaches the allowance line without falling below it – this line is the heuristic distance threshold for motif enrichment around the peakMax. (b) The NFYB sequence logo and the TFBS-landscape view for NFYB motifs in NFYB ChIP-Seq data. The heuristic enrichment zone is between the blue vertical dashed lines. The black vertical lines indicate the beginning of the distal background region (spanning 200-500 bp from the peakMax). (c) The width of the motif enrichment zone (y-axis) for human ChIP-Seq datasets (x-axis); multiple datasets for a TF were averaged to obtain one enrichment zone value per TF. Vertical bars are the average differences between all of the enrichment zone widths for a TF. The red horizontal line is the mean. (d) The proportion of peaks within the enrichment zone for a TF’s set of ChIP-Seq datasets were averaged. The x-axis provides, for each of 85 TFs, the mean proportion of peaks with a motif scoring above the motif score threshold and located within the zone of enrichment (mean 0.60, median 0.61).
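The procedure in panel (a) can be sketched roughly as below. The definition of the allowance line is given in the paper's methods, which are not reproduced here, so this sketch substitutes an assumed rule (background mean plus a multiple of the background standard deviation). The function name and the `allowance_sd` parameter are hypothetical.

```python
from statistics import mean, pstdev


def enrichment_zone(abs_distances, n_peaks, bin_width=5,
                    bg_start=200, bg_end=500, allowance_sd=2.0):
    """Return (distance_threshold, allowance) for motif enrichment around the peakMax.

    abs_distances: absolute motif-to-peakMax distances, one per peak with a motif.
    ASSUMPTION: allowance = mean + allowance_sd * SD of the distal background bins;
    the paper's actual allowance line may be defined differently.
    """
    counts = [0] * (bg_end // bin_width)
    for d in abs_distances:
        if 0 <= d < bg_end:
            counts[int(d // bin_width)] += 1
    proportions = [c / n_peaks for c in counts]

    first_bg_bin = bg_start // bin_width
    background = proportions[first_bg_bin:]  # distal bins, 200-500 bp by default
    allowance = mean(background) + allowance_sd * pstdev(background)

    threshold = 0
    for i, p in enumerate(proportions[:first_bg_bin]):
        if p < allowance:  # first bin to fall below the allowance line ends the zone
            break
        threshold = (i + 1) * bin_width  # upper limit of the last bin kept
    return threshold, allowance
```

The returned threshold plays the role of the blue vertical dashed line: peaks whose motif lies within that distance of the peakMax fall inside the enrichment zone.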
Figure 5
TFBS-bi-motif view for visualization of motif spatial arrangements. The left plot of a TFBS-bi-motif view presents the distance of the primary motif to the peakMax of each sequence on the y-axis, and the distance of a second motif relative to the primary motif on the x-axis. A band of enrichment at y = 0 indicates enrichment near the peakMax for the primary motif, while a diagonal band of enrichment (with a negative slope) indicates enrichment near the peakMax for the second motif. The diagonal limits of the plot arise from the uniform length of the sequences (here 1001 bp). The right plot is a histogram of the distances between the two motifs. The gap in both plots results from the exclusion of overlapping motifs. (a) ESRRB motifs in an ESRRB ChIP-Seq data set. The top-scoring ESRRB motif is the primary motif, and the second-best motif is the second motif. (b) NFYB motifs in a NFYB ChIP-Seq data set. The top-scoring NFYB motif is the primary motif, and the second-best motif is the second motif. (c) SRF ChIP-Seq dataset. The SRF top-scoring motif is the primary motif, and the ELK4 top-scoring motif is the second motif.
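The coordinates plotted in the left panel of a TFBS-bi-motif view can be computed as in the following sketch. Measuring distances between motif centres and the record layout are assumptions; the caption does not state which motif coordinate is used.

```python
def bi_motif_points(records):
    """Return (x, y) points: x = second motif relative to primary, y = primary relative to peakMax.

    records: iterable of (peak_max, (primary_start, primary_end), (second_start, second_end))
    with inclusive coordinates; hypothetical layout. Overlapping motif pairs are excluded,
    which produces the gap described in the caption.
    """
    points = []
    for peak_max, (p_start, p_end), (s_start, s_end) in records:
        if s_start <= p_end and p_start <= s_end:
            continue  # motifs overlap
        p_centre = (p_start + p_end) / 2
        s_centre = (s_start + s_end) / 2
        points.append((s_centre - p_centre, p_centre - peak_max))
    return points


records = [
    (500, (495, 505), (520, 530)),  # primary motif centred on the peakMax
    (500, (470, 480), (495, 505)),  # second motif centred on the peakMax
    (500, (495, 505), (500, 510)),  # overlapping pair, dropped
]
points = bi_motif_points(records)
```

The first record lands on the band at y = 0. The second satisfies x + y = 0, which is why enrichment of the second motif at the peakMax appears as a diagonal with negative slope.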
Figure 6
Dinucleotide-environment view plots the dinucleotide enrichment of the dataset around the motif of interest. The x-axis shows the dinucleotide offset from the centre of the motif, and the y-axis is the proportion of the dinucleotide in the ChIP-Seq sequences. The STAT1 motif is the high frequency pattern in the centre of the plot. The sequence logo for STAT1 is above the high frequency pattern. (a) The subset of STAT1 ChIP-Seq peaks containing a STAT1 motif in the enrichment zone with a motif score greater than or equal to 85. The magenta box highlights the enrichment of dinucleotides in the flanking regions of STAT1 motifs proximal to the peakMax. (b) The subset of STAT1 peaks with a motif outside the enrichment zone and a motif score of 85 or greater. The magenta box highlights the lack of dinucleotide enrichment in the flanking regions of motifs found distal to the peakMax.
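A dinucleotide-environment view amounts to counting, at each offset from the motif centre, the proportion of sequences carrying each dinucleotide. The sketch below assumes sequences are supplied with the index of their motif centre; the function name, the `flank` parameter and the handling of ambiguous bases are choices made here, not details from the paper.

```python
from collections import Counter, defaultdict


def dinucleotide_environment(aligned, flank=10):
    """Map offset -> {dinucleotide: proportion of sequences} around the motif centre.

    aligned: iterable of (sequence, motif_centre_index); hypothetical layout.
    Dinucleotides containing non-ACGT characters (e.g. N from repeat masking) are skipped.
    """
    counts = defaultdict(Counter)
    totals = Counter()
    for seq, centre in aligned:
        seq = seq.upper()
        for offset in range(-flank, flank + 1):
            i = centre + offset
            if i < 0 or i + 2 > len(seq):
                continue  # window runs off the sequence
            dinuc = seq[i:i + 2]
            if set(dinuc) <= set("ACGT"):
                counts[offset][dinuc] += 1
                totals[offset] += 1
    return {offset: {dinuc: c / totals[offset] for dinuc, c in per_offset.items()}
            for offset, per_offset in counts.items()}


env = dinucleotide_environment([("ACGTACGT", 4), ("TTGTACAA", 4)], flank=2)
```

Running this separately on peaks with a motif inside the enrichment zone and on peaks with a motif outside it would give the two profiles compared in panels (a) and (b).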
Figure 7
Case study of TBP in the mouse MEL cell-line. (a) TFBS-landscape view for TBP PWM on a TBP dataset. The top plot presents the top scoring motif distance to the peakMax (x-axis) and the motif relative score on the y-axis. The bottom plot presents a histogram of motif distances to the peakMax: black line – 2 bp resolution of the top scoring motif distance per peak; green line – 5 bp resolution of distances for the top scoring motifs with a score equal to or higher than 85. The sequence logo is for TBP. (b) TFBS-landscape view density plots for 15 PWMs are overlaid on a single plot, for visualization purposes. The black line is the enrichment of the TBP motif, the coloured lines are NR2F1, MYC::MAX, CTCF, GABPA, TAL1::TCF3, FOSL2, FOXD3, NRF1, MEF2A, AP1, SPI1, ZNF143_b, E2F1, and NFYB motifs, as noted on the plot. (c) The Dinucleotide-environment view around the TBP motif. The x-axis is the location of the dinucleotide with respect to the TBP motif, and the y-axis is the fraction of sequences with the dinucleotide at a given position. The coloured lines each represent one of 16 dinucleotides, as specified in the plot legend. The magenta box highlights the dinucleotide enrichment in the regions flanking the TBP motif.
Figure 8
Case study of ZNF143 DNA binding preferences. The sequence logo presents the binding site characteristics of ZNF143. (a) Dinucleotide-environment view of ZNF143 ChIP-Seq repeat-masked regions aligned on motifs with a motif score of 85 or greater. The x-axis is the nucleotide position and the y-axis is the frequency of the dinucleotide. The coloured lines each represent one of 16 dinucleotides, as specified in the plot legend. The vertical magenta lines frame the positions of the sequence logo. (b) Dinucleotide-environment view of ZNF143 canonical motifs with a motif score of >85. The x-axis is the nucleotide position and the y-axis is the frequency of the dinucleotide. The coloured lines each represent one of 16 dinucleotides, as specified in the plot legend. The orange horizontal line above the plot indicates the overlapping THAP11 binding profile.
