Nucleic Acids Res. 2017 Jul 27;45(13):e119.
doi: 10.1093/nar/gkx314.

RSAT Matrix-Clustering: Dynamic Exploration and Redundancy Reduction of Transcription Factor Binding Motif Collections

Free PMC article

Jaime Abraham Castro-Mondragon et al. Nucleic Acids Res. 2017.

Abstract

Transcription factor (TF) databases contain multitudes of binding motifs (TFBMs) from various sources, from which non-redundant collections are derived by manual curation. The advent of high-throughput methods stimulated the production of novel collections with increasing numbers of motifs. Meta-databases, built by merging these collections, contain redundant versions, because available tools are not suited to automatically identify and explore biologically relevant clusters among thousands of motifs. Motif discovery from genome-scale data sets (e.g. ChIP-seq) also produces redundant motifs, hampering the interpretation of results. We present matrix-clustering, a versatile tool that clusters similar TFBMs into multiple trees, and automatically creates non-redundant TFBM collections. Two features unique to matrix-clustering are its dynamic visualisation of aligned TFBMs and its capability to treat multiple collections from various sources simultaneously. We demonstrate that matrix-clustering considerably simplifies the interpretation of combined results from multiple motif discovery tools, and highlights biologically relevant variations of similar motifs. We also ran a large-scale application to cluster ∼11 000 motifs from 24 entire databases, showing that matrix-clustering correctly groups motifs belonging to the same TF families, and drastically reduces motif redundancy. matrix-clustering is integrated within the RSAT suite (http://rsat.eu/), accessible through a user-friendly web interface or through the command line for integration in pipelines.

Figures

Figure 1.
Schematic flow chart of the matrix-clustering algorithm. The program takes as input one (or several) collection(s) of PSSMs, and calculates the motif similarity using several metrics. One of these metrics is used to group the motifs with hierarchical clustering. A threshold consisting of a combination of metrics is used to partition the global tree into a set of cluster-specific trees. Each resulting tree then serves as a guide to progressively align the PSSMs. The PSSMs at the root of each tree are exported as non-redundant motifs. The trees can be collapsed or expanded at each node dynamically on the resulting web page.
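The flow described in this caption (pairwise similarity, hierarchical clustering, threshold-based partition) can be illustrated with a minimal Python sketch. It is not the RSAT implementation. Several simplifications are assumptions made here for illustration: motifs are toy frequency matrices of equal width that are already aligned, similarity is a plain Pearson correlation over the flattened matrices, linkage is average linkage, and a single threshold replaces the combination of metrics mentioned in the caption. The names `pearson`, `flatten` and `cluster_motifs` are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(a, b):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def flatten(pssm):
    """Concatenate the columns of a matrix (list of [A, C, G, T] columns)."""
    return [value for column in pssm for value in column]


def cluster_motifs(motifs, threshold):
    """Average-linkage agglomerative clustering (toy version).

    Merging stops as soon as the most similar pair of clusters falls
    below `threshold`; this stands in for cutting the full tree.
    """
    names = list(motifs)
    sim = {
        frozenset((a, b)): pearson(flatten(motifs[a]), flatten(motifs[b]))
        for a, b in combinations(names, 2)
    }
    clusters = [[name] for name in names]

    def linkage(c1, c2):
        # Mean pairwise similarity between the members of two clusters.
        values = [sim[frozenset((a, b))] for a in c1 for b in c2]
        return sum(values) / len(values)

    while len(clusters) > 1:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        if linkage(clusters[i], clusters[j]) < threshold:
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


# Toy motifs (invented values): two ATGA-like variants and one CCGG-like motif.
motifs = {
    "motif_a": [[0.85, 0.05, 0.05, 0.05], [0.05, 0.05, 0.05, 0.85],
                [0.05, 0.05, 0.85, 0.05], [0.85, 0.05, 0.05, 0.05]],
    "motif_b": [[0.70, 0.10, 0.10, 0.10], [0.10, 0.10, 0.30, 0.50],
                [0.10, 0.10, 0.70, 0.10], [0.70, 0.10, 0.10, 0.10]],
    "motif_c": [[0.05, 0.85, 0.05, 0.05], [0.05, 0.85, 0.05, 0.05],
                [0.05, 0.05, 0.85, 0.05], [0.05, 0.05, 0.85, 0.05]],
}

print(cluster_motifs(motifs, threshold=0.7))
```

With a threshold of 0.7, the two ATGA-like variants end up in one cluster and the CCGG-like motif stays alone. The alignment step and the export of root motifs described in the caption are not covered by this sketch.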
Figure 2.
Clustering of PSSMs discovered in the Oct4 ChIP-seq peaks using several motif discovery tools. The Oct4 peaks identified by Chen et al. (41) were submitted to three de novo motif discovery programs: RSAT peak-motifs, MEME-ChIP and Homer. All discovered PSSMs were clustered simultaneously by matrix-clustering. (A) Hierarchical tree corresponding to cluster_1 (37 motifs), where different Oct motif variants and Sox2 motifs are highlighted with different coloured boxes. The leaves are annotated with the name of the submitted motif and the name of its collection (RSAT, MEME, HOMER). (B) Reduced tree showing six non-redundant motifs, obtained after manual curation of cluster_1 by collapsing the branches. (C) Annotation of the six non-redundant variants (‘branch PSSMs’) based on alignments to reference motifs (see main text). When available in databases (JASPAR or HOCOMOCO), the ID of the reference motif is indicated. Otherwise, it is replaced by the PMID of the publication mentioning the motif. (D) Heatmap summarizing the number of motifs from each collection found in each cluster. (E) Heatmap of the cross-coverage between each pair of collections.
Figure 3.
Clustering of 12 sets of PSSMs discovered in mouse ESC TF ChIP-seq peaks. (A) Matrix showing the cluster composition by motif collection. Examples of motifs found in one or several collections (and their corresponding logos) are indicated with green and blue arrows, respectively. (B) Heatmap showing the cross-coverage between the 12 motif collections corresponding to the ESC TF peak-sets.
Figure 4.
Clustering of complete Insect and Human motif databases. (A) Heatmap representing the similarity (Ncor) between all 133 PSSMs of JASPAR Insects. The 35 clusters found are indicated with a colored bar above the heatmap. The black square emphasizes the large cluster (almost half of the PSSMs) containing the very similar Homeodomain motifs. (B) The 70 Homeodomain motifs were manually reduced to ten motifs by collapsing the tree branches. The collapsed tree is displayed along with the corresponding aligned branch motifs. (C) Heatmap representing the similarity (Ncor) between all 641 PSSMs of HOCOMOCO Human. (D) Distribution of the clusters formed from HOCOMOCO Human across TF families. The bar plot indicates that most clusters are composed of a single TF family. The pie chart illustrates the reasons for observing multiple TF families in a single cluster. (E) Scatterplot showing the number of members of each TF family as a function of the number of clusters it covers. The names of the families with more than 20 members are shown. (F) Scatterplot showing the trade-off between sensitivity and specificity when clustering PSSMs from the same family with either matrix-clustering or STAMP, using different parameters to compute similarities between each pair of input matrices, build the trees and define the clusters. For matrix-clustering, the curves denote a series of tests performed with different threshold values on the same similarity metric. For STAMP, the number of clusters is defined automatically. Dot sizes are proportional to the Adjusted Rand Index (ARI). The ideal clustering would be in the top-right corner.
Figure 5.
Cross-coverage of public motif databases. Several full public collections were merged and clustered, separately by taxa. The heatmaps of the cross-coverage between each pair of collections are plotted for (A) seven insect collections, (B) five plant databases and (C) twelve vertebrate databases. Note that the heatmaps are not symmetrical because the numbers of motifs in the different databases differ. (D) Venn diagrams showing the asymmetry of cross-coverage between two databases with different sizes.
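The asymmetry noted in this caption can be made concrete with a small Python sketch. The definition used here is an assumption for illustration, not taken from the text above: the coverage of collection X by collection Y is the fraction of X's motifs that fall in a cluster containing at least one motif from Y. The function name `cross_coverage`, the collection names and the toy clusters are hypothetical.

```python
def cross_coverage(clusters, covered, by):
    """Fraction of motifs from `covered` sharing a cluster with a motif from `by`.

    `clusters` is a list of clusters; each cluster is a list of
    (collection_name, motif_name) pairs.
    """
    total = 0
    shared = 0
    for members in clusters:
        n_covered = sum(1 for collection, _ in members if collection == covered)
        total += n_covered
        if any(collection == by for collection, _ in members):
            shared += n_covered
    return shared / total if total else 0.0


# Toy result: a small collection (2 motifs) clustered with a large one (6 motifs).
clusters = [
    [("small", "s1"), ("large", "l1"), ("large", "l2")],
    [("small", "s2"), ("large", "l3")],
    [("large", "l4")],
    [("large", "l5")],
    [("large", "l6")],
]

print(cross_coverage(clusters, covered="small", by="large"))  # 1.0
print(cross_coverage(clusters, covered="large", by="small"))  # 0.5
```

In this toy case every motif of the small collection has a counterpart in the large one, but only half of the large collection is matched by the small one, so the two entries of the heatmap differ.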
Figure 6.
Information content and predictive power of the unclustered and clustered motifs. (A) Clustering of five HOCOMOCO human motifs bound by NF-kappaB-related factors. The blue numbers at the tree branches indicate the information content of each PSSM. The number of sites used to build each matrix is indicated in parentheses. (B) Logos of the merged motif at each level of the tree. (C) Enrichment of predicted NF-kappaB binding sites for original and merged (branch + root) motifs in ChIP-seq peaks for RelA in a Hodgkin lymphoma cell line (GEO entry GSM1556331) (64). (D) Distribution of IC relative differences measured during the clustering of 519 human TFBMs from HOCOMOCO.
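To show why merging motifs can change the information content (IC) reported at the tree branches, here is a minimal Python sketch. It rests on assumptions not stated in the caption: IC is computed in bits per column as the sum of f·log2(f/p) with a uniform background p = 0.25 and no pseudocounts, and two pre-aligned matrices of equal width are merged by summing their counts. The tool itself may use different priors, pseudocounts or logarithm base. The names `information_content` and `merge_counts` and the toy matrices are hypothetical.

```python
from math import log2


def information_content(counts, background=(0.25, 0.25, 0.25, 0.25)):
    """Total IC in bits of a count matrix given as a list of [A, C, G, T] columns."""
    total_ic = 0.0
    for column in counts:
        n = sum(column)
        for count, prior in zip(column, background):
            if count > 0:  # a zero frequency contributes nothing to the sum
                freq = count / n
                total_ic += freq * log2(freq / prior)
    return total_ic


def merge_counts(a, b):
    """Merge two aligned, equal-width count matrices by summing their counts."""
    return [[x + y for x, y in zip(col_a, col_b)] for col_a, col_b in zip(a, b)]


# Two toy motifs that agree on the first position and differ on the second.
motif_x = [[10, 0, 0, 0], [0, 0, 10, 0]]
motif_y = [[10, 0, 0, 0], [0, 10, 0, 0]]

print(information_content(motif_x))                         # 4.0
print(information_content(motif_y))                         # 4.0
print(information_content(merge_counts(motif_x, motif_y)))  # 3.0
```

Each toy motif carries 4 bits, while the merged matrix carries 3 bits because the second position becomes ambiguous between two bases. How the relative differences in panel D are normalised is not specified in the caption, so that step is left out.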

References

    1. Stormo G.D. DNA binding sites: representation and discovery. Bioinformatics. 2000; 16:16–23.
    2. Schneider T.D., Stormo G.D., Gold L., Ehrenfeucht A. Information content of binding sites on nucleotide sequences. J. Mol. Biol. 1986; 188:415–431.
    3. Weirauch M.T., Cote A., Norel R., Annala M., Zhao Y., Riley T.R., Saez-Rodriguez J., Cokelaer T., Vedenko A., Talukder S. et al. Evaluation of methods for modeling transcription factor sequence specificity. Nat. Biotechnol. 2013; 31:126–134.
    4. Jolma A., Yan J., Whitington T., Toivonen J., Nitta K.R., Rastas P., Morgunova E., Enge M., Taipale M., Wei G. et al. DNA-binding specificities of human transcription factors. Cell. 2013; 152:327–339.
    5. Mathelier A., Wasserman W.W. The next generation of transcription factor binding site prediction. PLoS Comput. Biol. 2013; 9:e1003214.
