Nucleic Acids Res. 2014 Dec 16;42(22):13500-12.
doi: 10.1093/nar/gku1228. Epub 2014 Nov 26.

Prediction of DNA binding motifs from 3D models of transcription factors; identifying TLX3 regulated genes

Mario Pujato et al. Nucleic Acids Res.

Abstract

Proper cell functioning depends on the precise spatio-temporal expression of its genetic material. Gene expression is controlled to a great extent by sequence-specific transcription factors (TFs). Our current knowledge of where and how TFs bind and associate to regulate gene expression is incomplete. A structure-based computational algorithm (TF2DNA) is developed to identify binding specificities of TFs. The method constructs homology models of TFs bound to DNA and assesses the relative binding affinity for all possible DNA sequences using a knowledge-based potential, after optimization in a molecular mechanics force field. TF2DNA predictions were benchmarked against experimentally determined binding motifs. Success rates range from 45% to 81% and primarily depend on the sequence identity of aligned target sequences and template structures. TF2DNA was used to predict 1321 motifs for 1825 putative human TF proteins, facilitating the reconstruction of most of the human gene regulatory network. As an illustration, the predicted DNA binding site for the poorly characterized T-cell leukemia homeobox 3 (TLX3) TF was confirmed with gel shift assay experiments. TLX3 motif searches in human promoter regions identified a group of genes enriched in functions relating to hematopoiesis, tissue morphology, endocrine system and connective tissue development and function.

Figures

Figure 1.
Flowchart of the TF2DNA method. For each TF target sequence a profile HMM is built. The HMM of the target is aligned to all pre-calculated template HMMs in our collection of manually curated TF-DNA structures. The best alignment is identified using 100% coverage of the template binding site and highest sequence identity of the aligned region. The obtained alignment is then used to generate a homology model of the target sequence. Alternative TF-DNA complex models are obtained by swapping the DNA bases by all possible sequences of length k (length of the DNA in the model structure). Using a knowledge-based atomistic pair potential we score all 4^k resulting TF-DNA interfaces. The scores are normalized in the range 0–1 and a cutoff is applied to identify the group of sequences that define the target TF binding sites. The resulting binding sites are used to model the binding motif as a position weight matrix.
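The last steps of the Figure 1 flowchart (enumerate all 4^k sequences, score, normalize to 0–1, apply a cutoff, build a position weight matrix) can be illustrated with a minimal Python sketch. It is not the TF2DNA implementation. The function `toy_score` is a hypothetical stand-in for the knowledge-based pair potential, and the min-max normalization, the "lower score is more favorable" convention, the cutoff value and the plain frequency-based matrix are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"


def enumerate_sequences(k):
    """Yield all 4**k DNA sequences of length k."""
    for combo in product(BASES, repeat=k):
        yield "".join(combo)


def normalize_scores(scores):
    """Min-max rescale a {sequence: score} mapping to the range 0-1 (assumed scheme)."""
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    if span == 0:
        return {seq: 0.0 for seq in scores}
    return {seq: (value - lo) / span for seq, value in scores.items()}


def select_binding_sites(norm_scores, cutoff):
    """Keep sequences at or below the cutoff (assumption: lower score = more favorable)."""
    return [seq for seq, value in norm_scores.items() if value <= cutoff]


def build_pwm(sites):
    """Build a position weight matrix as per-column base frequencies."""
    k = len(sites[0])
    pwm = []
    for i in range(k):
        counts = Counter(site[i] for site in sites)
        pwm.append({base: counts[base] / len(sites) for base in BASES})
    return pwm


def toy_score(seq, preferred="TAAT"):
    """Hypothetical scoring function: number of mismatches to a preferred site."""
    return sum(a != b for a, b in zip(seq, preferred))


k = 4
scores = {seq: toy_score(seq) for seq in enumerate_sequences(k)}
norm = normalize_scores(scores)
sites = select_binding_sites(norm, cutoff=0.25)  # illustrative cutoff
pwm = build_pwm(sites)
```

With this toy score, the selected sites are the preferred sequence plus its single-base variants, so each column of `pwm` is dominated by the preferred base. In the method described in the caption, the scores would come from the atomistic potential evaluated on the modeled TF-DNA interfaces.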
Figure 2.
Performance of TF2DNA at predicting TF binding motifs. (A) Performance of the TF2DNA method as measured by the percent of correctly predicted test cases. A prediction is correct when the predicted and experimentally determined sequence motifs (from JASPAR and/or UniPROBE) show a similarity Z-score of 2 or higher, which means that the two compared motifs are similar at the 95% confidence level. Performance is shown for eight test sets of TFs, which differ in their target-template sequence identity. (B) Comparison of performance between the Robertson–Varani knowledge-based potential (gray bars) and the RosettaDNA potential (blue bars), presented as in panel A. Each test-set bin contains 10 randomly chosen cases. Here, the plot shows the percent of test cases above a motif similarity Z-score of 1 (i.e. a correct prediction is already assumed at the 66% confidence level). This lower confidence level was chosen to enhance the signal produced by RosettaDNA, which did not predict any motif correctly when the Z-score threshold was set at 2. (C) Four examples of predicted motifs at different TF target-template sequence identities are shown with sequence logo representations: Egr1 (early growth response protein 1), Otp (orthopedia homolog from D. melanogaster), Foxa2 (forkhead box A2 protein) and Six4 (sine oculis-related homeobox 4 homolog from D. melanogaster). The sequence identities to their templates (and their database motif similarity Z-scores) are: 100% (10), 69% (4.2), 49% (10) and 31% (2.1), respectively. (D) Boxplots showing the distributions of template interface conservation (TIC), measured as the percent target-template sequence identity of residues in direct contact with DNA bases (within 4.5 Å of any base atom in the template structure). (E) Boxplots of the distributions of residue contact energies in the modeled structures as estimated by ProSA energy scores (71).
Boxplot interpretation: filled squares show averages, boxes display quartiles, whiskers are at 5% and 95% of data and crosses show minimum and maximum values.
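The TIC measure defined for panel D can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the input layout (protein atoms tagged with an alignment column index, DNA base atoms as bare coordinates), the function names and the toy coordinates are all hypothetical. Only the 4.5 Å contact distance and the definition of TIC as percent identity over DNA-contacting residues come from the caption.

```python
import math

CONTACT_CUTOFF = 4.5  # Å, contact distance stated in the caption


def interface_positions(protein_atoms, base_atoms, cutoff=CONTACT_CUTOFF):
    """Return alignment columns of template residues with any atom within
    `cutoff` of any DNA base atom.

    protein_atoms: list of (alignment_column, (x, y, z)) tuples (assumed layout)
    base_atoms:    list of (x, y, z) tuples for DNA base atoms
    """
    contacts = set()
    for column, xyz in protein_atoms:
        if any(math.dist(xyz, base) <= cutoff for base in base_atoms):
            contacts.add(column)
    return sorted(contacts)


def template_interface_conservation(target_aln, template_aln, positions):
    """Percent identity between aligned target and template over interface columns."""
    if not positions:
        raise ValueError("no interface positions found")
    identical = sum(
        1
        for p in positions
        if target_aln[p] == template_aln[p] and target_aln[p] != "-"
    )
    return 100.0 * identical / len(positions)


# Toy data (hypothetical): five aligned residues, one atom each.
template_aln = "RKHNE"
target_aln = "RKQNA"
protein_atoms = [
    (0, (0.0, 0.0, 0.0)),
    (1, (3.0, 0.0, 0.0)),
    (2, (0.0, 4.0, 0.0)),
    (3, (10.0, 10.0, 10.0)),
    (4, (20.0, 0.0, 0.0)),
]
base_atoms = [(0.0, 0.0, 1.0), (4.0, 0.0, 0.0)]

positions = interface_positions(protein_atoms, base_atoms)
tic = template_interface_conservation(target_aln, template_aln, positions)
```

In the toy data, three residues fall within the contact distance and two of them are identical between target and template. A real calculation would take the atoms from the template TF-DNA structure and the residue pairing from the target-template alignment.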
Figure 3.
TF2DNA predictions are template independent. Examples of template sequence independence of the motif predictions. The first row shows the motifs produced by the TF protein sequences of the templates: Klf4 (gut Kruppel-like factor 4), Egr1 (early growth response protein 1), PBX1 (pre-B-cell leukemia homeobox) and eve (even-skipped). The second row shows the predicted motifs for the TF target sequences: SFP1 (Split finger protein 1), hb (hunchback protein), CG11617 (unknown protein with homology predicted TF function) and Duxl (double homeobox B-like protein). Each target sequence was modeled using the template structure in the same column. Their respective TIC values are indicated in parentheses. The third row shows their experimentally determined motifs according to the JASPAR or UniPROBE databases. Expected motifs are correctly predicted (with Z-scores above 2) even in cases such as SFP1, where the residues at the protein-DNA interface were completely replaced.
Figure 4.
Experimental validation of predicted human TLX3 binding sequences. Left panel: Prediction of binding preferences for the human TLX3 TF. The top three predicted binding sequences, which were used for EMSA assays, are displayed as well as the consensus binding motif. The sequence highlighted in blue showed binding to TLX3 in the EMSA assay. Right panel: Results of the EMSA assay using sequence #2 (predicted sequence highlighted in blue), referred to as probe. Lane 1: Negative control (probe only). Lane 2: Biotinylated probe. Lane 3: Cold probe (competition assay). Lane 4: Scrambled probe. The yellow circle marks the shifted, bound motif with TLX3.
Figure 5.
Predicted function of the TLX3 TF. (A) Protein expression levels of TLX3 as reported in the Human Protein Atlas (60). The tissue types were broadly grouped and the percentages of observed expression levels were calculated for the tested subtissues within each category. Detailed expression levels in subtissues are presented in Supplementary Table S9. (B) Ingenuity pathway analysis of observed target genes of TLX3 obtained with the TF2DNA predicted binding motif. The figure shows the five most significantly enriched networks in the physiological system development and function category. Sphere sizes are proportional (logarithmic scale) to the number of genes populating the category. The TLX3 target genes that were enriched within this category are listed in Supplementary Table S12.
