Sci Rep. 2017 Aug 8;7(1):7596.
doi: 10.1038/s41598-017-07761-0.

Modelling the evolution of transcription factor binding preferences in complex eukaryotes

Antonio Rosanova et al. Sci Rep.

Abstract

Transcription factors (TFs) exert their regulatory action by binding to DNA with specific sequence preferences. However, different TFs can partially share their binding sequences because of their common evolutionary origin. This "redundancy" of binding provides a way of organizing TFs into "motif families" by grouping TFs with similar binding preferences. Since these preferences ultimately define the TF target genes, the motif family organization carries information about the structure of transcriptional regulation as it has been shaped by evolution. Focusing on the human TF repertoire, we show that a one-parameter evolutionary model of the Birth-Death-Innovation type can explain the empirical partition of TFs into motif families, and highlights the relevant evolutionary forces at the origin of this organization. Moreover, the model allows us to pinpoint a few deviations from the neutral scenario it assumes: three over-expanded families (including HOX and FOX genes), a set of "singleton" TFs for which duplication seems to be selected against, and a higher-than-average rate of diversification of the binding preferences of TFs with a Zinc Finger DNA binding domain. Finally, a comparison of the TF motif family organization in different eukaryotic species suggests that redundancy of binding increases with organism complexity.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Figure 1
Graphical representation of motif families. The table summarizes the organization of the DBD families in motif families. Each vertex is a transcription factor, while links connect two TFs if they share at least one Position Weight Matrix ID. Colors identify the node degree with a color code (reported in the legend) spanning from blue, corresponding to degree 0 (isolated nodes), to red, corresponding to the maximal degree, i.e., the vertex is connected to all other nodes in the family. The circular layout highlights those families that are cliques. See Supplementary Material for reference to detailed family composition.
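The graph construction described in this caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the TF names and PWM IDs are invented, the function names are hypothetical, and motif families are taken to be the connected components of the graph in which two TFs are linked when they share at least one PWM ID.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(tf_to_pwms):
    """Link two TFs when they share at least one PWM ID."""
    adjacency = {tf: set() for tf in tf_to_pwms}
    pwm_to_tfs = defaultdict(set)
    for tf, pwms in tf_to_pwms.items():
        for pwm in pwms:
            pwm_to_tfs[pwm].add(tf)
    # Every pair of TFs listed under the same PWM ID gets an edge.
    for tfs in pwm_to_tfs.values():
        for a, b in combinations(sorted(tfs), 2):
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def motif_families(adjacency):
    """Return connected components (assumed here to be the motif families)."""
    seen, families = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        families.append(component)
    return families


def is_clique(family, adjacency):
    """True when every member is linked to all other members of the family."""
    return all(len(adjacency[tf] & family) == len(family) - 1 for tf in family)


# Hypothetical toy data: names and PWM IDs are invented for illustration only.
toy = {
    "TF_A": {"M1", "M2"},
    "TF_B": {"M2"},
    "TF_C": {"M2", "M3"},
    "TF_D": {"M3"},
    "TF_E": {"M9"},
}
graph = build_graph(toy)
families = motif_families(graph)
degrees = {tf: len(neighbours) for tf, neighbours in graph.items()}
```

In this toy input, a TF of degree 0 forms a family of size 1 (an isolated node in the figure), and a family is a clique exactly when every member reaches the maximal degree, that is, the family size minus one.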
Figure 2
Size distribution of motif families in human. The distribution accounts for 906 human TFs, organized in 424 families whose members share at least one PWM with at least one other member. The inset is a zoom on the range of sizes >5. The red line is the best-fit model according to maximum likelihood estimation, which has a goodness-of-fit p-value p < 0.0001. The model captures the general trend, but clearly underestimates the number of families of size 1 and does not predict the presence of the largest families.
Figure 3
(a) Heatmap for the goodness-of-fit p-value as the data sample is reduced. On the x-axis, T indicates the threshold in size above which families are excluded from the sample. On the y-axis, N_s indicates the number of families of size one excluded from the sample. An increase in T or N_s reduces the sample size under analysis by reducing the number of TFs considered. For each sample size, a goodness-of-fit test for the best-fit model was performed, and the corresponding p-value is reported with the color code in the legend. Taking a p-value of 0.75 as the acceptance limit identifies T = 25 as the size threshold at which the fit is acceptable. This corresponds to the exclusion of the three largest families. For such a threshold T, the optimal values for the p-value are reached for values of N_s in the range 40 < N_s < 80. (b) Size distribution of motif families for the reduced sample. The filled distribution represents the motif family size distribution for the reduced dataset, while the original empirical distribution of Fig. 2 is reported with the unfilled bars. The inset shows a zoom on the range of sizes >5. The line represents the prediction of our model with the best-fit choice of the parameter θ, which fits the data in the reduced sample very well, with a goodness-of-fit p-value p ~ 0.8. The best-fit value θ = 0.73 does not differ substantially from the value obtained by fitting the whole empirical sample as in Fig. 2.
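The sample reduction scanned in panel (a) can be illustrated with a short sketch. It covers only the filtering step, read here as dropping families whose size exceeds T and removing N_s families of size one; the goodness-of-fit test is not reproduced because the model equations are not given in this excerpt. The function name and the toy sizes are hypothetical.

```python
def reduce_sample(family_sizes, T, N_s):
    """Drop families larger than T and remove N_s families of size one."""
    kept = [size for size in family_sizes if size <= T]
    singletons = sum(1 for size in kept if size == 1)
    if N_s > singletons:
        raise ValueError("cannot exclude more singletons than are present")
    non_singletons = [size for size in kept if size != 1]
    return sorted([1] * (singletons - N_s) + non_singletons)


# Hypothetical family sizes, invented for illustration (not the paper's data).
toy_sizes = [1, 1, 1, 1, 2, 2, 3, 5, 30, 40]
reduced = reduce_sample(toy_sizes, T=25, N_s=2)
n_tfs_remaining = sum(reduced)  # number of TFs left in the reduced sample
```

Scanning a grid of (T, N_s) pairs with such a function and refitting the model on each reduced sample would give one p-value per cell of the heatmap.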
Figure 4
Splitting of DBD families into motif families. The ratio F/N is plotted as a function of the number of TFs of the DBD family. Each point represents the empirical value for a DBD family, while the dashed line represents the expected value for θ = 0.74 as given by Eq. 4. To evaluate the fluctuations around the expectation, we simulated the evolution of 5 × 10^4 DBD families, with starting size ranging from 1 to 500, θ = 0.74 and λ = δ. The two shaded areas correspond to 1 standard deviation and 3 standard deviations from the average. Green diamond: Zinc Finger C2H2 family. Cyan diamond: Homeobox family. Red diamond: Forkhead family.
Figure 5
Size distribution of TF motif families for different eukaryotic organisms. (a–e) We report the distributions and the best-fit values of θ for five different organisms of increasing complexity. As for the human case, the data are taken from the CIS-BP database. (f) The bottom-right panel shows how θ scales with the number of TFs. The red line is the fit θ ~ (# TFs)^0.85.
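A scaling relation such as the one in panel (f), read here as a power law in the number of TFs with exponent 0.85, is commonly estimated by a straight-line fit in log-log space. The sketch below assumes ordinary least squares on the logarithms; the caption does not say which fitting procedure the authors used, and the input values are synthetic points generated from an exact power law, not data from the paper.

```python
import math


def fit_power_law(x, y):
    """Fit y = prefactor * x**exponent by least squares on log-transformed data."""
    log_x = [math.log(v) for v in x]
    log_y = [math.log(v) for v in y]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    s_xx = sum((a - mean_x) ** 2 for a in log_x)
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    exponent = s_xy / s_xx
    prefactor = math.exp(mean_y - exponent * mean_x)
    return exponent, prefactor


# Synthetic check only: points lie exactly on y = 0.01 * x**0.85 (invented values).
n_tfs = [100, 300, 600, 900]
theta = [0.01 * n ** 0.85 for n in n_tfs]
exponent, prefactor = fit_power_law(n_tfs, theta)
```

On noiseless synthetic input the routine recovers the exponent and prefactor used to generate the points; with real, noisy estimates of θ the recovered exponent would carry an uncertainty that this sketch does not compute.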
