PLoS One. 2017 Nov 15;12(11):e0187243.
doi: 10.1371/journal.pone.0187243. eCollection 2017.

Nucleotide Patterns Aiding in Prediction of Eukaryotic Promoters

Free PMC article

Martin Triska et al. PLoS One. 2017.


Computational analysis of promoters is hindered by the complexity of their architecture. In less-studied genomes with complex organization, false-positive promoter predictions are common. Accurate identification of transcription start sites (TSS) and core promoter regions remains an unsolved problem. In this paper, we present a comprehensive analysis of genomic features associated with promoters and show that models driven by probabilistic integrative algorithms allow accurate classification of DNA sequences into "promoters" and "non-promoters", even in the absence of full-length cDNA sequences. These models may be built upon maps of the distributions of sequence polymorphisms, RNA sequencing reads on genomic DNA, methylated nucleotides, and transcription factor binding sites, as well as the relative frequencies of nucleotides and their combinations. Positional clustering of binding sites shows that the cells of Oryza sativa utilize three distinct classes of transcription factors (TFs): 188 "promoter-specific" TFs that bind preferentially to the [-500,0] region, 282 "5' UTR-specific" TFs that bind preferentially to the [0,500] region, and 207 "promiscuous" TFs with little or no location preference with respect to the TSS. For the most informative motifs, positional preferences are conserved between dicots and monocots.
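The three-way split of transcription factors described in the abstract can be illustrated with a minimal sketch. It is not the authors' positional clustering procedure: it assumes a simple one-sided binomial test on site counts in the two 500-nt regions, and the function names `binom_tail` and `classify_tf`, the cutoff `alpha`, and the choice to assign position 0 to the downstream region are all hypothetical.

```python
from math import comb


def binom_tail(k, n, p=0.5):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def classify_tf(positions, alpha=0.01):
    """Label a TF by where its binding sites fall relative to the TSS (position 0).

    positions: site offsets in nt; negative values are upstream of the TSS.
    alpha: hypothetical significance cutoff, not taken from the paper.
    Sites outside [-500, 500] are ignored.
    """
    upstream = sum(1 for p in positions if -500 <= p < 0)
    downstream = sum(1 for p in positions if 0 <= p <= 500)  # position 0 counted here (assumption)
    n = upstream + downstream
    if n == 0:
        return "promiscuous"  # no usable sites, so no detectable preference
    if binom_tail(upstream, n) < alpha:
        return "promoter-specific"
    if binom_tail(downstream, n) < alpha:
        return "5' UTR-specific"
    return "promiscuous"
```

For example, `classify_tf([-300] * 20)` would be labeled "promoter-specific" under these assumptions, while an even mix of upstream and downstream sites would be labeled "promiscuous".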

Conflict of interest statement

Competing Interests: AK is the Founder and Chief Scientific Officer of GeneXplain GmbH. VS is the Chief Scientific Officer of Softberry, Inc. This does not alter our adherence to PLOS ONE policies on sharing data and materials.


Figures

Fig 1. RNA-Seq coverage near 12 randomly selected promoters with experimentally validated transcription start sites.
Fig 2. Features of the nucleotide consensus around TSS.
A (top left): frequency of CA; B (top right): frequency of the TATA motif; D (middle left): frequencies of nucleotides A, C, G, T around the TSS for Fgenesh; E (middle right): frequencies of A, C, G, T around the TSS for MSU; F (bottom): CG skew, CGskew = (#C - #G)/(#C + #G), calculated in a window of 40 nt.
Fig 3. Examples of observed and expected occurrences of TFBS in rice promoters.
Observations differ from expectations: TCP15, LIM1, HBP1A, TCP23, ARALY493022, AT1G26610, TFIIAL, BZIP910, CBF1, DREB1F, STY1. Observations agree with expectations: CMTA2, GATA1, SBF1, WRKY48.
Fig 4. Positional specificity of TFBS distribution.
Fig 5. The distribution pattern for MADSB binding sites highlights the start codon (ATG) rather than the respective TSS.
Fig 6. Frequency distributions of TFBS may have different patterns around the start of transcription (position 0 on the horizontal axis).
The X axis shows the distance from the TSS; the Y axis shows the frequency of the motif in each window. Frequencies of ARALY493022_04 TFBS (Class 1) are plotted in the left panel, RAP26_03 TFBS (Class 2) in the middle panel, and MYB111_02 TFBS (Class 3) in the right panel.
Fig 7. Relationship between information content of TFBS positions in rice, corn and Arabidopsis.
Each point corresponds to one transcription factor; the X axis shows information content in rice, the Y axis shows information content in corn and Arabidopsis.
Fig 8. Assessment of promoter prediction quality in Arabidopsis (left) and corn (right).
The Arabidopsis genome shows a more pronounced consensus at the TSS, with a higher frequency of the TATA motif at -30 and of CA at the TSS.
Fig 9. An example of five distinct TFBS entries in the TRANSFAC database with very similar position weight matrices (PWMs).
Fig 10. Frequency of SNPs located near the TSS in rice.
Fig 11. RNA-Seq coverage near the transcription start site.
Fig 12. Methylation around the transcription start site in rice in different sequence contexts.
Red: CG; green: CHG; blue: CHH, where H denotes an A, C or T nucleotide.
Fig 13. Basic CNN architecture that was used in building promoter models implemented in the program [3, 10].
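
The CG skew plotted in Fig 2 (panel F) is defined as (#C - #G)/(#C + #G) over a 40-nt window. A minimal sketch of that calculation follows; the function name `cg_skew`, the one-nucleotide window step, and the choice to report 0.0 for windows containing neither C nor G are assumptions for illustration, not details taken from the paper.

```python
def cg_skew(seq, window=40):
    """Return the CG skew, (#C - #G) / (#C + #G), for every window start in seq.

    The window slides one nucleotide at a time (assumption).
    Windows with no C or G yield 0.0 to avoid division by zero (assumption).
    """
    seq = seq.upper()
    skews = []
    for start in range(len(seq) - window + 1):
        chunk = seq[start:start + window]
        c = chunk.count("C")
        g = chunk.count("G")
        skews.append((c - g) / (c + g) if c + g else 0.0)
    return skews
```

A sequence shorter than the window returns an empty list, so callers may want to check the input length first.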


References

    1. Sandelin A, Carninci P, Lenhard B, Ponjavic J, Hayashizaki Y, Hume DA. Mammalian RNA polymerase II core promoters: insights from genome-wide studies. Nat Rev Genet. 2007;8(6):424–36. doi: 10.1038/nrg2026.
    2. Solovyev VV, Shahmuradov IA, Salamov AA. Identification of promoter regions and regulatory sites. Methods Mol Biol. 2010;674:57–83. doi: 10.1007/978-1-60761-854-6_5.
    3. Shahmuradov IA, Umarov RK, Solovyev VV. TSSPlant: a new tool for prediction of plant Pol II promoters. Nucleic Acids Res. 2017. doi: 10.1093/nar/gkw1353.
    4. Troukhan M, Tatarinova T, Bouck J, Flavell RB, Alexandrov NN. Genome-wide discovery of cis-elements in promoter sequences using gene expression. OMICS. 2009;13(2):139–51. doi: 10.1089/omi.2008.0034.
    5. Tatarinova T, Kryshchenko A, Triska M, Hassan M, Murphy D, Neely M, et al. NPEST: a nonparametric method and a database for transcription start site prediction. Quant Biol. 2014;1(4):261–71. doi: 10.1007/s40484-013-0022-2; PubMed Central PMCID: PMC4156414.

Grant support

AK was supported by a grant of the Federal Targeted Program “Research and development on priority directions of science and technology in Russia, 2014–2020”, Contract № 14.604.21.0101, unique identifier of the applied scientific project: RFMEFI60414X0101. AK's work was also supported by the following grants of the EU FP7 program: “SYSCOL”, “SysMedIBD”, “RESOLVE” and “MIMOMICS”. TT and MT were supported by the NSF Division of Environmental Biology (1456634). TT, MT and AB were supported by NSF STTR award 1622840. Additional funding was provided by GeneXplain GmbH in the form of salary for AK, and by Softberry, Inc. in the form of salary for VS. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. The specific roles of these authors are articulated in the ‘author contributions’ section.