Background: Core promoters are the gene regulatory regions most proximal to the transcription start site (TSS) and are central to the formation of pre-initiation complexes and to combinatorial gene regulation. The DNA elements required for core promoter function in plants are poorly understood. To establish the sequence motifs that characterize plant core promoters and to compare them to the corresponding sequences in animals, we took advantage of available full-length cDNAs (FL-cDNAs) and predicted upstream regulatory sequences to analyze 12,749 Arabidopsis core promoters.
Results: Using a combination of expectation maximization and Gibbs sampling methods, we identified several motifs over-represented in Arabidopsis core promoters. One of them corresponded to the TATA element, for which an in-depth analysis yielded robust TATA Nucleotide Frequency Matrices (NFMs) capable of predicting Arabidopsis TATA elements with a high degree of confidence. We established that approximately 29% of all Arabidopsis promoters contain TATA motifs, clustered around position -32 with respect to the TSS. The presence of TATA elements was associated with genes represented more frequently in EST collections and with shorter 5' UTRs. No cis-elements were found to be over-represented in TATA-less promoters relative to TATA-containing promoters.
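As an illustration of how a nucleotide frequency matrix can be used to locate a candidate TATA element, the following minimal sketch converts a count matrix into log-odds scores and scans an upstream sequence for the best-scoring window. The counts in `COUNTS`, the uniform background of 0.25, the pseudocount, and the function names are all hypothetical choices made for this example; they are not the matrices or the scoring scheme derived in this study.

```python
import math

# Hypothetical counts for an 8-bp TATA-like motif (illustrative only,
# NOT the NFM reported in this work). Each column sums to 100.
COUNTS = {
    "A": [2, 90, 3, 85, 60, 88, 40, 50],
    "C": [8, 3, 2, 3, 2, 3, 10, 15],
    "G": [5, 3, 2, 3, 3, 5, 30, 20],
    "T": [85, 4, 93, 9, 35, 4, 20, 15],
}


def log_odds_matrix(counts, background=0.25, pseudocount=1.0):
    """Turn per-position counts into log2(frequency / background) scores."""
    width = len(counts["A"])
    matrix = []
    for i in range(width):
        total = sum(counts[b][i] for b in "ACGT") + 4 * pseudocount
        column = {
            b: math.log2(((counts[b][i] + pseudocount) / total) / background)
            for b in "ACGT"
        }
        matrix.append(column)
    return matrix


def best_hit(sequence, matrix):
    """Return (start_index, score) of the highest-scoring window, or None."""
    sequence = sequence.upper()
    width = len(matrix)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(b not in "ACGT" for b in window):
            continue  # skip windows containing ambiguous bases
        score = sum(matrix[i][b] for i, b in enumerate(window))
        if best is None or score > best[1]:
            best = (start, score)
    return best


def position_relative_to_tss(start_index, upstream_length):
    """Map a 0-based index to a TSS-relative position.

    Assumes the last base of the upstream sequence is position -1.
    """
    return start_index - upstream_length


# Toy 50-bp upstream region with a TATA-like motif planted at index 18.
upstream = "GC" * 9 + "TATAAATA" + "GC" * 12
matrix = log_odds_matrix(COUNTS)
start, score = best_hit(upstream, matrix)
position = position_relative_to_tss(start, len(upstream))
print(f"best window starts at {position} relative to the TSS, score {score:.2f}")
```

In practice a score threshold would be needed to separate TATA-containing from TATA-less promoters; how such a threshold is chosen is not specified here and would have to be calibrated against known sites.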
Conclusion: Our studies provide a first genome-wide illustration of the composition and structure of Arabidopsis core promoters. The percentage of TATA-containing promoters is much lower than commonly recognized, yet comparable to the fraction of Drosophila promoters containing a TATA element. Although several other DNA elements were identified as over-represented in Arabidopsis promoters, they are present in only a small fraction of the genes and they represent elements not previously described in animals, suggesting that the core promoters of plant and animal genes have distinct architectures.