Background: Bacteria show a bias in their genomic oligonucleotide composition far beyond that dictated by G+C content. Patterns of over- and underrepresented oligonucleotides carry a phylogenetic signal and are thus diagnostic for individual species. Patterns of short oligomers have been investigated by multiple groups in large numbers of bacteria genomes. However, global distributions of the most highly overrepresented mid-sized oligomers have not been assessed across all prokaryotes to date. We surveyed overrepresented mid-length oligomers across all prokaryotes and normalised for base composition and embedded oligomers using zero and second order Markov models.
Principal findings: Here we report a presumably ancient set of oligomers conserved and overrepresented in nearly all branches of prokaryotic life, including Archaea. These oligomers are either adenine rich homopurines with one to three guanine nucleosides, or homopyridimines with one to four cytosine nucleosides. They do not show a consistent preference for coding or non-coding regions or aggregate in any coding frame, implying a role in DNA structure and as polypeptide binding sites. Structural parameters indicate these oligonucleotides to be an extreme and rigid form of B-DNA prone to forming triple stranded helices under common physiological conditions. Moreover, the narrow minor grooves of these structures are recognised by DNA binding and nucleoid associated proteins such as HU.
Conclusion: Homopurine and homopyrimidine oligomers exhibit distinct and unusual structural features and are present at high copy number in nearly all prokaryotic lineages. This fact suggests a non-neutral role of these oligonucleotides for bacterial genome organization that has been maintained throughout evolution.