BMC Genomics. 2008 Oct 30;9:509.
doi: 10.1186/1471-2164-9-509.

Sequence Space Coverage, Entropy of Genomes and the Potential to Detect Non-Human DNA in Human Samples

Free PMC article

Zhandong Liu et al. BMC Genomics.


Background: Genomes store information for building and maintaining organisms. Complete sequencing of many genomes provides the opportunity to study and compare global information properties of those genomes.

Results: We have analyzed aspects of the information content of Homo sapiens, Mus musculus, Drosophila melanogaster, Caenorhabditis elegans, Arabidopsis thaliana, Saccharomyces cerevisiae, and Escherichia coli (K-12) genomes. Virtually all possible (> 98%) 12 bp oligomers appear in vertebrate genomes while < 2% of 19 bp oligomers are present. In the other species, coverage falls from > 98% to < 2% of possible oligomers over different length ranges: D. melanogaster (12-17 bp), C. elegans (11-17 bp), A. thaliana (11-17 bp), S. cerevisiae (10-16 bp) and E. coli (9-15 bp). Frequencies of unique oligomers in the genomes follow similar patterns. We identified a set of 2.6 million 15-mers that are more than 1 nucleotide different from all 15-mers in the human genome and so could be used as probes to detect microbes in human samples. In a human sample, these probes would detect 100% of the 433 currently fully sequenced prokaryotes and 75% of the 3065 fully sequenced viruses. The human genome is significantly more compact in sequence space than a random genome. We identified the most frequent 5- to 20-mers in the human genome, which may prove useful as PCR primers. We also identified a bacterium, Anaeromyxobacter dehalogenans, which has an exceptionally low diversity of oligomers given the size of its genome and its GC content. The entropy of coding regions in the human genome is significantly higher than that of non-coding regions and of whole chromosomes. However, chromosomes 1, 2, 9, 12 and 14 have a relatively high proportion of coding DNA without high entropy, and chromosome 20 is the opposite, with a low frequency of coding regions but relatively high entropy.

Conclusion: Measures of the frequency of oligomers are useful for designing PCR assays and for identifying chromosomes and organisms with hidden structure that had not been previously recognized. This information may be used to detect novel microbes in human tissues.
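The probe criterion in the Results (15-mers more than 1 nucleotide different from every human 15-mer) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names are hypothetical, candidates are drawn from a supplied target sequence rather than from all possible 15-mers, only one strand is considered, and the whole host k-mer set is held in memory, which would not scale to a real human genome without a more compact index.

```python
BASES = "ACGT"

def kmers(seq, k):
    """Yield every k-mer of seq, in order (single strand; an assumption)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def within_one_mismatch(kmer, host_kmers):
    """True if kmer, or any single-base substitution of it, is a host k-mer."""
    if kmer in host_kmers:
        return True
    for i, base in enumerate(kmer):
        for alt in BASES:
            if alt != base and kmer[:i] + alt + kmer[i + 1:] in host_kmers:
                return True
    return False

def select_probes(target_seq, host_seq, k=15):
    """Return target k-mers that differ from every host k-mer at 2+ positions."""
    host = set(kmers(host_seq, k))
    return sorted({c for c in kmers(target_seq, k)
                   if not within_one_mismatch(c, host)})

# Toy example with k=3: the host contains only "AAA".
# "AAC" is one substitution away and is rejected; "ACC" and "CCC" are kept.
print(select_probes("AACCC", "AAAA", k=3))
```

Checking the 3k single-base neighbours of each candidate against a hash set avoids comparing every candidate with every host k-mer.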


Figure 1
The percentage of all possible n-mers (coverage) that appear in the H. sapiens, M. musculus, D. melanogaster, C. elegans, A. thaliana, S. cerevisiae, E. coli K-12, theoretical and pseudo-human genomes. Theo-human is the maximum coverage a human-length genome could achieve if every n-mer in its genome were unique. The pseudo-human (pseudo-hs) genome is a random genome generated with the same length and dinucleotide frequencies as the human genome. The sequence space coverage of each genome listed above is plotted against the length of the oligomer analyzed, ranging from 1 to 20.
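The coverage measure in this caption can be sketched as follows. This is an illustrative reading, not the authors' code. It assumes coverage is the number of distinct n-mers divided by 4^n, that only one strand is scanned, and that windows containing non-ACGT characters are skipped. The `max_coverage` bound is an inference from the "Theo-human" description: a sequence of length L has at most L - n + 1 windows, so coverage cannot exceed min(1, (L - n + 1) / 4^n).

```python
VALID = set("ACGT")

def distinct_kmers(seq, n):
    """Set of distinct n-mers in seq, skipping windows with non-ACGT symbols."""
    seq = seq.upper()
    found = set()
    for i in range(len(seq) - n + 1):
        window = seq[i:i + n]
        if set(window) <= VALID:
            found.add(window)
    return found

def coverage(seq, n):
    """Fraction of the 4**n possible n-mers that occur in seq."""
    return len(distinct_kmers(seq, n)) / 4 ** n

def max_coverage(length, n):
    """Upper bound on coverage if every window held a different n-mer."""
    if length < n:
        return 0.0
    return min(1.0, (length - n + 1) / 4 ** n)

# Toy example: "ACGTA" contains AC, CG, GT, TA, i.e. 4 of the 16 possible 2-mers.
print(coverage("ACGTA", 2), max_coverage(5, 2))
```

Because the bound depends only on sequence length, it marks where the coverage curve must start to fall. For a genome of about 3 billion bases it reaches 1 only for n up to 15, since 4^16 already exceeds 3 billion.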
Figure 2
(a) Coverage of 10-mer sequence space as a function of genome size in 433 fully sequenced microbial genomes. The legend for the color-coding of GC content appears on the right. Smaller genomes have lower GC content. Anaeromyxobacter dehalogenans is an outlier with unusually low coverage for its genome size and GC content (outside of the 99.9% predicted interval). (b) A histogram for the proportion of the 10-mer sequence space covered by each of the 433 fully sequenced microbial genomes.
Figure 3
The percentage of n-mers that appeared exactly once (unique hits), out of all the n-mers detected in each genome. Slightly less than 50% of the 16-mers detected in humans are unique, whereas for E. coli a little more than 50% of 12-mers are unique.
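The unique-hit fraction described here can be sketched in a few lines. The sketch assumes the denominator is the number of distinct n-mers detected (one reading of "all the n-mers detected") and scans a single strand. The function name is hypothetical.

```python
from collections import Counter

def unique_fraction(seq, n):
    """Fraction of distinct n-mers in seq that occur exactly once."""
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    if not counts:
        return 0.0
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(counts)

# Toy example: "AAAC" has 2-mers AA (twice) and AC (once), so half are unique.
print(unique_fraction("AAAC", 2))
```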
Figure 4
The density of the human genome in sequence space. For every randomly generated n-mer that was detected in the human genome, we generated all single-base-pair variants (3n variants for each n-mer) and tested whether they were also represented in the human genome (1nn). We also generated 3n of the 2 bp variants (2nn), 3n of the 3 bp variants, and so on, up to variants that differed from the original human n-mer at 10 bp. Sequences that are only a few SNPs away from the original human n-mer are significantly more likely to be in the human genome than a random n-mer (black bars, "random"). This shows that the human genome is relatively compact in sequence space. The standard error for all points is < 0.003.
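The neighbour test in this caption can be sketched as below, under stated assumptions. All 3n single-base variants are enumerated for distance 1. For larger distances, 3n variants are sampled at random, which is one plausible reading of "3n of the 2 bp variants". The function names and the in-memory k-mer set are illustrative, not taken from the paper.

```python
import random

BASES = "ACGT"

def single_base_variants(kmer):
    """All 3n sequences that differ from kmer at exactly one position."""
    return [kmer[:i] + alt + kmer[i + 1:]
            for i, base in enumerate(kmer)
            for alt in BASES if alt != base]

def random_variant(kmer, d, rng):
    """One sequence that differs from kmer at exactly d random positions."""
    chars = list(kmer)
    for i in rng.sample(range(len(kmer)), d):
        chars[i] = rng.choice([b for b in BASES if b != chars[i]])
    return "".join(chars)

def neighbor_hit_rate(genome_kmers, kmer, d, rng):
    """Fraction of distance-d variants of kmer that are in genome_kmers."""
    if d == 1:
        variants = single_base_variants(kmer)
    else:
        variants = [random_variant(kmer, d, rng) for _ in range(3 * len(kmer))]
    return sum(v in genome_kmers for v in variants) / len(variants)

# Toy example: of the 9 single-base variants of "AAA", only "AAC" is present.
rng = random.Random(0)
print(neighbor_hit_rate({"AAA", "AAC"}, "AAA", 1, rng))
```

Averaging this hit rate over many genome n-mers, and comparing it with the hit rate of random n-mers, gives the kind of contrast the figure reports.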
Figure 5
Entropy rate, using the Lempel-Ziv 77 algorithm, for the coding sequence (red) and the genomic sequence for chromosome 20 (green), as a function of the length of the sequence analyzed. The entropy calculation converges after 10 million bases.
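As a rough illustration of a compression-based entropy estimate, the sketch below uses Python's `zlib` module. DEFLATE combines LZ77 matching with Huffman coding, so this is a proxy and not the authors' estimator. It reports compressed bits per base, which includes format overhead and is limited by DEFLATE's 32 KB window. The function name is hypothetical.

```python
import random
import zlib

def compressed_bits_per_base(seq):
    """Compressed size in bits divided by sequence length (a crude proxy)."""
    data = seq.encode("ascii")
    return 8 * len(zlib.compress(data, 9)) / len(data)

# A highly repetitive sequence should compress far better than a random one.
rng = random.Random(0)
random_seq = "".join(rng.choice("ACGT") for _ in range(20000))
repeat_seq = "ACGT" * 5000

print(compressed_bits_per_base(repeat_seq))
print(compressed_bits_per_base(random_seq))
```

Evaluating the estimate on prefixes of increasing length is one way to look for the kind of convergence the caption describes, although a windowed compressor such as DEFLATE cannot exploit repeats farther apart than its window.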
Figure 6
The entropy, or information content (solid line, left Y axis), and the percent of the sequence coding for proteins (dashed line, right Y axis, log scale) for each human chromosome as well as the full set of coding regions (CCDS). Given the higher entropy rate of coding regions relative to non-coding regions, we expect a correlation between the two measurements. However, chromosomes 1, 2, 9, 12, and 14 have a lower information content than might be expected for the percent of those chromosomes occupied by protein-coding regions. Chromosome 20 appears to have a higher entropy than would be expected given its gene-poor content. This may be a signal of extensive non-protein-coding yet functional RNA on chromosome 20.




