A nucleotide composition constraint of genome sequences

Comput Biol Chem. 2004 Apr;28(2):149-53. doi: 10.1016/j.compbiolchem.2004.02.002.


Let a, c, g and t denote the occurrence frequencies of A, C, G and T, respectively, in a genome. We calculated the statistical quantity S = a² + c² + g² + t² for each of 809 genomes (11 archaea, 42 bacteria, 3 eukaryota, 90 phages, 36 viroids and 627 viruses) and 236 plasmids. We found that the strict inequality S < 1/3 holds for almost all of these genomes and plasmids. Several deductions follow directly from this observation: (i) S is a kind of genome order index, negatively correlated with the Shannon H function; (ii) S < 1/3 suggests that each genome requires a minimal value of the Shannon H function; (iii) S may serve as a new biological statistical quantity, useful for describing the composition features of genomes; (iv) combined with Chargaff Parity Rule 2, S < 1/3 implies that the genomic G + C content should lie between 0.211 and 0.789.
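The quantities in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code. The function names, the choice of a base-2 logarithm for the Shannon H function, and the decision to ignore symbols other than A, C, G and T are all assumptions made for illustration. The G + C bounds in (iv) are reproduced by assuming Chargaff Parity Rule 2 means a = t and c = g, so that with x = G + C content, S = (x² + (1 - x)²)/2, and S < 1/3 reduces to a quadratic inequality in x.

```python
import math
from collections import Counter


def base_frequencies(seq):
    """Return the frequencies of A, C, G and T in seq.

    Assumption: any other symbol (N, gaps, ambiguity codes) is ignored.
    """
    counts = Counter(ch for ch in seq.upper() if ch in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return {base: counts[base] / total for base in "ACGT"}


def order_index(freqs):
    """S = a^2 + c^2 + g^2 + t^2, as defined in the abstract."""
    return sum(p * p for p in freqs.values())


def shannon_entropy(freqs):
    """Shannon H in bits (the base-2 logarithm is an assumption)."""
    return -sum(p * math.log2(p) for p in freqs.values() if p > 0)


def gc_bounds(s_max=1 / 3):
    """G + C range implied by S < s_max when a = t and c = g.

    With x = G + C content, S = (x^2 + (1 - x)^2) / 2, so S < s_max
    is equivalent to x^2 - x + (1 - 2 * s_max) / 2 < 0, whose roots
    are (1 +/- sqrt(4 * s_max - 1)) / 2.
    """
    root = math.sqrt(4 * s_max - 1)
    return (1 - root) / 2, (1 + root) / 2


if __name__ == "__main__":
    # Hypothetical toy sequence, not taken from the study's data set.
    freqs = base_frequencies("ATGCGCGATATTAACGGCTA")
    print("S =", order_index(freqs))
    print("H =", shannon_entropy(freqs))
    low, high = gc_bounds()
    print("G + C range:", round(low, 3), round(high, 3))  # 0.211 0.789
```

For a uniform composition (a = c = g = t = 1/4) the sketch gives S = 1/4 and H = 2 bits, the least ordered case. A sequence made of a single base gives S = 1 and H = 0, which illustrates the negative correlation between S and H noted in (i).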

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Animals
  • Archaea
  • Bacteria
  • Base Composition
  • DNA / classification
  • DNA / genetics
  • Eukaryotic Cells
  • Fungi
  • Genome*
  • Humans
  • Nucleotides / chemistry
  • Nucleotides / genetics*
  • Plasmids / genetics
  • Sequence Analysis
  • Statistics as Topic
  • Viruses

Substances

  • Nucleotides
  • DNA