Analysis on the distribution of bases in 1487 human protein coding sequences

J Theor Biol. 1994 Mar 21;167(2):161-6. doi: 10.1006/jtbi.1994.1060.


The occurrence frequencies of the bases A, C, G and T, denoted by a, c, g and t, respectively, have been calculated and analyzed for 1487 human protein coding sequences. The analysis uses a recently presented diagrammatic method in which each coding sequence is represented by a point in 3-D space. The distribution of points gives the observer an overall, intuitive picture of the base frequencies. The distance between a point and the origin of the coordinate system, which corresponds to the case a = c = g = t = 1/4, is called the radial distance. The radial distribution of the 1487 points in 3-D space has been found to be normal, with its center essentially coinciding with the origin of the coordinate system. Among the 1487 coding sequences, the empirical rule a² + c² + g² + t² < 1/3 holds for 1486. The only sequence for which the rule does not hold is the one coding for the human parathymosin protein. The amino acid composition and the structural class of this protein have been studied in some detail.
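
The quantities named in the abstract can be illustrated with a minimal Python sketch. It computes the base frequencies of a sequence, the sum of their squares, and the empirical rule check. The abstract does not give the 3-D mapping, so `point_3d` below is an assumption: it uses one common tetrahedral-style mapping that sends a = c = g = t = 1/4 to the origin, and it may differ in form or scale from the authors' method. All function names and the toy sequence are hypothetical.

```python
from collections import Counter
import math


def base_frequencies(seq):
    """Return the occurrence frequencies (a, c, g, t); other symbols are ignored."""
    counts = Counter(ch for ch in seq.upper() if ch in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return tuple(counts[base] / total for base in "ACGT")


def sum_of_squares(freqs):
    """Return a^2 + c^2 + g^2 + t^2 for a frequency tuple."""
    return sum(f * f for f in freqs)


def satisfies_rule(freqs):
    """Check the empirical rule a^2 + c^2 + g^2 + t^2 < 1/3 from the abstract."""
    return sum_of_squares(freqs) < 1 / 3


def point_3d(freqs):
    """Map (a, c, g, t) to a 3-D point.

    ASSUMPTION: this particular mapping is not taken from the abstract.
    It is chosen only because equal frequencies land on the origin.
    """
    a, c, g, t = freqs
    return ((a + g) - (c + t), (a + c) - (g + t), (a + t) - (c + g))


def distance_from_origin(freqs):
    """Euclidean distance of the assumed 3-D point from the origin."""
    return math.sqrt(sum(v * v for v in point_3d(freqs)))


if __name__ == "__main__":
    toy = "ATGGCCAAGTGA"  # hypothetical toy sequence, not data from the paper
    f = base_frequencies(toy)
    print(f, sum_of_squares(f), satisfies_rule(f), distance_from_origin(f))
```

Under the assumed mapping, the squared distance from the origin equals 4(a² + c² + g² + t²) − 1, so the rule would correspond to a distance below 1/√3. That equivalence depends on the chosen scaling and is not a statement about the authors' coordinates.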

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Base Sequence
  • Codon / genetics
  • Humans
  • Models, Genetic*
  • Proteins / genetics*


Substances

  • Codon
  • Proteins