Recognizing shorter coding regions of human genes based on the statistics of stop codons

Biopolymers. 2002 Mar;63(3):207-16. doi: 10.1002/bip.10054.


With the rapid progress of the Human Genome Project, a large number of uncharacterized DNA sequences need to be annotated by better algorithms. Recognizing shorter coding sequences of human genes is one of the most important problems in gene recognition, and it is not yet completely solved. This paper addresses the problem with a new method. The distributions of the three stop codons (TAA, TAG, and TGA) in the three phases along coding, noncoding, and intergenic sequences are studied in detail. Using the obtained distributions and other coding measures, a new algorithm for recognizing shorter coding sequences of human genes is developed. The accuracy of the algorithm is tested on a larger database of human genes. The average accuracy is as high as 92.1% for sequences 192 base pairs long, which is confirmed by sixfold cross-validation tests. It is hoped that combining the present method with existing algorithms will increase the accuracy of identifying human genes in unannotated sequences.
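
To make the idea of stop-codon statistics "in three phases" concrete, the following Python sketch counts TAA, TAG, and TGA separately in each of the three reading phases of a sequence. It is only an illustration of the kind of feature the abstract describes, not the authors' algorithm: the function names `stop_codon_counts_by_phase` and `stop_free_phases` are hypothetical, and the abstract does not specify how the counts are combined with the other coding measures.

```python
STOP_CODONS = ("TAA", "TAG", "TGA")


def stop_codon_counts_by_phase(seq):
    """Count each stop codon in each of the three phases (0, 1, 2) of seq.

    Hypothetical helper for illustration; not taken from the paper.
    """
    seq = seq.upper()
    counts = {phase: {codon: 0 for codon in STOP_CODONS} for phase in range(3)}
    for phase in range(3):
        # Step through non-overlapping triplets starting at this phase offset.
        for i in range(phase, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in counts[phase]:
                counts[phase][codon] += 1
    return counts


def stop_free_phases(seq):
    """Return the phases that contain no stop codon at all.

    A window lying entirely inside a coding exon has no stop codon in its
    coding phase, so a window with no stop-free phase is unlikely to be coding.
    """
    counts = stop_codon_counts_by_phase(seq)
    return [phase for phase in range(3) if sum(counts[phase].values()) == 0]


if __name__ == "__main__":
    window = "ATGAAATAGCCC"  # toy sequence, far shorter than a 192 bp window
    print(stop_codon_counts_by_phase(window))
    print(stop_free_phases(window))
```

For the toy sequence above, phase 0 contains one TAG, phase 1 contains one TGA, and phase 2 is free of stop codons. In practice such counts would be computed over windows such as the 192 base pair sequences mentioned in the abstract and used as one input among several.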

Publication types

  • Comparative Study
  • Evaluation Study
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms*
  • Base Sequence
  • Codon, Terminator*
  • DNA / chemistry
  • DNA / genetics
  • DNA, Intergenic
  • Databases, Genetic
  • Fourier Analysis
  • Genome, Human*
  • Humans
  • Introns
  • Markov Chains
  • Mathematics
  • Pseudogenes
  • Purines / chemistry
  • Reproducibility of Results
  • Sensitivity and Specificity
  • Statistics as Topic


Substances

  • Codon, Terminator
  • DNA, Intergenic
  • Purines
  • DNA