Gene recognition from questionable ORFs in bacterial and archaeal genomes

J Biomol Struct Dyn. 2003 Aug;21(1):99-109. doi: 10.1080/07391102.2003.10506908.


The ORFs of microbial genomes in annotation files are usually classified into two groups: the first corresponds to known genes; whereas the second includes 'putative', 'probable', 'conserved hypothetical', 'hypothetical', 'unknown' and 'predicted' ORFs etc. Since the annotation is not 100% accurate, it is essential to confirm which ORF of the latter group is coding and which is not. Starting from known genes in the former, this paper describes an improved Z curve method to recognize genes in the latter. Ten-fold cross-validation tests show that the average accuracy of the algorithm is greater than 99% for recognizing the known genes in 57 bacterial and archaeal genomes. The method is then applied to recognize genes of the latter group. The likely non-coding ORFs in each of the 57 bacterial or archaeal genomes studied here are recognized and listed at the website The working mechanism of the algorithm has been discussed in details. A computer program, called ZCURVE_C, was written to calculate a coding score called Z-curve score for ORFs in the above 57 bacterial and archaeal genomes. Coding/non-coding is simply determined by the criterion of Z-curve score > 0/ Z-curve score < 0. A website has been set up to provide the service to calculate the Z-curve score. A user may submit the DNA sequence of an ORF to the server at, and the Z-curve score of the ORF is calculated and returned to the user immediately.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Base Composition
  • Base Sequence
  • Chromosomes, Bacterial / chemistry*
  • Chromosomes, Bacterial / genetics
  • Genes*
  • Genome, Archaeal*
  • Genome, Bacterial*
  • Internet
  • Open Reading Frames / genetics*
  • Reproducibility of Results
  • Software