Recognition of protein coding genes in the yeast genome at better than 95% accuracy based on the Z curve

Nucleic Acids Res. 2000 Jul 15;28(14):2804-14. doi: 10.1093/nar/28.14.2804.

Abstract

The Z curve is a three-dimensional space curve that uniquely represents a given DNA sequence, in the sense that the curve and the sequence can each be reconstructed from the other. Based on the Z curve, a new algorithm for finding protein coding genes, specific to the yeast genome and better than 95% accurate, is proposed. Six cross-validation tests were performed to confirm this accuracy. Using the new algorithm, the number of protein coding genes in the yeast genome is re-estimated. The estimate rests on the assumption that the unknown genes have statistical properties similar to those of the known genes. The number of protein coding genes in the 16 yeast chromosomes is found to be ≤5645, significantly smaller than the widely accepted 5800-6000 and much larger than the 4800 recently estimated by another group. Mitochondrial genes are not included in this estimate. A codingness index called the YZ score (YZ ∈ [0,1]) is proposed to recognize protein coding genes in the yeast genome. An ORF is classified as coding if YZ > 0.5 and as non-coding if YZ < 0.5. Among the ORFs annotated in the MIPS (Munich Information Centre for Protein Sequences) database, those recognized as non-coding by the present algorithm are listed in detail in this paper. The YZ scores for all the ORFs annotated in the MIPS database have been calculated and are available from the corresponding author on request by e-mail.
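The one-to-one correspondence between a DNA sequence and its Z curve can be illustrated with a minimal Python sketch. The abstract does not state the coordinate formulas, so the sketch assumes the standard Z curve definition from the wider Z curve literature: x counts purines (A, G) minus pyrimidines (C, T), y counts amino bases (A, C) minus keto bases (G, T), and z counts weakly hydrogen-bonded bases (A, T) minus strongly bonded ones (C, G). The function names `z_curve` and `sequence_from_z_curve` are hypothetical, and the sketch does not implement the gene-finding algorithm or the YZ score.

```python
# Per-base step (dx, dy, dz), assuming the standard Z curve definition:
#   x: purine (A, G) vs pyrimidine (C, T)
#   y: amino (A, C) vs keto (G, T)
#   z: weak H-bond (A, T) vs strong H-bond (C, G)
STEPS = {
    "A": (1, 1, 1),
    "C": (-1, 1, -1),
    "G": (1, -1, -1),
    "T": (-1, -1, 1),
}
# Inverse lookup: each step vector identifies exactly one base.
BASES = {step: base for base, step in STEPS.items()}


def z_curve(seq):
    """Return the cumulative Z curve points (x_n, y_n, z_n) for n = 1..N."""
    points = []
    x = y = z = 0
    for base in seq.upper():
        dx, dy, dz = STEPS[base]  # raises KeyError on non-ACGT symbols
        x, y, z = x + dx, y + dy, z + dz
        points.append((x, y, z))
    return points


def sequence_from_z_curve(points):
    """Rebuild the DNA sequence from consecutive differences of the curve."""
    bases = []
    previous = (0, 0, 0)  # the curve starts at the origin
    for point in points:
        step = tuple(c - p for c, p in zip(point, previous))
        bases.append(BASES[step])
        previous = point
    return "".join(bases)


if __name__ == "__main__":
    curve = z_curve("ATGGCA")
    print(curve)
    print(sequence_from_z_curve(curve))
```

Because the four bases map to four distinct step vectors, the differences between consecutive curve points recover the sequence exactly, which is the sense in which each can be reconstructed from the other.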

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms*
  • DNA, Fungal
  • Databases as Topic
  • Fungal Proteins / genetics*
  • Genes, Fungal / genetics*
  • Genome, Fungal*
  • Open Reading Frames
  • Reproducibility of Results
  • Saccharomyces cerevisiae / genetics*

Substances

  • DNA, Fungal
  • Fungal Proteins