Using a Euclid distance discriminant method to find protein coding genes in the yeast genome

Comput Chem. 2002 Feb;26(3):195-206. doi: 10.1016/s0097-8485(01)00107-3.


The Euclidean distance discriminant method is used to find protein-coding genes in the yeast genome, based on the single-nucleotide frequencies at the three codon positions of the ORFs. The method is extremely simple and may be extended to find genes in prokaryotic genomes or in eukaryotic genomes with fewer introns. Six-fold cross-validation tests demonstrate that the accuracy of the algorithm is better than 93%. On this basis, the total number of protein-coding genes in the yeast genome is estimated to be at most 5579, about 3.8-7.0% fewer than the currently widely accepted 5800-6000. The base compositions at the three codon positions are analyzed in detail using a graphic method. The results show that the codons preferred by yeast genes are of the RG̅W type, where R, G̅ and W denote purine, non-G and A/T bases, respectively, whereas the 'codons' in the intergenic sequences are of the form NNN, where N denotes any base. This difference forms the basis of the algorithm for distinguishing coding from non-coding ORFs in the yeast genome. The names of the putative non-coding ORFs are listed in detail.
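
The idea of the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact decision rule, so the sketch assumes a nearest-centroid rule, in which an ORF is called coding when its 12-component frequency vector (A, C, G, T at each of the three codon positions) lies closer, in Euclidean distance, to the mean vector of known coding ORFs than to the mean vector of non-coding sequences. The function names and the toy sequences are hypothetical.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

BASES = "ACGT"

def position_frequencies(orf):
    """Return 12 numbers: frequencies of A, C, G, T at codon positions 1, 2, 3."""
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3          # ignore a trailing partial codon
    counts = [[0] * 4 for _ in range(3)]      # counts[position][base]
    for i in range(usable):
        if orf[i] in BASES:                   # skip ambiguous symbols
            counts[i % 3][BASES.index(orf[i])] += 1
    freqs = []
    for row in counts:
        total = sum(row) or 1                 # avoid division by zero
        freqs.extend(c / total for c in row)
    return freqs

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def is_coding(orf, coding_centroid, noncoding_centroid):
    """Assumed rule: coding if nearer to the coding centroid than the non-coding one."""
    v = position_frequencies(orf)
    return dist(v, coding_centroid) < dist(v, noncoding_centroid)

# Toy training data, made up for illustration only (not from the paper).
coding_train = ["GCAGATGAAGCT", "ACAGATAAAGAA"]
noncoding_train = ["CCGTTGCATCGG", "TGCCGTACGGTC"]

coding_c = centroid([position_frequencies(s) for s in coding_train])
noncoding_c = centroid([position_frequencies(s) for s in noncoding_train])
print(is_coding("GCTGAAGATACA", coding_c, noncoding_c))
```

In practice the two centroids would be estimated from large sets of experimentally confirmed genes and intergenic sequences, and accuracy would be checked by cross-validation, as the abstract describes.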

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms*
  • Databases, Nucleic Acid
  • Discriminant Analysis
  • Fungal Proteins / genetics*
  • Genes, Fungal
  • Genome, Fungal*
  • Open Reading Frames
  • Saccharomyces cerevisiae / genetics*
  • Sensitivity and Specificity


Substances

  • Fungal Proteins