Gene prediction with a hidden Markov model and a new intron submodel

Bioinformatics. 2003 Oct;19 Suppl 2:ii215-25. doi: 10.1093/bioinformatics/btg1080.


Motivation: The problem of finding the genes in eukaryotic DNA sequences by computational methods is still not satisfactorily solved. Gene-finding programs have achieved relatively high accuracy on short genomic sequences but do not perform well on longer sequences that contain an unknown number of genes. On such sequences, existing programs tend to predict many false exons.

Results: We have developed a new program, AUGUSTUS, for the ab initio prediction of protein-coding genes in eukaryotic genomes. The program is based on a hidden Markov model and integrates a number of known methods and submodels. It employs a new way of modeling intron lengths. We use a new donor splice site model and a new model, which takes the reading frame into account, for a short region directly upstream of the donor splice site model, and we apply a method that allows better GC-content-dependent parameter estimation. On longer sequences, AUGUSTUS accurately predicts far more human and Drosophila genes than the ab initio gene prediction programs we compared it with, while at the same time being more specific.
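
To illustrate the general idea of labeling a DNA sequence with a hidden Markov model, the following is a minimal sketch of Viterbi decoding over two states. It is not the AUGUSTUS model: the state names, the probabilities, and the function name viterbi are hypothetical and hand-picked for illustration, and the sketch has no splice site, intron length, or reading frame submodels.

```python
import math

# Hypothetical two-state toy model; all numbers are illustrative assumptions,
# not parameters from AUGUSTUS or from any trained gene finder.
STATES = ("intergenic", "coding")
START = {"intergenic": 0.9, "coding": 0.1}
TRANS = {
    "intergenic": {"intergenic": 0.9, "coding": 0.1},
    "coding": {"intergenic": 0.1, "coding": 0.9},
}
EMIT = {
    "intergenic": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
    "coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
}


def viterbi(seq):
    """Return the most probable state path for seq, computed in log space."""
    if not seq:
        return []
    # Initialise with the start distribution and the first emission.
    scores = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for ending in state s at this position.
            best_prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (
                scores[best_prev]
                + math.log(TRANS[best_prev][s])
                + math.log(EMIT[s][base])
            )
            pointers[s] = best_prev
        backpointers.append(pointers)
        scores = new_scores
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    print(viterbi("ATATATGCGCGCGCATATAT"))
```

In a toy model like this, the length of time spent in a state is implicitly geometric, because each state is left with a fixed probability at every position. That implicit assumption is one reason gene finders pay separate attention to how lengths, such as intron lengths, are modeled.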

Availability: A web interface for AUGUSTUS and the executable program are located at

MeSH terms

  • Algorithms*
  • Animals
  • Artificial Intelligence
  • Base Sequence
  • Computer Simulation
  • Humans
  • Introns / genetics*
  • Markov Chains
  • Models, Genetic*
  • Molecular Sequence Data
  • Pattern Recognition, Automated / methods*
  • RNA Splice Sites / genetics*
  • Sequence Analysis, DNA / methods*

