The Ensembl automatic gene annotation system

Genome Res. 2004 May;14(5):942-50. doi: 10.1101/gr.1858004.


As more genomes are sequenced, there is an increasing need for automated first-pass annotation that allows timely access to important genomic information. The Ensembl gene-building system enables fast automated annotation of eukaryotic genomes. It annotates genes based on evidence derived from known protein, cDNA, and EST sequences. The gene-building system rests on top of the core Ensembl (MySQL) database schema and Perl Application Programming Interface (API), and the data generated are accessible through the Ensembl genome browser. To date, the Ensembl predicted gene sets are available for the A. gambiae, C. briggsae, zebrafish, mouse, rat, and human genomes and have been heavily relied upon in the publication of the human, mouse, rat, and A. gambiae genome sequence analysis. Here we describe in detail the gene-building system and the algorithms involved. All code and data are freely available.

Publication types

  • Comparative Study
  • Research Support, Non-U.S. Gov't
  • Research Support, U.S. Gov't, P.H.S.

MeSH terms

  • Animals
  • Anopheles / genetics
  • Automation*
  • Caenorhabditis / genetics
  • Computational Biology / methods*
  • DNA / genetics
  • DNA, Helminth / genetics
  • Expressed Sequence Tags
  • Gene Dosage
  • Genes / physiology*
  • Genes, Helminth / physiology
  • Genes, Insect / physiology
  • Genome
  • Genome, Human
  • Helminth Proteins / genetics
  • Humans
  • Insect Proteins / genetics
  • Mice
  • Predictive Value of Tests
  • Proteins / genetics
  • Pseudogenes / genetics
  • Rats
  • Sequence Alignment / methods
  • Sequence Homology, Amino Acid
  • Software
  • Tandem Repeat Sequences / genetics
  • Untranslated Regions / genetics

Substances

  • DNA, Helminth
  • Helminth Proteins
  • Insect Proteins
  • Proteins
  • Untranslated Regions
  • DNA