JIGSAW: integration of multiple sources of evidence for gene prediction

Bioinformatics. 2005 Sep 15;21(18):3596-603. doi: 10.1093/bioinformatics/bti609. Epub 2005 Aug 2.

Abstract

Motivation: Computational gene finding systems play an important role in identifying new human genes, although no system is yet accurate enough to predict all or even most protein-coding regions perfectly. Ab initio programs can be augmented by evidence such as expression data or protein sequence homology, which improves their performance. The amount of such evidence continues to grow, but computational methods still have difficulty predicting genes when the evidence is conflicting or incomplete. Genome annotation pipelines collect a variety of types of evidence about gene structure and synthesize the results, which can then be refined further through manual, expert curation of gene models.

Results: JIGSAW is a new gene finding system designed to automate the process of predicting gene structure from multiple sources of evidence, with results that often match the performance of human curators. JIGSAW computes the relative weight of different lines of evidence using statistics generated from a training set, and then combines the evidence using dynamic programming. Our results show that JIGSAW's performance is superior to ab initio gene finding methods and to other pipelines such as Ensembl. Even without evidence from alignment to known genes, JIGSAW can substantially improve gene prediction accuracy as compared with existing methods.
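The two steps named above (weighting each line of evidence from training-set statistics, then combining the evidence by dynamic programming) can be illustrated with a toy sketch. This is not JIGSAW's actual algorithm or code: the two-state model (C for coding, N for noncoding), the per-base labels, the function names `train_weights` and `combine`, the `pseudocount` and `switch_penalty` parameters, and the example sequences are all assumptions made for illustration.

```python
from math import log

STATES = "CN"  # hypothetical two-state model: C = coding, N = noncoding


def train_weights(predictions, truth, pseudocount=1.0):
    """Estimate P(source label | true label) for each evidence source.

    predictions: dict mapping source name -> string of per-base labels.
    truth: string of curated per-base labels of the same length.
    """
    weights = {}
    for name, pred in predictions.items():
        # Smoothed counts of (true label, predicted label) pairs.
        counts = {(t, p): pseudocount for t in STATES for p in STATES}
        for t, p in zip(truth, pred):
            counts[(t, p)] += 1
        weights[name] = {
            (t, p): counts[(t, p)] / sum(counts[(t, q)] for q in STATES)
            for t in STATES
            for p in STATES
        }
    return weights


def combine(predictions, weights, switch_penalty=2.0):
    """Viterbi-style dynamic programming over per-base labels.

    Each position scores a state by the summed log-weights of what every
    source reports there; changing state between positions costs
    switch_penalty (an assumed, untrained constant).
    """
    length = len(next(iter(predictions.values())))
    score = {s: 0.0 for s in STATES}
    back = []
    for i in range(length):
        new_score, pointers = {}, {}
        for s in STATES:
            emit = sum(
                log(weights[name][(s, pred[i])])
                for name, pred in predictions.items()
            )
            # Best previous state, accounting for the cost of switching.
            prev = max(
                STATES,
                key=lambda r: score[r] - (switch_penalty if r != s else 0.0),
            )
            cost = switch_penalty if prev != s else 0.0
            new_score[s] = score[prev] - cost + emit
            pointers[s] = prev
        back.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = []
    for pointers in reversed(back):
        path.append(state)
        state = pointers[state]
    return "".join(reversed(path))


# Made-up training data: source "A" agrees with the curated labels,
# source "B" makes three errors.
truth = "NNNNCCCCCCNNNN"
training = {"A": "NNNNCCCCCCNNNN", "B": "NNCNCCNCCCNNNC"}
weights = train_weights(training, truth)

# Made-up conflicting evidence on a new region.
target = {"A": "NNCCCNN", "B": "NNCNCNN"}
consensus = combine(target, weights)
```

In this toy setup the more reliable source receives the larger weight, so where the two sources disagree the combined labeling follows it. A realistic system would model exons, introns, splice sites, and reading frame rather than two per-base labels.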

Availability: JIGSAW is available as an open source software package at http://cbcb.umd.edu/software/jigsaw.

Publication types

  • Research Support, N.I.H., Extramural
  • Research Support, U.S. Gov't, P.H.S.

MeSH terms

  • Algorithms
  • Animals
  • Codon
  • Computational Biology / instrumentation*
  • Computational Biology / methods*
  • DNA, Complementary / metabolism
  • Databases, Factual
  • Databases, Genetic
  • Gene Expression Profiling
  • Genes, Fungal
  • Genes, Plant
  • Genome, Human*
  • Humans
  • Introns
  • Markov Chains
  • Models, Genetic
  • Models, Statistical
  • Open Reading Frames
  • Proteins / chemistry
  • Sequence Alignment
  • Sequence Analysis, DNA
  • Sequence Analysis, Protein
  • Software
  • Software Validation

Substances

  • Codon
  • DNA, Complementary
  • Proteins