Integration of mapped RNA-Seq reads into automatic training of eukaryotic gene finding algorithm

Nucleic Acids Res. 2014 Sep;42(15):e119. doi: 10.1093/nar/gku557. Epub 2014 Jul 2.

Abstract

We present a new approach to automatic training of a eukaryotic ab initio gene finding algorithm. With the advent of Next-Generation Sequencing, automatic training has become paramount, allowing genome annotation pipelines to keep pace with the speed of genome sequencing. Earlier we developed GeneMark-ES, currently the only gene finding algorithm for eukaryotic genomes that performs automatic training in unsupervised ab initio mode. The new algorithm, GeneMark-ET, augments GeneMark-ES with a novel method that integrates RNA-Seq read alignments into the self-training procedure. Use of 'assembled' RNA-Seq transcripts is far from trivial; recent assessments revealed a significant assembly error rate. We demonstrated in computational experiments that the proposed method of incorporating 'unassembled' RNA-Seq reads improves the accuracy of gene prediction; in particular, for the 1.3 Gb genome of Aedes aegypti, the mean value of prediction Sensitivity and Specificity at the gene level increased over GeneMark-ES by 24.5%. In the current surge of genomic data, when the need for accurate sequence annotation is higher than ever, GeneMark-ET will be a valuable addition to the narrow arsenal of automatic gene prediction tools.
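The accuracy figure quoted in the abstract is the mean of gene-level Sensitivity and Specificity. The sketch below illustrates that arithmetic under an assumed convention common in gene-finding evaluations: a predicted gene counts as correct only if its whole exon structure matches an annotated gene exactly, Sensitivity is the fraction of annotated genes that are recovered, and Specificity is the fraction of predicted genes that are correct. The function name, the tuple representation of a gene, and the toy coordinates are hypothetical and are not taken from the paper or from the GeneMark software.

```python
def gene_level_accuracy(predicted, annotated):
    """Return (sensitivity, specificity, mean) at the gene level.

    Each gene is a hashable tuple (seqid, strand, exons), where exons is a
    tuple of (start, end) pairs. This representation is an assumption made
    for illustration only.
    """
    predicted = set(predicted)
    annotated = set(annotated)

    # A true positive is a predicted gene identical to an annotated gene.
    true_positives = len(predicted & annotated)

    # Sensitivity: share of annotated genes that were predicted exactly.
    sensitivity = true_positives / len(annotated) if annotated else 0.0
    # Specificity (as assumed here): share of predictions that are exact.
    specificity = true_positives / len(predicted) if predicted else 0.0

    return sensitivity, specificity, (sensitivity + specificity) / 2


# Toy data, invented for illustration: 4 annotated genes, 5 predictions,
# 3 of which match an annotated gene exactly.
annotated = [
    ("chr1", "+", ((100, 200), (300, 400))),
    ("chr1", "-", ((1000, 1500),)),
    ("chr2", "+", ((50, 120), (180, 260), (300, 390))),
    ("chr2", "-", ((2000, 2100), (2200, 2350))),
]
predicted = [
    ("chr1", "+", ((100, 200), (300, 400))),             # exact match
    ("chr1", "-", ((1000, 1500),)),                      # exact match
    ("chr2", "+", ((50, 120), (180, 260), (300, 390))),  # exact match
    ("chr2", "-", ((2000, 2100), (2210, 2350))),         # one exon boundary off
    ("chr3", "+", ((10, 90),)),                          # no annotated counterpart
]

sn, sp, mean_accuracy = gene_level_accuracy(predicted, annotated)
print(f"Sn={sn:.3f} Sp={sp:.3f} mean={mean_accuracy:.3f}")
```

Because matching is exact at the level of the whole gene, a single misplaced exon boundary costs both a missed annotated gene and a false prediction, which is why gene-level figures are typically much lower than exon-level or nucleotide-level ones.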

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Algorithms*
  • Animals
  • Culicidae / genetics
  • Drosophila melanogaster / genetics
  • Gene Expression Profiling
  • Genes*
  • Genes, Insect
  • High-Throughput Nucleotide Sequencing / methods*
  • Sequence Alignment / methods*
  • Sequence Analysis, RNA / methods*