Sequence-based heuristics for faster annotation of non-coding RNA families

Bioinformatics. 2006 Jan 1;22(1):35-9. doi: 10.1093/bioinformatics/bti743. Epub 2005 Nov 2.

Abstract

Motivation: Non-coding RNAs (ncRNAs) are functional RNA molecules that do not code for proteins. Covariance Models (CMs) are a useful statistical tool to find new members of an ncRNA gene family in a large genome database, using both sequence and, importantly, RNA secondary structure information. Unfortunately, CM searches are extremely slow. Previously, we created rigorous filters, which provably sacrifice none of a CM's accuracy, while making searches significantly faster for virtually all ncRNA families. However, searches with these rigorous filters are still slower than they could be with heuristic filters.

Results: In this paper we introduce profile HMM-based heuristic filters. We show that their accuracy is usually superior to heuristics based on BLAST. Moreover, we compared our heuristics with those used in tRNAscan-SE, whose heuristics incorporate a significant amount of work specific to tRNAs, whereas our heuristics are generic to any ncRNA. Performance was roughly comparable, so we expect that our heuristics provide a high-quality solution that, unlike family-specific solutions, can scale to hundreds of ncRNA families.
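
The general idea of heuristic filtering can be illustrated with a minimal sketch: a cheap sequence-only score is computed for every window, and the expensive scorer runs only on windows that pass a threshold. This is not the paper's implementation. The ungapped position-specific profile below is a simplified stand-in for a profile HMM (it has no insert or delete states), and the names PROFILE, filter_score, heuristic_scan, and expensive_score, as well as all probabilities, are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical toy profile: per-column nucleotide probabilities.
# These values are assumptions for illustration, not taken from the paper.
PROFILE = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "U": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "U": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "U": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "U": 0.7},
]
BACKGROUND = 0.25  # assumed uniform background nucleotide frequency


def filter_score(window):
    """Log-odds score (in bits) of a window under the ungapped profile.

    Characters missing from a column (e.g. 'N') contribute zero bits.
    """
    return sum(
        math.log2(col.get(base, BACKGROUND) / BACKGROUND)
        for col, base in zip(PROFILE, window)
    )


def heuristic_scan(sequence, threshold, expensive_score):
    """Run the expensive scorer only on windows that pass the cheap filter.

    `expensive_score` stands in for a slow, structure-aware scorer such as
    a covariance model; here it is any callable taking a window string.
    Returns a list of (start_position, expensive_score) pairs.
    """
    width = len(PROFILE)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if filter_score(window) >= threshold:
            hits.append((start, expensive_score(window)))
    return hits
```

Because the filter is heuristic, raising the threshold makes the scan faster but may discard true family members; a rigorous filter, by contrast, would be constructed so that no window the expensive scorer would accept is ever discarded.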

Availability: The source code is available under the GNU General Public License at the supplementary web site.

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Algorithms
  • Computational Biology / methods*
  • Genome
  • Humans
  • Markov Chains
  • Models, Statistical
  • Nucleic Acid Conformation
  • Protein Structure, Secondary
  • Proteins / chemistry
  • RNA / chemistry
  • RNA, Transfer / chemistry
  • RNA, Untranslated / chemistry*
  • ROC Curve
  • Sensitivity and Specificity
  • Sequence Alignment / methods*
  • Software

Substances

  • Proteins
  • RNA, Untranslated
  • RNA
  • RNA, Transfer