Red: an intelligent, rapid, accurate tool for detecting repeats de-novo on the genomic scale

BMC Bioinformatics. 2015 Jul 24:16:227. doi: 10.1186/s12859-015-0654-5.

Abstract

Background: With rapid advancements in technology, the sequences of thousands of species' genomes are becoming available. Within the sequences are repeats that comprise significant portions of genomes. Successful annotations thus require accurate discovery of repeats. As species-specific elements, repeats in newly sequenced genomes are likely to be unknown. Therefore, annotating newly sequenced genomes requires tools to discover repeats de-novo. However, the currently available de-novo tools have limitations concerning the size of the input sequence, ease of use, sensitivities to major types of repeats, consistency of performance, speed, and false positive rate.

Results: To address these limitations, I designed and developed Red, applying Machine Learning. Red is the first repeat-detection tool capable of labeling its training data and training itself automatically on an entire genome. Red is easy to install and use. It is sensitive to both transposons and simple repeats; in contrast, available tools such as RepeatScout and ReCon are sensitive to transposons, and WindowMasker to simple repeats. Red performed consistently well on seven genomes; the other tools performed well only on some genomes. Red is much faster than RepeatScout and ReCon and has a much lower false positive rate than WindowMasker. On human genes with five or more copies, Red was more specific than RepeatScout by a wide margin. When tested on genomes of unusual nucleotide compositions, Red located repeats with high sensitivities and maintained moderate false positive rates. Red outperformed the related tools on a bacterial genome. Red identified 46,405 novel repetitive segments in the human genome. Finally, Red is capable of processing assembled and unassembled genomes.

Conclusions: Red's innovative methodology and its excellent performance on seven different genomes represent a valuable advancement in the field of repeats discovery.
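
As a rough illustration of the ideas the abstract refers to (a tool labeling its own training data from the genome, then being judged by sensitivity and false positive rate), here is a minimal toy sketch. It is not Red's implementation: the abstract does not describe the algorithm, so the k-mer counting heuristic, the function names (`kmer_counts`, `label_candidates`, `sensitivity`, `false_positive_rate`), and the parameters `k` and `min_count` are all assumptions made for this example. The per-base definitions of the two metrics are likewise assumed and may differ from those used in the paper.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def label_candidates(seq, k=2, min_count=3):
    """Toy self-labeling step (hypothetical, not Red's method).

    Returns one boolean per base: True if the base is covered by at
    least one k-mer that occurs min_count or more times in seq.
    """
    counts = kmer_counts(seq, k)
    labels = [False] * len(seq)
    for i in range(len(seq) - k + 1):
        if counts[seq[i:i + k]] >= min_count:
            for j in range(i, i + k):
                labels[j] = True
    return labels


def sensitivity(predicted, truth):
    """Fraction of truly repetitive bases that were labeled repetitive."""
    true_pos = sum(p and t for p, t in zip(predicted, truth))
    positives = sum(truth)
    return true_pos / positives if positives else 0.0


def false_positive_rate(predicted, truth):
    """Fraction of non-repetitive bases that were labeled repetitive."""
    false_pos = sum(p and not t for p, t in zip(predicted, truth))
    negatives = sum(not t for t in truth)
    return false_pos / negatives if negatives else 0.0


if __name__ == "__main__":
    # A simple repeat (ACACACAC) followed by unique sequence (assumed data).
    genome = "ACACACAC" + "GTTGCA"
    truth = [True] * 8 + [False] * 6
    predicted = label_candidates(genome, k=2, min_count=3)
    print("sensitivity:", sensitivity(predicted, truth))
    print("false positive rate:", false_positive_rate(predicted, truth))
```

In this toy input the trailing `CA` matches a frequent 2-mer by chance, so two unique bases are labeled repetitive. That is the kind of false positive a purely count-based labeling produces, and it is one reason a real tool would need more than raw counts.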

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • DNA / chemistry
  • DNA / metabolism
  • Genome, Bacterial
  • Genome, Human
  • Genome, Plant
  • Genomics / methods*
  • Humans
  • Interspersed Repetitive Sequences / genetics
  • Markov Chains
  • Plants / genetics
  • Software*

Substances

  • DNA