Evaluation of alignment algorithms for discovery and identification of pathogens using RNA-Seq

PLoS One. 2013 Oct 30;8(10):e76935. doi: 10.1371/journal.pone.0076935. eCollection 2013.

Abstract

Next-generation sequencing technologies provide an unparalleled opportunity for the characterization and discovery of known and novel viruses. Because viruses are known to have the highest mutation rates when compared to eukaryotic and bacterial organisms, we assess the extent to which eleven well-known alignment algorithms (BLAST, BLAT, BWA, BWA-SW, BWA-MEM, BFAST, Bowtie2, Novoalign, GSNAP, SHRiMP2 and STAR) can be used for characterizing mutated and non-mutated viral sequences, including those that exhibit RNA splicing, in transcriptome samples. To evaluate aligners objectively, we developed a realistic RNA-Seq simulation and evaluation framework (RiSER) and propose a new combined score to rank aligners for viral characterization in terms of their precision, sensitivity and alignment accuracy. We used RiSER to simulate both human and viral read sequences and suggest the best set of aligners for viral sequence characterization in human transcriptome samples. Our results show that significant and substantial differences exist between aligners and that a digital-subtraction-based viral identification framework can and should use different aligners for different parts of the process. We determine the extent to which mutated viral sequences can be effectively characterized and show that more sensitive aligners such as BLAST, BFAST, SHRiMP2, BWA-SW and GSNAP can accurately characterize substantially divergent viral sequences with an overall sequence mutation rate of up to 15%. We believe that the results presented here will be useful to researchers choosing aligners for viral sequence characterization using next-generation sequencing data.
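The abstract names three quantities used to rank aligners (precision, sensitivity and alignment accuracy) but does not give the formula of the combined score. The following Python sketch is illustrative only: it computes the three quantities from simulated-read counts and then combines them with an unweighted mean, which is an assumption made here and not the score defined by RiSER. The class `AlignerCounts`, its field names, the definition of alignment accuracy as the fraction of true positives placed at the simulated position, and the toy entries `aligner_a` and `aligner_b` are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AlignerCounts:
    """Hypothetical per-aligner tallies from a simulation with known truth."""
    name: str
    true_positives: int     # viral reads aligned to the virus they came from
    false_positives: int    # reads aligned to a virus they did not come from
    false_negatives: int    # viral reads left unaligned or assigned elsewhere
    correct_positions: int  # true positives placed at the simulated position


def precision(c: AlignerCounts) -> float:
    called = c.true_positives + c.false_positives
    return c.true_positives / called if called else 0.0


def sensitivity(c: AlignerCounts) -> float:
    actual = c.true_positives + c.false_negatives
    return c.true_positives / actual if actual else 0.0


def alignment_accuracy(c: AlignerCounts) -> float:
    # ASSUMPTION: accuracy is measured only over correctly assigned reads.
    return c.correct_positions / c.true_positives if c.true_positives else 0.0


def combined_score(c: AlignerCounts) -> float:
    # ASSUMPTION: unweighted mean of the three quantities. The abstract
    # does not state how the paper actually combines them.
    return (precision(c) + sensitivity(c) + alignment_accuracy(c)) / 3.0


def rank_aligners(counts: List[AlignerCounts]) -> List[AlignerCounts]:
    """Return aligners ordered from highest to lowest combined score."""
    return sorted(counts, key=combined_score, reverse=True)


# Toy counts for two made-up aligners (not results from the study).
toy = [
    AlignerCounts("aligner_b", true_positives=50, false_positives=0,
                  false_negatives=50, correct_positions=50),
    AlignerCounts("aligner_a", true_positives=90, false_positives=10,
                  false_negatives=10, correct_positions=81),
]
ranking = [c.name for c in rank_aligners(toy)]
```

With these toy counts, `aligner_a` ranks first: it gives up some precision but recovers far more of the simulated viral reads. A real evaluation would replace the unweighted mean with the weighting defined in the paper.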

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms*
  • Computational Biology / methods*
  • Genes, Viral / genetics
  • Genome, Human / genetics
  • Genome, Viral / genetics
  • HIV-1 / genetics
  • Herpesvirus 1, Human / genetics
  • High-Throughput Nucleotide Sequencing / methods
  • Human papillomavirus 18 / genetics
  • Humans
  • Influenza A Virus, H5N1 Subtype / genetics
  • Internet
  • Mutation
  • Reproducibility of Results
  • Sequence Alignment / methods*
  • Sequence Analysis, RNA / methods*
  • Transcriptome / genetics
  • Viruses / genetics*

Grants and funding

The authors received support for their work from the Ontario Institute for Cancer Research (OICR) through funding provided by the government of Ontario. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.