Nat Methods. 2013 Dec;10(12):1185-91.
doi: 10.1038/nmeth.2722. Epub 2013 Nov 3.

Systematic Evaluation of Spliced Alignment Programs for RNA-seq Data

Free PMC article

Pär G Engström et al. Nat Methods. 2013.

Abstract

High-throughput RNA sequencing is an increasingly accessible method for studying gene structure and activity on a genome-wide scale. A critical step in RNA-seq data analysis is the alignment of partial transcript reads to a reference genome sequence. To assess the performance of current mapping software, we invited developers of RNA-seq aligners to process four large human and mouse RNA-seq data sets. In total, we compared 26 mapping protocols based on 11 programs and pipelines and found major performance differences between methods on numerous benchmarks, including alignment yield, basewise accuracy, mismatch and gap placement, exon junction discovery and suitability of alignments for transcript reconstruction. We observed concordant results on real and simulated RNA-seq data, confirming the relevance of the metrics employed. Future developments in RNA-seq alignment methods would benefit from improved placement of multimapped reads, balanced utilization of existing gene annotation and a reduced false discovery rate for splice junctions.

Conflict of interest statement

The authors declare no competing financial interests.

Figures

Figure 1. Alignment yield.
Shown is the percentage of sequenced or simulated read pairs (fragments) mapped by each protocol. Protocols are grouped by the underlying alignment program (gray shading). Protocol names contain the suffix “ann” if annotation was used. The suffix “cons” distinguishes more conservative protocols from others based on the same aligner. The K562 data set comprises six samples, and the metrics presented here were averaged over them.
Figure 2. Mismatch and truncation frequencies.
(a) Percentage of sequenced reads mapped with the indicated number of mismatches. (b) Percentage of sequenced reads truncated at either or both ends. Bar colors indicate the number of bases removed.
Figure 3. Read placement accuracy for simulated spliced reads.
Figure 4. Indel frequency and accuracy.
(a) Bars show the size distribution of indels for the human K562 data set. Indel frequencies are tabulated (number of indels per 1,000 sequenced reads). (b) Precision and recall, stratified by indel size, for human simulated data set 1.
Figure 5. Spliced alignment performance.
(a) Frequency and accuracy of splices in primary alignments. Splice frequency was defined as the number of reported splices divided by the number of sequenced reads. For simulated data (center and right), splice recall and false discovery rate (FDR) are presented. Insets show details of the dense upper-left areas (gray rectangles). (b) Number of annotated and novel junctions reported at different thresholds for the number of supporting mappings. In the rightmost plot, filled symbols depict the number of junctions with at least one supporting mapping, and lines demonstrate the result of thresholding. (c) Junction discovery accuracy for simulated data set 1 (top) and 2 (bottom). Counts of true and false junctions were computed at increasing thresholds for the number of supporting mappings, and results were depicted as in b to obtain receiver operating characteristic–like curves. Gray horizontal lines indicate the number of junctions supported by true simulated alignments. (d) Accuracy for the subset of junctions contained in the Ensembl annotation. (e) Accuracy for junctions absent from the Ensembl annotation.
Figure 6. Aligner influence on transcript assembly.
(a,b) Cufflinks performance was assessed by measuring precision and recall for individual exons (a) and spliced transcripts (b). For K562 data, precision was defined as the fraction of predicted exons matching Ensembl annotation, and recall as the fraction of annotated protein-coding gene exons that were predicted.
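
The set-based metrics named in the captions for Figures 5 and 6 (precision, recall, FDR, and thresholding junctions by the number of supporting mappings) can be illustrated with a minimal Python sketch. This is not the authors' evaluation code. The junction representation as (chromosome, intron start, intron end) tuples, the function names, and the toy data are all assumptions made for illustration, and FDR is taken in its usual sense as the fraction of reported items that are false.

```python
from collections import Counter


def junction_support(spliced_alignments):
    """Count supporting mappings per junction.

    Each alignment is given as a list of (chrom, intron_start, intron_end)
    tuples; this representation is hypothetical, chosen for the sketch.
    """
    support = Counter()
    for junctions in spliced_alignments:
        for junction in set(junctions):  # one vote per alignment per junction
            support[junction] += 1
    return support


def precision_recall(predicted, reference):
    """Precision and recall of a predicted set against a reference set."""
    predicted, reference = set(predicted), set(reference)
    matched = len(predicted & reference)
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(reference) if reference else 0.0
    return precision, recall


def junction_accuracy_curve(support, true_junctions, max_threshold):
    """Tabulate true/false junction counts at increasing support thresholds.

    Returns one (threshold, n_true, n_false, recall, fdr) row per threshold.
    """
    true_junctions = set(true_junctions)
    rows = []
    for threshold in range(1, max_threshold + 1):
        called = {j for j, n in support.items() if n >= threshold}
        n_true = len(called & true_junctions)
        n_false = len(called) - n_true
        recall = n_true / len(true_junctions) if true_junctions else 0.0
        fdr = n_false / len(called) if called else 0.0
        rows.append((threshold, n_true, n_false, recall, fdr))
    return rows


# Toy data (invented): four spliced alignments and three simulated junctions.
alignments = [
    [("chr1", 100, 200)],
    [("chr1", 100, 200)],
    [("chr1", 100, 200), ("chr1", 300, 400)],
    [("chr2", 50, 90)],
]
truth = {("chr1", 100, 200), ("chr1", 300, 400), ("chr1", 500, 600)}

support = junction_support(alignments)
curve = junction_accuracy_curve(support, truth, max_threshold=3)
for row in curve:
    print(row)

# Exon-level precision and recall with hypothetical exon identifiers.
exon_precision, exon_recall = precision_recall(
    {"e1", "e2", "e3"}, {"e2", "e3", "e4", "e5"}
)
print(exon_precision, exon_recall)
```

In this toy case, raising the threshold from one to two supporting mappings removes the single false junction but also loses one true junction. This is the trade-off that the receiver operating characteristic–like curves in Figure 5c trace.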
