Bioinformatics, 25 (14), 1754-60

Fast and Accurate Short Read Alignment With Burrows-Wheeler Transform

Heng Li et al. Bioinformatics.

Abstract

Motivation: The enormous amount of short reads generated by the new DNA sequencing technologies calls for the development of fast and accurate read alignment programs. A first generation of hash table-based methods has been developed, including MAQ, which is accurate, feature-rich and fast enough to align short reads from a single individual. However, MAQ does not support gapped alignment for single-end reads, which makes it unsuitable for aligning longer reads, where indels may occur frequently. The speed of MAQ is also a concern when the alignment is scaled up to the resequencing of hundreds of individuals.

Results: We implemented the Burrows-Wheeler Alignment tool (BWA), a new read alignment package based on backward search with the Burrows-Wheeler Transform (BWT), to efficiently align short sequencing reads against a large reference sequence such as the human genome, allowing mismatches and gaps. BWA supports both base space reads, e.g. from Illumina sequencing machines, and color space reads from AB SOLiD machines. Evaluations on both simulated and real data suggest that BWA is approximately 10-20x faster than MAQ, while achieving similar accuracy. In addition, BWA outputs alignments in the new standard SAM (Sequence Alignment/Map) format. Variant calling and other downstream analyses after the alignment can be performed with the open source SAMtools software package.

Availability: http://maq.sourceforge.net.

Figures

Fig. 1.
Prefix trie of string ‘GOOGOL’. Symbol ∧ marks the start of the string. The two numbers in a node give the SA interval of the string represented by the node (see Section 2.3). The dashed line shows the route of the brute-force search for a query string ‘LOL’, allowing at most one mismatch. Edge labels in squares mark the mismatches to the query in searching. The only hit is the bold node [1, 1] which represents string ‘GOL’.
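The search in Fig. 1 can be reproduced with a small sketch. Instead of building the prefix trie explicitly, the code below walks the same space of strings with backward search over the BWT, tracking one SA interval per step. The function names (build_index, search_mismatches), the lowercase strings, and the choice of [0, n-1] as the starting interval are assumptions of this sketch, not taken from the paper.

```python
def build_index(text):
    """Return (C, O, n) for text, which must end with '$'."""
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])  # naive suffix sort
    bwt = [text[i - 1] for i in sa]                # text[-1] is '$' when i == 0
    alphabet = sorted(set(text) - {"$"})
    # C[a]: number of symbols in text (excluding '$') that are smaller than a
    C = {a: sum(1 for ch in text[:-1] if ch < a) for a in alphabet}
    # O[a][i + 1]: occurrences of a in bwt[0..i]; O[a][0] plays the role of O(a, -1)
    O = {a: [0] for a in alphabet}
    for ch in bwt:
        for a in alphabet:
            O[a].append(O[a][-1] + (ch == a))
    return C, O, n


def search_mismatches(text, query, z):
    """SA intervals of substrings of text matching query with at most z mismatches."""
    C, O, n = build_index(text)
    hits = set()

    def recur(i, z, k, l):
        if z < 0:
            return                      # mismatch budget exhausted
        if i < 0:
            hits.add((k, l))            # whole query consumed
            return
        for b in C:
            k2 = C[b] + O[b][k] + 1     # O[b][k] is O(b, k - 1)
            l2 = C[b] + O[b][l + 1]     # O[b][l + 1] is O(b, l)
            if k2 <= l2:                # b extends at least one occurrence
                recur(i - 1, z - (b != query[i]), k2, l2)

    recur(len(query) - 1, z, 0, n - 1)  # start from the interval of the empty string
    return hits


print(search_mismatches("googol$", "lol", 1))  # {(1, 1)}
```

The single interval returned, [1, 1], is the one the caption gives for 'GOL'. The index here is built by sorting suffixes directly, which is only practical for toy inputs.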
Fig. 2.
Constructing suffix array and BWT string for X=googol$. String X is circulated to generate seven strings, which are then lexicographically sorted. After sorting, the positions of the first symbols form the suffix array (6, 3, 0, 5, 2, 4, 1) and the concatenation of the last symbols of the circulated strings gives the BWT string lo$oogg.
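The construction in Fig. 2 can be followed step by step in a few lines of Python. This is a minimal sketch that sorts all rotations explicitly, which is fine for a seven-symbol string but not how a genome-scale index would be built. The function name suffix_array_and_bwt is hypothetical.

```python
def suffix_array_and_bwt(x):
    """Build the suffix array and BWT string of x by sorting its rotations.

    Assumes x ends with '$' and that '$' sorts before every other symbol,
    which holds for ASCII letters.
    """
    n = len(x)
    # Rotation i starts at offset i of x; keep i next to the rotated string.
    rotations = sorted((x[i:] + x[:i], i) for i in range(n))
    suffix_array = [i for _, i in rotations]
    # The BWT string is the last column of the sorted rotations.
    bwt = "".join(rot[-1] for rot, _ in rotations)
    return suffix_array, bwt


sa, bwt = suffix_array_and_bwt("googol$")
print(sa)   # [6, 3, 0, 5, 2, 4, 1]
print(bwt)  # lo$oogg
```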
Fig. 3.
Algorithm for inexact search of SA intervals of substrings that match W. Reference X is $ terminated, while W is A/C/G/T terminated. Procedure InexactSearch(W, z) returns the SA intervals of substrings that match W with no more than z differences (mismatches or gaps); InexRecur(W, i, z, k, l) recursively calculates the SA intervals of substrings that match W[0, i] with no more than z differences on the condition that suffix W_{i+1} matches interval [k, l]. Lines starting with an asterisk handle insertions to and deletions from X, respectively. D(i) is the lower bound of the number of differences in string W[0, i].
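A rough Python sketch of a recursion with the shape described in this caption is given below. It is reconstructed from the caption alone, so details may differ from the published pseudocode. In particular, the helper names, the naive substring test used for D(i), and the starting interval [0, n-1] are assumptions of this sketch.

```python
def build_index(text):
    """Return (C, O, n) for text, which must end with '$'."""
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])
    bwt = [text[i - 1] for i in sa]
    alphabet = sorted(set(text) - {"$"})
    C = {a: sum(1 for ch in text[:-1] if ch < a) for a in alphabet}
    O = {a: [0] for a in alphabet}      # O[a][i + 1] is O(a, i)
    for ch in bwt:
        for a in alphabet:
            O[a].append(O[a][-1] + (ch == a))
    return C, O, n


def calculate_d(text, w):
    """D[i]: lower bound on the differences needed to match w[0..i]."""
    d, z, j = [], 0, 0
    for i in range(len(w)):
        if w[j:i + 1] not in text:      # naive substring test (assumption)
            z += 1
            j = i + 1
        d.append(z)
    return d


def inexact_search(text, w, z):
    """SA intervals of substrings of text within z differences of w."""
    C, O, n = build_index(text)
    D = calculate_d(text, w)

    def recur(i, z, k, l):
        if z < (D[i] if i >= 0 else 0):
            return set()                # cannot finish within the budget
        if i < 0:
            return {(k, l)}
        found = recur(i - 1, z - 1, k, l)             # skip w[i] (gap)
        for b in C:
            k2 = C[b] + O[b][k] + 1
            l2 = C[b] + O[b][l + 1]
            if k2 <= l2:
                found |= recur(i, z - 1, k2, l2)      # consume b only (gap)
                found |= recur(i - 1, z - (b != w[i]), k2, l2)  # match or mismatch
        return found

    return recur(len(w) - 1, z, 0, n - 1)


print(sorted(inexact_search("googol$", "lol", 1)))  # [(1, 1), (5, 5)]
```

With gaps allowed, the query 'lol' reaches not only 'gol' at [1, 1] but also 'ol' at [5, 5], obtained by skipping the first query symbol. The D bound prunes branches early: with z = 0 the search for 'lol' returns immediately because D(2) is already 1.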
Fig. 4.
Equivalent algorithm to calculate D(i).
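One way to compute D(i) without repeated substring tests is to run backward search on an index of the reversed reference, so that extending W[j, i] by one symbol on the right becomes a single interval update. The sketch below illustrates that idea under stated assumptions: the function names are hypothetical, the reference is passed without its terminating $, and the interval is reset to [0, n-1], a convention of this sketch that may differ from the published figure.

```python
def build_index(text):
    """Return (C, O, n) for text, which must end with '$'."""
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])
    bwt = [text[i - 1] for i in sa]
    alphabet = sorted(set(text) - {"$"})
    C = {a: sum(1 for ch in text[:-1] if ch < a) for a in alphabet}
    O = {a: [0] for a in alphabet}      # O[a][i + 1] is O(a, i)
    for ch in bwt:
        for a in alphabet:
            O[a].append(O[a][-1] + (ch == a))
    return C, O, n


def calculate_d(reference, w):
    """D[i] for query w against reference (given without the trailing '$')."""
    # Index the reversed reference: appending a symbol to a substring of the
    # reference is the same as prepending it to the reversed substring.
    C, O, n = build_index(reference[::-1] + "$")
    k, l, z = 0, n - 1, 0
    d = []
    for ch in w:
        if ch in C:
            k, l = C[ch] + O[ch][k] + 1, C[ch] + O[ch][l + 1]
        else:
            k, l = 1, 0                 # symbol absent from the reference
        if k > l:                       # current stretch is not a substring
            k, l, z = 0, n - 1, z + 1   # start a new stretch, count one difference
        d.append(z)
    return d


print(calculate_d("googol", "lol"))  # [0, 1, 1]
print(calculate_d("googol", "gol"))  # [0, 0, 0]
```

For 'lol', the stretch 'lo' does not occur in 'googol', so at least one difference is needed from position 1 onward.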

Similar articles

  • Fast and Accurate Long-Read Alignment With Burrows-Wheeler Transform
    H Li et al. Bioinformatics 26 (5), 589-95. PMID 20080505.
    We designed and implemented a new algorithm, Burrows-Wheeler Aligner's Smith-Waterman Alignment (BWA-SW), to align long sequences up to 1 Mb against a large sequence data …
  • Short Read Alignment Using SOAP2
    B Hurgobin. Methods Mol Biol 1374, 241-52. PMID 26519410.
    Next-generation sequencing (NGS) technologies have rapidly evolved in the last 5 years, leading to the generation of millions of short reads in a single run. Consequently …
  • Ψ-RA: A Parallel Sparse Index for Genomic Read Alignment
    M Oğuzhan Külekci et al. BMC Genomics 12 Suppl 2 (Suppl 2), S7. PMID 21989248.
    Ψ-RA is expected to serve as a valuable tool in the alignment of short reads generated by the next generation high-throughput sequencing technology. Ψ-RA is very fast in …
  • A Survey of Sequence Alignment Algorithms for Next-Generation Sequencing
    H Li et al. Brief Bioinform 11 (5), 473-83. PMID 20460430. - Review
    Rapidly evolving sequencing technologies produce data on an unparalleled scale. A central challenge to the analysis of this data is sequence alignment, whereby sequence r …
  • Sense From Sequence Reads: Methods for Alignment and Assembly
    P Flicek et al. Nat Methods 6 (11 Suppl), S6-S12. PMID 19844229. - Review
    The most important first step in understanding next-generation sequencing data is the initial alignment or assembly that determines whether an experiment has succeeded an …

Cited by 11,389 PubMed Central articles


