BMC Bioinformatics, 13, 238

Mapping Single Molecule Sequencing Reads Using Basic Local Alignment With Successive Refinement (BLASR): Application and Theory

Mark J Chaisson et al. BMC Bioinformatics.

Abstract

Background: Recent methods have been developed to perform high-throughput sequencing of DNA by Single Molecule Sequencing (SMS). While Next-Generation sequencing methods may produce reads up to several hundred bases long, SMS sequencing produces reads up to tens of kilobases long. Existing alignment methods are either too inefficient for high-throughput datasets, or not sensitive enough to align SMS reads, which have a higher error rate than Next-Generation sequencing.

Results: We describe the method BLASR (Basic Local Alignment with Successive Refinement) for mapping Single Molecule Sequencing (SMS) reads that are thousands of bases long, with divergence between the read and genome dominated by insertion and deletion error. The method is benchmarked using both simulated reads and reads from a bacterial sequencing project. We also present a combinatorial model of sequencing error that motivates why our approach is effective.

Conclusions: The results indicate that it is possible to map SMS reads with high accuracy and speed. Furthermore, the inferences made on the mapability of SMS reads using our combinatorial model of sequencing error are in agreement with the mapping accuracy demonstrated on simulated reads.

Figures

Figure 1
An illustration of relationships between alignment methods. The applications / corresponding computational restrictions shown are (green) short pairwise alignment / detailed edit model; (yellow) database search / divergent homology detection; (red) whole genome alignment / alignment of long sequences with structural rearrangements; and (blue) short read mapping / rapid alignment of massive numbers of short sequences. Although solely illustrative, methods with more similar data structures or algorithmic approaches are on closer branches. The BLASR method combines data structures from short read alignment with optimization methods from whole genome alignment.
Figure 2
The distribution of lengths of error-free segments of reads. The line fitted to the points weighted by frequency has slope −0.071, corresponding to a geometric distribution with parameter 0.848, in close agreement with the 84.5% accuracy of the dataset used. Over 95% of segments are of length 20 or less.
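The numbers in this caption can be checked with a few lines of arithmetic. The sketch below assumes the fit was made on a log10 frequency axis (the caption does not state the base, but base 10 reproduces the quoted parameter to within rounding) and that segment lengths start at 1. The names `p_continue` and `fraction_at_most` are illustrative only.

```python
# Assumption: the slope was fitted to log10(frequency) versus segment length.
slope = -0.071

# Ratio between successive frequencies, i.e. the probability that an
# error-free run is extended by one more correct base.
p_continue = 10 ** slope


def fraction_at_most(n: int, p: float) -> float:
    """P(length <= n) for a geometric length with P(length = j) = (1 - p) * p**(j - 1), j >= 1."""
    return 1.0 - p ** n


print(f"geometric parameter ~ {p_continue:.3f}")
print(f"fraction of segments of length <= 20 ~ {fraction_at_most(20, p_continue):.3f}")
```

Under these assumptions the parameter comes out close to the quoted 0.848, and the fraction of segments of length 20 or less is above 95%, in line with the caption.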
Figure 3
Waiting length to sequence a word of length k at ε = 0.05. The waiting lengths to sequence a word of length ≥ k at ε = 0.05 at varying accuracy. This gives an estimate of the number of bases that must be sequenced before obtaining an error-free stretch that may serve as an alignment anchor.
Figure 4
Values for NumConfigurations(M,N,K,L)/(L choose M) for parameters similar to SMS sequencing. The fraction of configurations allowing at least N anchors of length 15, 20, and 25 for N between 0 and 50 are shown for a 1000 base read when placing (A) 200, (B) 150, (C) 100, and (D) 50 errors.
Figure 5
S-similar sequences measured in the human genome. One million query intervals, each 1000 bases long, were randomly sampled from the genome. Each query interval was searched against the human genome to determine the number of non-overlapping 1000 base intervals in the genome that are ≥S-similar to the query. The cumulative distribution of the number of target intervals that are (A) ≥1-similar, (B) ≥5-similar, (C) ≥10-similar, and (D) ≥20-similar to these 1 million query intervals is shown. Each panel uses minimum anchor lengths k = 15, 20, and 25 and indel rate δ = 0.15. From this, one may interpret the number of intervals that must be searched when mapping a read using anchors. For example, when mapping with a minimum of a single 25 base match, 80% of the queries match 100 or fewer other intervals in the genome with at least one 25 base match (point X). At the other extreme, the top 3% of queries map to over 1 million other intervals with at least one match (point Y), due to the high repeat content of the genome. This indicates that 80% of sequences may be correctly mapped to the human genome using a single 25 base match by searching only 100 candidates; for full sensitivity, however, many more candidates must be searched. Points P and Q contrast the fraction of intervals that have 100 or fewer matches in the genome when matching using 1 or more anchors versus 20 or more anchors, for an anchor length of 15. Only 20% of the samples are limited to 100 or fewer additional matching intervals with at least 1 anchor (point P), whereas 97.5% of the samples have 100 or fewer matches when requiring at least 20 anchors in a match (point Q).
Figure 6
The mapability of simulated sequences from the E. coli, A. thaliana, and human genomes. Mapping accuracy is shown on a Phred scale (−10 log10((missing + mismapped)/total)) for all three plots. Reads were simulated with base accuracies 1−ρ = 80%, 85%, … , 100%. In the fraction ρ of positions that are erroneous, we simulated 10% substitutions, 62% insertions, and 28% deletions. Missing values have no mismapped reads.
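As a small illustration of the Phred-scaled accuracy used in this figure, the sketch below converts counts of missing and mismapped reads into a score. It assumes the standard Phred convention, −10 log10 of the error fraction; the function name `phred_mapping_accuracy` is hypothetical and not taken from the paper.

```python
import math


def phred_mapping_accuracy(missing: int, mismapped: int, total: int) -> float:
    """Phred-scaled mapping accuracy: -10 * log10((missing + mismapped) / total).

    Assumes the usual Phred convention (base-10 logarithm, negated).
    With no missing or mismapped reads the score is unbounded, so
    infinity is returned instead of evaluating log10(0).
    """
    errors = missing + mismapped
    if errors == 0:
        return math.inf
    return -10.0 * math.log10(errors / total)


# 10 problematic reads out of 10,000 is an error fraction of 1e-3.
print(phred_mapping_accuracy(missing=3, mismapped=7, total=10_000))
```

On this scale each additional 10 points corresponds to a tenfold reduction in the fraction of missing or mismapped reads.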
Figure 7
Statistics of reads from E. coli O104:H4 produced by the PacBio RS sequencing platform. (Black) The fraction of reads with length at least x. This is roughly the survival curve of an exponential distribution. (Blue) The fraction of reads (of length at least x) that are correct at position x. Accuracy is nearly position independent, so the blue curve is roughly the constant 1−ρ, where ρ is the error rate per position.
Figure 8
Mapping quality values of reads simulated from the human genome. (A) The frequency of quality values for alignments of 10^6 simulated 1000, 2000, and 3000 base sequences from the human genome. (B) The empirical mapping quality values of the alignments.
Figure 9
Overview of the BLASR method. (A) Candidate intervals are found by mapping short, exact matches as shown by colored arrows. Either a suffix array or a BWT-FM index of the genome is used to find the exact matches. Intervals are defined over clusters of matches and are ranked; intervals with score 3, 6, and 4 are shown. (B) Matches scoring above a threshold are aligned using sparse dynamic programming on shorter exact matches. (C) Alignments that have a high-scoring sparse-dynamic programming score are realigned by dynamic programming over a subset of cells defined using the sparse dynamic programming alignment as a guide.
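Step (A) can be illustrated with a toy candidate finder. This is a simplified stand-in, not BLASR's implementation: a plain hash index of k-mers replaces the suffix array or BWT-FM index, and exact matches are grouped by binning their diagonal (genome position minus read position) rather than by the paper's clustering and ranking. The function names, the parameters `k` and `band`, and the example sequences are all hypothetical.

```python
from collections import Counter, defaultdict


def build_kmer_index(genome: str, k: int) -> dict:
    """Map every k-mer in the genome to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return index


def rank_candidates(read: str, genome: str, k: int = 5, band: int = 8) -> list:
    """Rank candidate read start regions by the number of exact k-mer matches.

    Each match (read offset r, genome offset g) votes for the diagonal g - r.
    Diagonals are binned into windows of width `band` so that a few indels,
    which shift the diagonal slightly, still vote for the same candidate.
    Returns (bin start, vote count) pairs, best supported first.
    """
    index = build_kmer_index(genome, k)
    votes = Counter()
    for r in range(len(read) - k + 1):
        for g in index.get(read[r:r + k], ()):
            votes[(g - r) // band] += 1
    return [(bin_id * band, n) for bin_id, n in votes.most_common()]


# Toy data: the read is genome[10:30] with the base at position 18 deleted.
genome = "ACGTACCTGAAGTCCATGGCTTAGCAATCGGATTCACGAGGTTGCCAT"
read = genome[10:18] + genome[19:30]
best_start, support = rank_candidates(read, genome)[0]
print(best_start, support)
```

In this sketch the best-supported bin is the one containing the true start of the read. Steps (B) and (C), the sparse dynamic programming and banded realignment, would then be run only on the top-ranked candidates.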
Figure 10
Toy example for counting components. A read of length L = 7 with M = 2 errors is shown, with errors in red. In general, M errors split the read into M + 1 parts, some of which may be null; in this case, the third part is null. For anchor length threshold K = 3 (meaning parts of size >3 are anchors, parts of size ≤3 are not), we have N = 1 anchor (the first part).
Figure 11
The fraction of configurations with exactly and at least N anchors. (A) Plot of the fraction of configurations with exactly N anchors, c_{M,N,K}(L)/(L choose M), as N varies. An anchor is a run of at least K correct bases (shown for K = 15, 20, and 25). We assume the read length is L = 1000 and the error rate per base is ρ = 15% (and that there are exactly M = 150 error positions). The solid markers are computed by finding exact coefficients c_{M,N,K}(L) in the generating functions. The curve is a normal distribution approximating the exact values (illustrating Theorem A3), where parameters μ and σ² are computed by Theorem A2. (B) The solid markers are a plot of NumConfigurations(M,N,K,L)/(L choose M), the fraction of configurations with at least N anchors, as N varies. The parameters are the same as for (A). The curve is the survival function of the normal distribution in (A).
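For very small reads, the quantities plotted in this figure can be obtained by brute-force enumeration instead of generating functions. The sketch below makes two assumptions: an anchor is a run of at least K correct bases (the definition given in this caption), and the fractions are normalised by the number of ways to choose M error positions out of L. The function names are illustrative and do not come from the paper.

```python
from collections import Counter
from itertools import combinations
from math import comb


def count_anchors(error_positions, L: int, K: int) -> int:
    """Count maximal runs of at least K consecutive correct bases in a read of length L."""
    errors = set(error_positions)
    anchors, run = 0, 0
    for pos in range(L):
        if pos in errors:
            if run >= K:
                anchors += 1
            run = 0
        else:
            run += 1
    if run >= K:  # the final run is not terminated by an error
        anchors += 1
    return anchors


def anchor_distribution(L: int, M: int, K: int) -> Counter:
    """Number of configurations of M errors with exactly N anchors, keyed by N."""
    return Counter(count_anchors(c, L, K) for c in combinations(range(L), M))


def fraction_at_least(N: int, L: int, M: int, K: int) -> float:
    """Fraction of configurations of M errors that leave at least N anchors."""
    dist = anchor_distribution(L, M, K)
    return sum(v for n, v in dist.items() if n >= N) / comb(L, M)


# Toy parameters: L = 7, M = 2, K = 3.
print(dict(anchor_distribution(7, 2, 3)))
print(fraction_at_least(1, 7, 2, 3))
```

Enumeration grows as (L choose M), so it is only practical for toy sizes; for L = 1000 and M = 150 the exact coefficients or the normal approximation described in the caption are needed.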
