Bioinformatics. 2012 Oct 1;28(19):2417-24.
doi: 10.1093/bioinformatics/bts456. Epub 2012 Jul 24.

YAHA: fast and flexible long-read alignment with optimal breakpoint detection


Gregory G Faust et al. Bioinformatics. 2012.

Abstract

Motivation: With improved short-read assembly algorithms and the recent development of long-read sequencers, split mapping will soon be the preferred method for structural variant (SV) detection. Yet, current alignment tools are not well suited for this.

Results: We present YAHA, a fast and flexible hash-based aligner. YAHA is as fast and accurate as BWA-SW at finding the single best alignment per query and is dramatically faster and more sensitive than both SSAHA2 and MegaBLAST at finding all possible alignments. Other aligners report either all alignments or a single alignment per query, or use simple heuristics to select alignments. YAHA instead uses a directed acyclic graph to find the optimal set of alignments that cover a query, using a biologically relevant breakpoint penalty. YAHA can also report multiple mappings per defined segment of the query. We show that YAHA detects more breakpoints in less time than BWA-SW across all SV classes, and especially excels at complex SVs comprising multiple breakpoints.
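The idea of selecting a best-scoring set of alignments across a query with a breakpoint penalty can be illustrated with a small dynamic program over a directed acyclic graph. This is a minimal sketch under simplifying assumptions, not YAHA's actual algorithm: alignments are reduced to hypothetical `(q_start, q_end, score)` tuples, selected alignments may not overlap on the query, and every junction between consecutive alignments costs the same flat `breakpoint_penalty`. The names `Alignment` and `optimal_query_coverage` are illustrative only.

```python
from typing import List, NamedTuple, Optional, Tuple


class Alignment(NamedTuple):
    q_start: int  # query start, inclusive
    q_end: int    # query end, exclusive (assumed > q_start)
    score: int    # alignment score


def optimal_query_coverage(
    alignments: List[Alignment], breakpoint_penalty: int
) -> Tuple[int, List[Alignment]]:
    """Pick a chain of non-overlapping alignments maximizing
    sum(scores) - breakpoint_penalty * (number of junctions).

    Assumption: a flat penalty per junction and no overlap allowed.
    """
    # Sorting by query end gives a topological order of the DAG in which
    # an edge i -> j exists when alignment i ends before alignment j starts.
    order = sorted(
        range(len(alignments)),
        key=lambda i: (alignments[i].q_end, alignments[i].q_start),
    )
    best = {}   # best chain score ending at each alignment
    prev = {}   # predecessor on that best chain
    for pos, j in enumerate(order):
        aj = alignments[j]
        best[j] = aj.score
        prev[j] = None
        for i in order[:pos]:
            if alignments[i].q_end <= aj.q_start:
                candidate = best[i] - breakpoint_penalty + aj.score
                if candidate > best[j]:
                    best[j] = candidate
                    prev[j] = i
    if not best:
        return 0, []
    # Trace back from the highest-scoring chain end.
    node: Optional[int] = max(best, key=best.get)
    total = best[node]
    chain = []
    while node is not None:
        chain.append(alignments[node])
        node = prev[node]
    chain.reverse()
    return total, chain


candidates = [
    Alignment(0, 100, 90),
    Alignment(100, 200, 80),
    Alignment(50, 150, 120),
    Alignment(0, 200, 100),
]
print(optimal_query_coverage(candidates, breakpoint_penalty=20))
print(optimal_query_coverage(candidates, breakpoint_penalty=60))
```

With a low penalty the two-piece split alignment wins. With a high penalty a single alignment is preferred, which is the trade-off a breakpoint penalty is meant to control.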

Availability: YAHA is currently supported on 64-bit Linux systems. Binaries and sample data are freely available for download from http://faculty.virginia.edu/irahall/YAHA.

Contact: imh4y@virginia.edu.

Figures

Fig. 1
(A) Starting at each location in the query, we form a k-mer, which is then converted to a hash key by compressing the bases in the k-mer using 2 bits per base. That hash key is then used to index directly into the Hash Array, giving the starting offset and length of the subset of the ROA that contains the collection of reference locations for that k-mer.
(B) Next, seed matches from the query and reference that fall along the same diagonal are collected into extended seeds called ‘fragments’ by merging the pre-sorted ROA regions for each query location using a Binary Heap.
(C) In any given region of the reference, many fragments can be included in a potential alignment. YAHA uses a graph algorithm to find the set that maximizes the estimated score. In this example, fragments 1, 2 and 4 form the best alignment.
(D) During the Optimal Query Coverage algorithm, we find the collection of ‘primary’ alignments (green lines) that has the highest non-overlapping sum of scores. Filter By Similarity is then used to determine the remaining ‘secondary’ alignments (blue lines) that are highly similar to any primary alignment. The remaining alignments (red lines) are not included in the output for the query.
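Panels A and B can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not YAHA's implementation: the function names (`kmer_key`, `build_index`, `seed_hits`, `fragments`) are hypothetical, the Hash Array is a plain list with one `(offset, length)` entry per possible key, k-mers containing bases other than A, C, G, T are skipped, and the standard-library `heapq.merge` (a heap-based k-way merge) stands in for the Binary Heap in the caption.

```python
import heapq
from typing import Dict, List, Tuple

BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # 2 bits per base


def kmer_key(kmer: str) -> int:
    """Pack a k-mer into an integer hash key, 2 bits per base."""
    key = 0
    for base in kmer:
        key = (key << 2) | BASE_CODE[base]
    return key


def build_index(reference: str, k: int) -> Tuple[List[Tuple[int, int]], List[int]]:
    """Return (hash_array, roa): hash_array[key] = (offset, length) into roa,
    and roa holds the sorted reference positions of each k-mer."""
    buckets: List[List[int]] = [[] for _ in range(4 ** k)]
    for pos in range(len(reference) - k + 1):
        kmer = reference[pos:pos + k]
        if all(b in BASE_CODE for b in kmer):
            buckets[kmer_key(kmer)].append(pos)
    hash_array, roa = [], []
    for bucket in buckets:
        hash_array.append((len(roa), len(bucket)))
        roa.extend(bucket)
    return hash_array, roa


def seed_hits(query: str, hash_array, roa, k: int) -> List[Tuple[int, int]]:
    """Merge the pre-sorted ROA slices for every query position into one
    list of (ref_pos, query_pos) seed matches ordered by ref_pos."""
    runs = []
    for qpos in range(len(query) - k + 1):
        kmer = query[qpos:qpos + k]
        if not all(b in BASE_CODE for b in kmer):
            continue
        offset, length = hash_array[kmer_key(kmer)]
        runs.append([(rpos, qpos) for rpos in roa[offset:offset + length]])
    return list(heapq.merge(*runs))


def fragments(hits: List[Tuple[int, int]], k: int) -> Dict[int, List[Tuple[int, int]]]:
    """Group seeds by diagonal (ref_pos - query_pos) and merge overlapping
    or adjacent seeds into (q_start, q_end) fragments."""
    by_diag: Dict[int, List[List[int]]] = {}
    for rpos, qpos in hits:
        frags = by_diag.setdefault(rpos - qpos, [])
        if frags and qpos <= frags[-1][1]:
            frags[-1][1] = max(frags[-1][1], qpos + k)
        else:
            frags.append([qpos, qpos + k])
    return {d: [(s, e) for s, e in fs] for d, fs in by_diag.items()}


reference = "ACGTACGTTTGACCA"
hash_array, roa = build_index(reference, k=4)
hits = seed_hits("CGTTTGAC", hash_array, roa, k=4)
print(fragments(hits, k=4))
```

The toy table has 4^k entries, which is only practical for small k. It is meant to show the direct-index lookup and the diagonal grouping, not a realistic memory layout.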
Fig. 2
Histogram of the number of queries in the Y1 YAHA run with varying numbers of greater, equal and fewer GE50U alignments than MegaBLAST (M). Note the log10 scale bucket sizes. The total number of queries above 0 is 30 638 and below 0 is 1005 as in Table 1.
Fig. 3
Shown are graphs of the percentage of queries for which each aligner correctly verified an SV breakpoint, for various types of SV events, versus the amount of CPU time consumed. Note the large improvement with the inclusion of YAHA’s secondary alignments in the Alu dataset. Also note the marked improvement for both BWA-SW and YAHA in the CGR dataset with 4% error rate when the AGS parameters are changed to lower the penalty for indels relative to replacements. Still, YAHA outperforms BWA-SW with both sets of AGS parameters. Graphs C and D are shown with the same axes to ease comparison.
