, 25 (13), 1609-16

A Practical Algorithm for Finding Maximal Exact Matches in Large Sequence Datasets Using Sparse Suffix Arrays

Zia Khan et al. Bioinformatics.

Abstract

Motivation: High-throughput sequencing technologies place ever-increasing demands on existing algorithms for sequence analysis. Algorithms for computing maximal exact matches (MEMs) between sequences appear in two contexts where high-throughput sequencing will vastly increase the volume of sequence data: (i) seeding alignments of high-throughput reads for genome assembly and (ii) designating anchor points for genome-genome comparisons.

Results: We introduce a new algorithm for finding MEMs. The algorithm leverages a sparse suffix array (SA), a text index that stores only every K-th position of the text. In contrast to a full-text index, which stores every position of the text, a sparse SA occupies much less memory. Even though we use a sparse index, the output of our algorithm is the same as that of a full-text index algorithm, as long as the spacing between indexed suffixes is no greater than the minimum MEM length. By relying on partial matches and additional text scanning between indexed positions, the algorithm trades extra computation for reduced memory. The reduced memory usage makes it possible to determine MEMs between significantly longer sequences.
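The relationship between the spacing K and the minimum MEM length L can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`build_sparse_sa`, `sa_range`, `sparse_mems`, `brute_mems`), the naive sort-based construction and the slice-based binary search are assumptions chosen for brevity. The sketch looks up partial matches of length L - (K - 1) at indexed positions, extends them to the right, and scans at most K - 1 characters to the left, then compares the result with a brute-force enumeration.

```python
def build_sparse_sa(S, K):
    """Start positions of every K-th suffix of S, in lexicographic order (naive)."""
    return sorted(range(0, len(S), K), key=lambda i: S[i:])

def sa_range(S, sa, pat):
    """Half-open range of entries in sa whose suffix starts with pat."""
    m = len(pat)
    lo, hi = 0, len(sa)
    while lo < hi:                       # first suffix with prefix >= pat
        mid = (lo + hi) // 2
        if S[sa[mid]:sa[mid] + m] < pat:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:                       # first suffix with prefix > pat
        mid = (lo + hi) // 2
        if S[sa[mid]:sa[mid] + m] <= pat:
            lo = mid + 1
        else:
            hi = mid
    return start, lo

def sparse_mems(S, P, K, L):
    """MEMs (query_pos, ref_pos, length) of length >= L, using a K-sparse SA."""
    assert 1 <= K <= L, "the spacing K must not exceed the minimum MEM length L"
    sa = build_sparse_sa(S, K)
    min_len = L - (K - 1)                # partial match length needed at an indexed position
    found = set()
    for q in range(len(P) - min_len + 1):
        start, end = sa_range(S, sa, P[q:q + min_len])
        for r in sa[start:end]:
            n = min_len                  # extend the partial match to the right
            while q + n < len(P) and r + n < len(S) and P[q + n] == S[r + n]:
                n += 1
            d = 0                        # scan left between indexed positions
            while d < K and q - d > 0 and r - d > 0 and P[q - d - 1] == S[r - d - 1]:
                d += 1
            # d == K means an earlier indexed position already covers this match
            if d < K and n + d >= L:
                found.add((q - d, r - d, n + d))
    return found

def brute_mems(S, P, L):
    """Reference answer: every left- and right-maximal match of length >= L."""
    out = set()
    for q in range(len(P)):
        for r in range(len(S)):
            if q > 0 and r > 0 and P[q - 1] == S[r - 1]:
                continue                 # not left maximal
            n = 0
            while q + n < len(P) and r + n < len(S) and P[q + n] == S[r + n]:
                n += 1
            if n >= L:
                out.add((q, r, n))
    return out
```

For S = mississippi$ and P = issxiss, `sparse_mems(S, P, 2, 2)` and `brute_mems(S, P, 2)` return the same four MEMs, each of length 3. The left scan is what recovers matches that begin at a position the sparse index does not store.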

Availability: Source code for the algorithm is available under a BSD open source license at http://compbio.cs.princeton.edu/mems. The implementation can serve as a drop-in replacement for the MEMs algorithm in MUMmer 3.

Figures

Fig. 1.
The suffix indexes of the reference text S = mississippi$ listed in order (left). The SA is an array of integers where these indices are listed in lexicographical order. LCP and ISA designate the longest common prefix (LCP) array and inverse SA, respectively (see text) (middle). Search for the occurrences of P = iss in the SA by a top-down search one character at a time (right).
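The arrays described in this caption can be reproduced with a short Python sketch. It assumes 0-based positions and the convention LCP[0] = 0, which may differ from the numbering used in the figure itself; the helper names (`build_arrays`, `top_down_search`) are illustrative, and the quadratic construction is only suitable for small examples.

```python
def build_arrays(S):
    """Naive construction of the SA, inverse SA and LCP array of S."""
    sa = sorted(range(len(S)), key=lambda i: S[i:])
    isa = [0] * len(S)
    for rank, pos in enumerate(sa):
        isa[pos] = rank                  # ISA[SA[i]] = i
    lcp = [0] * len(S)                   # lcp[i]: common prefix of suffixes sa[i-1], sa[i]
    for i in range(1, len(S)):
        a, b = sa[i - 1], sa[i]
        n = 0
        while a + n < len(S) and b + n < len(S) and S[a + n] == S[b + n]:
            n += 1
        lcp[i] = n
    return sa, isa, lcp

def top_down_search(S, sa, P):
    """Narrow the SA interval [l..r] one query character at a time.

    Returns the inclusive interval of suffixes starting with P, or None.
    """
    l, r = 0, len(sa) - 1
    for depth, c in enumerate(P):
        def char_at(i):
            pos = sa[i] + depth
            return S[pos] if pos < len(S) else ""   # "" sorts before any character
        lo, hi = l, r + 1
        while lo < hi:                   # first suffix whose character is >= c
            mid = (lo + hi) // 2
            if char_at(mid) < c:
                lo = mid + 1
            else:
                hi = mid
        new_l, hi = lo, r + 1
        while lo < hi:                   # first suffix whose character is > c
            mid = (lo + hi) // 2
            if char_at(mid) <= c:
                lo = mid + 1
            else:
                hi = mid
        if new_l > lo - 1:
            return None
        l, r = new_l, lo - 1
    return l, r

S = "mississippi$"
sa, isa, lcp = build_arrays(S)
# sa  == [11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]
# lcp == [0, 0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3]
interval = top_down_search(S, sa, "iss")   # (3, 4): suffixes at positions 4 and 1
```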
Fig. 2.
Sparse SA example. The sparse suffix indexes for K = 2 of the reference text S = mississippi$ listed in order (left). Compare to Figure 1. The corresponding sparse SA, ISA and LCP arrays (middle). Search for the occurrences of P = iss in the sparse SA locates only one string match (right).
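As a rough illustration of this caption, the following Python sketch builds the sparse arrays for K = 2 and shows that a lookup in the sparse SA reports only one of the two occurrences of iss. Storing the sparse ISA at index position // K and setting LCP[0] = 0 are assumptions made here, not conventions taken from the figure, and `occurrences` uses a linear filter rather than a binary search to keep the example short.

```python
S, K = "mississippi$", 2

def lcp_len(a, b):
    """Length of the longest common prefix of strings a and b."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n

# Only suffixes starting at positions 0, K, 2K, ... are indexed.
sparse_sa = sorted(range(0, len(S), K), key=lambda i: S[i:])
sparse_isa = [0] * len(sparse_sa)
for rank, pos in enumerate(sparse_sa):
    sparse_isa[pos // K] = rank          # assumed layout: one slot per indexed position
sparse_lcp = [0] + [lcp_len(S[a:], S[b:]) for a, b in zip(sparse_sa, sparse_sa[1:])]

def occurrences(S, sa, P):
    """Indexed positions in sa at which P occurs (linear filter for brevity)."""
    return sorted(i for i in sa if S.startswith(P, i))

full_sa = sorted(range(len(S)), key=lambda i: S[i:])
sparse_hits = occurrences(S, sparse_sa, "iss")   # [4]
full_hits = occurrences(S, full_sa, "iss")       # [1, 4]
```

The occurrence at position 1 is missing from `sparse_hits` because position 1 is not indexed when K = 2.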
Fig. 3.
(a) Partial matches at successive locations in the query P = issxiss and K = 2 sparsely indexed string S = mississippi$. The asterisk indicates an indexed position and the numbers in the left column designate positions in the query P. A match of ‘ss’ can be used to recover the MEM ‘iss’ by scanning left of the match and checking for left maximality. (b) Here, we consider another query P = ississ to find MEMs of length ≥4 in the full text K = 1 SA. At the initial position in the query p = 0, the interval of a right maximal match of length 6 at position 1 in the reference is found by top-down search. Examining neighboring LCP values, the algorithm ‘unmatches’ the characters ‘iss’ to find a second right maximal match ‘issi’ of length 4.
Fig. 4.
Suffix link simulation for the full text K = 1 example in Figure 1. Top-down binary search for the query P = ‘is’ narrows the search down to the interval [3..4] (left). From [3..4], we can use the ISA for K = 1 to compute a new interval l = ISA[SA[l] + 1] = 10 and r = ISA[SA[r] + 1] = 11 (middle). However, this interval [10..11] does not correspond to the interval [8..11] obtained by top-down search of the single character query P = ‘s’. The correct interval is obtained by expanding the left side of the interval using values ≥1 (in bold) in the LCP array (right).
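The interval arithmetic in this caption can be checked with a small Python sketch for the K = 1 case. The function name `simulate_suffix_link` is hypothetical, positions are 0-based, and LCP[0] = 0 is an assumed convention. The caption only needs the left side expanded; the sketch also expands the right side, which is an assumption about the general case and has no effect on this example.

```python
def build_arrays(S):
    """Naive SA, inverse SA and LCP array (lcp[i] compares suffixes sa[i-1], sa[i])."""
    sa = sorted(range(len(S)), key=lambda i: S[i:])
    isa = [0] * len(S)
    for rank, pos in enumerate(sa):
        isa[pos] = rank
    lcp = [0] * len(S)
    for i in range(1, len(S)):
        a, b = sa[i - 1], sa[i]
        n = 0
        while a + n < len(S) and b + n < len(S) and S[a + n] == S[b + n]:
            n += 1
        lcp[i] = n
    return sa, isa, lcp

def simulate_suffix_link(sa, isa, lcp, l, r, depth):
    """Map the interval [l..r] of a match of length depth to the interval of
    the same match with its first character dropped (length depth - 1).

    Assumes depth >= 1 and that the matched string does not contain the
    sentinel, so sa[l] + 1 and sa[r] + 1 are valid text positions.
    """
    new_depth = depth - 1
    new_l = isa[sa[l] + 1]
    new_r = isa[sa[r] + 1]
    # Neighbours that share at least new_depth characters belong to the interval.
    while new_l > 0 and lcp[new_l] >= new_depth:
        new_l -= 1
    while new_r + 1 < len(sa) and lcp[new_r + 1] >= new_depth:
        new_r += 1
    return new_l, new_r

S = "mississippi$"
sa, isa, lcp = build_arrays(S)
# 'is' occupies [3..4]; the ISA step alone gives [10..11], expansion yields [8..11].
link = simulate_suffix_link(sa, isa, lcp, 3, 4, depth=2)   # (8, 11)
```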
Fig. 5.
Effect of increasing values of K on memory usage and MEM computation time using unpaired 454 reads from the 1000 Genomes Project and the 3.1 Gbp human genome (hg18). For this evaluation, L = 100; that is, we computed MEMs of length ≥100. Total memory usage in gigabytes on a 4-core 2 GHz Intel Xeon CPU with 16 GB of RAM (left) and corresponding average per-read computation time in milliseconds (right).
