Brief Bioinform. 2014 Mar;15(2):138-54.
doi: 10.1093/bib/bbt081. Epub 2014 Jan 10.

A bioinformatician's guide to the forefront of suffix array construction algorithms

Free PMC article

Anish Man Singh Shrestha et al. Brief Bioinform. 2014 Mar.

Abstract

The suffix array and its variants are text-indexing data structures that have become indispensable in the field of bioinformatics. With the uninitiated in mind, we provide an accessible exposition of the SA-IS algorithm, which is the state of the art in suffix array construction. We also describe DisLex, a technique that allows standard suffix array construction algorithms to create modified suffix arrays designed to enable a simple form of inexact matching needed to support 'spaced seeds' and 'subset seeds' used in many biological applications.

Keywords: linear-time algorithm; spaced seeds; subset seeds; suffix array construction; text index.

Figures

Figure 1:
A string (above) and its suffix array (shown vertically) along with the position index on the left and the corresponding suffixes to the right.
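To make the data structure in this caption concrete, here is a minimal sketch of a suffix array and a binary-search lookup over it. It assumes a text terminated by a unique sentinel `$` that sorts below every other character; the function names are hypothetical, and the quadratic sort-based construction is only a stand-in for a real construction algorithm such as SA-IS.

```python
def suffix_array(text):
    """Naive construction: sort suffix start positions by the suffixes themselves."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return all start positions of pattern, using two binary searches on sa."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose length-m prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose length-m prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


text = "cagctat$"
sa = suffix_array(text)
print(sa)                               # suffix start positions in sorted order
print(find_occurrences(text, sa, "a"))  # positions where "a" occurs
```

All suffixes that share a prefix occupy one contiguous interval of the array, which is why two binary searches are enough to locate every occurrence of a pattern.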
Figure 2:
Divide phase of the algorithm by Kim et al. Here, [formula] accca$. The set of sampled suffixes [formula] {accca$, cca$, a$}. [formula] ac; [formula] cc; [formula] a$. Since a$ < ac < cc, RANK([formula]) = 1, RANK([formula]) = 2, and RANK([formula]) = 0. Therefore, [formula] 120.
Figure 3:
Divide phase. (a) String T with its suffixes classified as [formula], [formula], [formula]. (b) Construction of reduced instance [formula] by lexical naming.
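The two parts of this caption can be sketched as follows. The formula symbols are missing from the caption, so this sketch assumes the usual SA-IS terminology: a suffix is S-type if it is lexically smaller than the suffix that follows it, L-type if it is larger, and S*-type if it is S-type and preceded by an L-type suffix. The function names are hypothetical, the text is assumed to end with a unique smallest sentinel `$`, and the S* substrings are named by plain sorting rather than by the linear-time procedure of the real algorithm.

```python
def classify(text):
    """Return a list where True marks S-type and False marks L-type positions."""
    n = len(text)
    is_s = [False] * n
    is_s[n - 1] = True  # the sentinel suffix is S-type by convention
    for i in range(n - 2, -1, -1):
        is_s[i] = text[i] < text[i + 1] or (text[i] == text[i + 1] and is_s[i + 1])
    return is_s


def reduced_instance(text):
    """Name each S* substring by its rank; return (S* positions, reduced string)."""
    is_s = classify(text)
    starts = [i for i in range(1, len(text)) if is_s[i] and not is_s[i - 1]]

    def key(idx):
        # An S* substring runs from one S* position to the next, inclusive.
        # Characters are compared first; on a tie, L-type sorts before S-type.
        i = starts[idx]
        end = starts[idx + 1] if idx + 1 < len(starts) else i
        return tuple((text[k], is_s[k]) for k in range(i, end + 1))

    keys = [key(idx) for idx in range(len(starts))]
    name = {k: r for r, k in enumerate(sorted(set(keys)))}
    return starts, [name[k] for k in keys]


starts, reduced = reduced_instance("cagctat$")
print(starts, reduced)
```

Sorting the suffixes of the reduced string gives the order of the S* suffixes of the original text, which is what allows the problem to be solved recursively on a shorter input.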
Figure 4:
Buckets of a DNA-string suffix array of length n. Gray indicates [formula]-type positions. The bucket for T does not have a subbucket for [formula] because there cannot be any [formula] suffix starting with the lexically greatest character of the alphabet.
Figure 5:
Array A at the end of Step 0, in which the [formula] suffixes have been placed in their buckets in sorted order. Gray indicates [formula]-type positions. This order of [formula] suffixes is obtained from recursion.
Figure 6:
Animation of Step 1 of the combine phase as the sweep proceeds from left to right. The original text T is also shown for reference. The [formula] symbols point to the current heads of [formula]-type subbuckets, the ∙ symbol shows the current position of the sweep, and cells with thick boundaries indicate changes. For example, in the topmost row, suffix index 13 is encountered; and as [formula] is [formula]-type, 12 is inserted at A5, the current head of the bucket for [formula]-type suffixes starting with g. The sweep proceeds accordingly. Whenever a pointer reaches the edge of its bucket, we change its representation to a dashed arrow. From sweep position 10 onwards, the array does not change and so this animation excludes those steps.
Figure 7:
Animation of Step 2 of the combine phase as the sweep proceeds from right to left. The original text T is also shown for reference. The [formula] symbols point to the current tails of [formula]-type subbuckets, the ∙ symbol shows the current position of the sweep, and cells with thick boundaries indicate changes. For example, in the topmost row, suffix index 0 is encountered, and therefore no action needs to be taken. Next, suffix index 2 is encountered; and as [formula] is [formula]-type, 1 is inserted at A9, the current tail of the bucket for [formula]-type suffixes starting with g. Whenever a pointer reaches the edge of its bucket, we change its representation to a dashed arrow. From sweep position 2 onwards, the array does not change and so this animation excludes those steps.
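Steps 0 to 2 of the combine phase, as described in the captions of Figures 5 to 7, can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: the missing formula symbols are taken to be the usual S, L and S* suffix types, the text is assumed to end with a unique smallest sentinel `$`, the function names are hypothetical, and the sorted order of the S* suffixes is obtained by a plain sort where the real algorithm would use recursion.

```python
def classify(text):
    """True marks S-type positions, False marks L-type positions."""
    n = len(text)
    is_s = [False] * n
    is_s[n - 1] = True  # sentinel suffix is S-type
    for i in range(n - 2, -1, -1):
        is_s[i] = text[i] < text[i + 1] or (text[i] == text[i + 1] and is_s[i + 1])
    return is_s


def induced_sort(text):
    n = len(text)
    is_s = classify(text)

    # Stand-in for the recursive call: S* suffixes in sorted order.
    s_star = [i for i in range(1, n) if is_s[i] and not is_s[i - 1]]
    s_star.sort(key=lambda i: text[i:])

    # Bucket boundaries: one bucket per character, in alphabet order.
    heads, tails, total = {}, {}, 0
    for c in sorted(set(text)):
        heads[c] = total
        total += text.count(c)
        tails[c] = total - 1

    sa = [-1] * n  # -1 marks an empty cell

    # Step 0: place the sorted S* suffixes at the tails of their buckets.
    tail = dict(tails)
    for i in reversed(s_star):
        sa[tail[text[i]]] = i
        tail[text[i]] -= 1

    # Step 1: left-to-right sweep; insert L-type predecessors at bucket heads.
    head = dict(heads)
    for k in range(n):
        j = sa[k] - 1
        if sa[k] > 0 and not is_s[j]:
            sa[head[text[j]]] = j
            head[text[j]] += 1

    # Step 2: right-to-left sweep; insert S-type predecessors at bucket tails.
    tail = dict(tails)
    for k in range(n - 1, -1, -1):
        j = sa[k] - 1
        if sa[k] > 0 and is_s[j]:
            sa[tail[text[j]]] = j
            tail[text[j]] -= 1

    return sa


print(induced_sort("cagctat$"))
```

Each sweep reads a suffix index from the array and writes the index just before it, so the position of every suffix is induced from the position of the suffix that follows it in the text.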
Figure 8:
Time and memory performance of implementations of select suffix sorting algorithms. The shapes of the markers distinguish the different programs, and the fill-styles distinguish the data sets. The nonfilled markers connected by lines correspond to the performance for the four data sets constructed from increasingly many polymorphic copies of human chromosome 22.
Figure 9:
Contrasting the ordinary suffix array (left) of cagctat$ with its spaced suffix array under mask 101 (right). The characters at don’t-care positions have been replaced by *.
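The contrast in this caption can be reproduced with a small sketch. It assumes that the mask is applied cyclically along each suffix and that characters at the 0 positions of the mask are replaced by `*` before sorting, as the caption describes; the function names are hypothetical, and the sort-based construction is for illustration only.

```python
def suffix_array(text):
    """Ordinary suffix array by naive sorting."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def spaced_suffix_array(text, mask):
    """Sort suffixes after replacing don't-care positions (mask '0') by '*'."""
    w = len(mask)

    def masked(i):
        return "".join(
            c if mask[k % w] == "1" else "*" for k, c in enumerate(text[i:])
        )

    return sorted(range(len(text)), key=masked)


text = "cagctat$"
print(suffix_array(text))                # [7, 1, 5, 0, 3, 2, 6, 4]
print(spaced_suffix_array(text, "101"))  # [7, 5, 1, 3, 0, 2, 6, 4]
```

In this sketch the two arrays differ where the ordering hinged on a character that the mask ignores: suffix 5 (at$, masked a*$) now precedes suffix 1 (agctat$, masked a*ct*t$), and suffix 3 precedes suffix 0.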
Figure 10:
A demonstration of how DisLex constructs a spaced suffix array using an example string [formula] atggacgacac$ and mask [formula] 101. The characters at the 0 positions of the mask have been mapped to the character [formula]. (a) The input string with extra padding. (b) Lexically sorting all the length-3 distinct substrings of T. The mapping RANK is defined using this ordering. (c), (d), (e) Constructing [formula], [formula] and [formula], respectively. (f) Constructing [formula] by concatenating [formula], [formula] and [formula]. (g) The suffix array of [formula] (above) is transformed to the spaced suffix array of T (below).
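Steps (a) to (g) of this caption can be sketched as follows. This is a reading of the caption rather than the authors' code, and several details are assumptions: the padding character `#` (chosen to sort below `$`), the use of `*` for don't-care positions, the function names, and the restriction to masks that end in 1 and contain no two adjacent 0s, which keeps this simplified version free of ties. A naive sort stands in for the standard suffix array construction algorithm that DisLex is meant to reuse.

```python
def spaced_suffix_array_naive(text, mask):
    """Reference: sort suffixes with don't-care positions replaced by '*'."""
    w = len(mask)

    def masked(i):
        return "".join(
            c if mask[k % w] == "1" else "*" for k, c in enumerate(text[i:])
        )

    return sorted(range(len(text)), key=masked)


def dislex_spaced_suffix_array(text, mask, pad="#"):
    w = len(mask)
    # Assumptions of this sketch (not taken from the source).
    assert mask.endswith("1") and "00" not in mask
    assert text.count(text[-1]) == 1 and text[-1] == min(text)
    assert pad < min(text)

    # (a) Pad the input so every position starts a full length-w block.
    padded = text + pad * (w - 1)

    def block(i):
        return "".join(c if m == "1" else "*" for c, m in zip(padded[i:i + w], mask))

    # (b) Rank the distinct masked length-w substrings in lexical order.
    blocks = [block(i) for i in range(len(text))]
    rank = {b: r for r, b in enumerate(sorted(set(blocks)))}

    # (c)-(f) Build one rank string per residue class modulo w, concatenated.
    positions = [i for r in range(w) for i in range(r, len(text), w)]
    reduced = [rank[blocks[i]] for i in positions]

    # (g) Ordinary suffix array of the reduced string, mapped back to T.
    order = sorted(range(len(reduced)), key=lambda k: reduced[k:])
    return [positions[k] for k in order]


text, mask = "atggacgacac$", "101"
print(dislex_spaced_suffix_array(text, mask))
```

Because every block has the same length, comparing two sequences of block ranks is equivalent to comparing the masked suffixes character by character, which is why an unmodified suffix sorter can be applied to the reduced string.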
