A bioinformatician's guide to the forefront of suffix array construction algorithms
- PMID: 24413184
- PMCID: PMC3956071
- DOI: 10.1093/bib/bbt081
Abstract
The suffix array and its variants are text-indexing data structures that have become indispensable in the field of bioinformatics. With the uninitiated in mind, we provide an accessible exposition of the SA-IS algorithm, which is the state of the art in suffix array construction. We also describe DisLex, a technique that allows standard suffix array construction algorithms to create modified suffix arrays designed to enable a simple form of inexact matching needed to support 'spaced seeds' and 'subset seeds' used in many biological applications.
Keywords: linear-time algorithm; spaced seeds; subset seeds; suffix array construction; text index.
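As a rough illustration of the two structures named in the abstract, the following sketch builds a suffix array and a spaced suffix array by direct sorting. It is neither SA-IS nor DisLex: both functions use naive comparison sorting rather than linear-time construction. The function names, the `'1'`/`'0'` mask notation, the cyclic repetition of the mask and the tie-breaking by position are assumptions made for this example, and the article's exact definitions may differ.

```python
def suffix_array(text):
    """Naive suffix array: start positions sorted by the suffix they begin.

    Illustrative only; this is comparison sorting, not SA-IS.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def spaced_key(text, i, mask):
    """Characters of suffix i at offsets where the cyclically repeated
    mask has a '1' (a "care" position). Assumed convention."""
    suffix = text[i:]
    return "".join(c for k, c in enumerate(suffix) if mask[k % len(mask)] == "1")


def spaced_suffix_array(text, mask):
    """Naive spaced suffix array: positions sorted by their masked suffix.

    Ties are broken by start position (an assumption of this sketch).
    """
    return sorted(range(len(text)), key=lambda i: (spaced_key(text, i, mask), i))


if __name__ == "__main__":
    t = "GATTACA"
    print(suffix_array(t))                # [6, 4, 1, 5, 0, 3, 2]
    print(spaced_suffix_array(t, "101"))  # [6, 4, 1, 5, 0, 2, 3]
```

With the mask `101`, suffixes 2 (`TTACA`) and 3 (`TACA`) swap places relative to the ordinary suffix array, because their second characters are ignored when comparing.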
Figures
[Figure images and most mathematical symbols were lost in extraction. Recoverable caption content:]
- Construction of the reduced instance by lexical naming, with example RANK values 1, 2 and 0.
- Bucket layout by suffix type. The bucket for T has no subbucket for one suffix type because no suffix of that type can start with the lexically greatest character of the alphabet.
- Suffixes placed in their buckets in sorted order, with this order obtained from recursion.
- A sweep in which pointers track the current heads of the subbuckets: when suffix index 13 is encountered, 12 is inserted at A5, the current head of the relevant bucket.
- A sweep in which pointers track the current tails of the subbuckets: suffix index 0 requires no action; when suffix index 2 is encountered, 1 is inserted at A9, the current tail of the relevant bucket.
- Spaced suffix array construction: (a) the input string with extra padding; (b) lexical sorting of all distinct length-3 substrings of T, which defines the mapping RANK; (c) to (e) construction of three intermediate strings; (f) their concatenation; (g) transformation of the suffix array of the concatenated string into the spaced suffix array of T.
Similar articles
- Indexing huge genome sequences for solving various problems. Genome Inform. 2001;12:175-83. PMID: 11791236
- gsufsort: constructing suffix arrays, LCP arrays and BWTs for string collections. Algorithms Mol Biol. 2020 Sep 22;15:18. doi: 10.1186/s13015-020-00177-y. eCollection 2020. PMID: 32973918 Free PMC article.
- mkESA: enhanced suffix array construction tool. Bioinformatics. 2009 Apr 15;25(8):1084-5. doi: 10.1093/bioinformatics/btp112. Epub 2009 Feb 26. PMID: 19246510 Free PMC article.
- A space-efficient construction of the Burrows-Wheeler transform for genomic data. J Comput Biol. 2005 Sep;12(7):943-51. doi: 10.1089/cmb.2005.12.943. PMID: 16201914 Review.
- Penalized feature selection and classification in bioinformatics. Brief Bioinform. 2008 Sep;9(5):392-403. doi: 10.1093/bib/bbn027. Epub 2008 Jun 18. PMID: 18562478 Free PMC article. Review.
Cited by
- Lightweight Pattern Matching Method for DNA Sequencing in Internet of Medical Things. Comput Intell Neurosci. 2022 Sep 8;2022:6980335. doi: 10.1155/2022/6980335. eCollection 2022. PMID: 36120669 Free PMC article.
- Establishment of a polymerase chain reaction-based method for strain-level management of Enterococcus faecalis EF-2001 using species-specific sequences identified by whole genome sequences. Front Microbiol. 2022 Aug 12;13:959063. doi: 10.3389/fmicb.2022.959063. eCollection 2022. PMID: 36033901 Free PMC article.
- The Utility of Data Transformation for Alignment, De Novo Assembly and Classification of Short Read Virus Sequences. Viruses. 2019 Apr 26;11(5):394. doi: 10.3390/v11050394. PMID: 31035503 Free PMC article.
- RIblast: an ultrafast RNA-RNA interaction prediction system based on a seed-and-extension approach. Bioinformatics. 2017 Sep 1;33(17):2666-2674. doi: 10.1093/bioinformatics/btx287. PMID: 28459942 Free PMC article.
- Two Efficient Techniques to Find Approximate Overlaps between Sequences. Biomed Res Int. 2017;2017:2731385. doi: 10.1155/2017/2731385. Epub 2017 Feb 15. PMID: 28293632 Free PMC article.
