Bioinformatics, 26 (12), i367-73

Efficient Construction of an Assembly String Graph Using the FM-index

Jared T Simpson et al. Bioinformatics.

Abstract

Motivation: Sequence assembly is a difficult problem whose importance has grown again recently as the cost of sequencing has dramatically dropped. Most new sequence assembly software has started by building a de Bruijn graph, avoiding the overlap-based methods used previously because of the computational cost and complexity of these with very large numbers of short reads. Here, we show how to use suffix array-based methods that have formed the basis of recent very fast sequence mapping algorithms to find overlaps and generate assembly string graphs asymptotically faster than previously described algorithms.

Results: Standard overlap assembly methods have time complexity O(N^2), where N is the sum of the lengths of the reads. We use the Ferragina-Manzini index (FM-index) derived from the Burrows-Wheeler transform to find overlaps of length at least τ among a set of reads. As well as an approach that finds all overlaps then implements transitive reduction to produce a string graph, we show how to output directly only the irreducible overlaps, significantly shrinking memory requirements and reducing compute time to O(N), independent of depth. Overlap-based assembly methods naturally handle mixed length read sets, including capillary reads or long reads promised by the third generation sequencing technologies. The algorithms we present here pave the way for overlap-based assembly approaches to be developed that scale to whole vertebrate genome de novo assembly.
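
As an illustration of the exhaustive approach described above, the following is a minimal sketch of suffix-prefix overlap detection by FM-index backward search. It is not the authors' implementation: the function names (build_fm_index, extend, find_overlaps), the naive suffix sorting, the uncompressed occurrence table, and the restriction to forward-strand reads are all assumptions made to keep the example short. The idea shown is that each suffix of a read is matched one character at a time, and once the matched length reaches tau, one further backward step on the read separator '$' yields the reads that begin with that suffix.

```python
def build_fm_index(reads):
    # Each read is preceded by '$'; '\x00' is a unique, smallest sentinel.
    text = "".join("$" + r for r in reads) + "\x00"
    starts, pos = {}, 0
    for rid, r in enumerate(reads):
        starts[pos + 1] = rid          # read rid begins right after its '$'
        pos += len(r) + 1
    # Naive suffix array (quadratic or worse); fine for a toy example only.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = [text[i - 1] for i in sa]    # i == 0 wraps around to the sentinel
    alphabet = sorted(set(text))
    C, total = {}, 0                   # C[c] = number of characters smaller than c
    for c in alphabet:
        C[c] = total
        total += text.count(c)
    occ = {c: [0] for c in alphabet}   # occ[c][i] = count of c in bwt[:i]
    for ch in bwt:
        for c in alphabet:
            occ[c].append(occ[c][-1] + (ch == c))
    return sa, C, occ, starts


def extend(c, lo, hi, C, occ):
    # One backward-search step: prepend character c to the matched string.
    if c not in C:
        return 0, 0
    return C[c] + occ[c][lo], C[c] + occ[c][hi]


def find_overlaps(reads, tau):
    """Return (i, j, k): the last k >= tau bases of read i equal the first k of read j."""
    sa, C, occ, starts = build_fm_index(reads)
    overlaps = []
    for i, x in enumerate(reads):
        lo, hi = 0, len(sa)
        for k in range(1, len(x)):     # proper suffixes only
            lo, hi = extend(x[-k], lo, hi, C, occ)
            if lo >= hi:
                break                  # this suffix occurs nowhere; longer ones cannot either
            if k >= tau:
                # Suffixes preceded by '$' are prefixes of some read.
                plo, phi = extend("$", lo, hi, C, occ)
                for row in range(plo, phi):
                    j = starts[sa[row] + 1]
                    if j != i:         # self-overlaps are skipped in this sketch
                        overlaps.append((i, j, k))
    return overlaps


reads = ["ACGTAC", "GTACGA", "ACGATT"]
print(find_overlaps(reads, tau=2))
```

With tau=2 the three toy reads produce overlaps between reads 0 and 1, reads 1 and 2, and a shorter overlap between reads 0 and 2. The last of these is implied by the other two, which is the kind of transitive overlap that the direct algorithm avoids emitting.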

Figures

Fig. 1.
Diagram of a simple string graph. Three overlapping reads (R1, R2, R3) are shown in (A). (B) shows the string graph constructed from the overlaps between the reads. The arrowheads pointing into the nodes depict an edge of type B and arrowheads pointing away from the nodes depict edges of type E. The edge R1 ↔ R3 is transitive.
Fig. 2.
The running time of the direct and exhaustive overlap algorithms for simulated E. coli data with sequence depth from 5× to 100×. The direct overlap algorithm scales linearly with sequence depth. As the number of overlaps grows quadratically with sequence depth, the exhaustive overlap algorithm exhibits above-linear scaling.
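
The transitive edge in Fig. 1 can be illustrated with a small sketch of transitive reduction. This is a simplified stand-in, not the algorithm from the paper: it assumes an acyclic overlap graph, treats edges as plain directed pairs (ignoring the B and E edge types and read orientation), and the function name transitive_reduction is hypothetical. An edge u -> v is dropped when some other successor w of u also has an edge w -> v.

```python
def transitive_reduction(edges):
    """Remove edges (u, v) for which a two-step path u -> w -> v exists."""
    succ = {}
    for u, v in edges:
        succ.setdefault(u, set()).add(v)
    reduced = []
    for u, v in edges:
        # Is v reachable through another direct successor of u?
        via_other = any(v in succ.get(w, ()) for w in succ[u] if w != v)
        if not via_other:
            reduced.append((u, v))
    return reduced


# The three reads of Fig. 1: R1 overlaps R2 and R3, and R2 overlaps R3.
edges = [("R1", "R2"), ("R2", "R3"), ("R1", "R3")]
print(transitive_reduction(edges))
```

Only the irreducible edges R1 -> R2 and R2 -> R3 remain. Because the number of overlaps grows quadratically with depth, building the full edge list before reducing it is what makes the exhaustive approach scale worse than the direct one.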
