In bioinformatics and computational biology, the most basic data on which all analysis is built is the sequence data of genes, proteins and RNA. The sequence of an entire genome is the solution to the genome assembly problem. This contribution provides an overview of the problem-solving approaches applied to genome assembly for next-generation sequencing (NGS) platforms. It discusses the major genome assemblers proposed in the literature during the past decade and outlines their basic working principles. It is intended as a qualitative, not a quantitative, tutorial for all those working on genome assemblers for next-generation sequencers. We discuss the theoretical aspects of various genome assemblers and identify their working schemes. We also briefly discuss the direction in which the area is headed, along with core issues of software simplicity.
Copyright © 2012 Beijing Institute of Genomics, Chinese Academy of Sciences. Published by Elsevier Ltd. All rights reserved.
Leading advancements in sequencing schemes during 2000–2010. Please note that this figure is not an exhaustive list; it shows only the major developments.
Schemes and their associated algorithms. The figure depicts the most fundamental schemes adopted by assembly algorithms. The algorithms are listed in order to clarify fundamental concepts; however, the same algorithm can be categorized under more than one approach. For instance, all Eulerian path approach algorithms could be categorized under graph-based schemes. Likewise, assisted assembly can be categorized under both comparative assembly and the overlap-layout-consensus approach, since it uses concepts from both.
Graph correction techniques. (A) Disambiguation: the loop edge is unrolled and integrated into the continuous edge from left to right. (B) Pulling apart operation: the case shown could have four possible options, as shown in panels (C–F). However, it is assumed here that there are only two possible paths, as shown in (C) and (F), black going to black and shaded region going to shaded region, respectively, in which case the middle sequence (black) is duplicated and the two disconnected paths are made. (C–F) Eulerian super-path to Eulerian path transformation: solving repeats. Repeats create difficulties since the algorithm cannot identify whether the correct path is the one shown in (C), (D), (E) or (F). Two paths are consistent if their union is a path again. For multiple edges there are three possibilities: (i) path X, shown above, is consistent with exactly one of the sets in (C) or (D), as there is only one solution; (ii) X is consistent with neither of the paths shown in (C) or (D); (iii) X is consistent with both (C) and (D). Cases (ii) and (iii) are resolved after determining (i) for the entire graph and removing all poor-quality reads. (G) Removing nodes: nodes that have indegree = outdegree = 1 are collapsed to form one giant node called a unitig. (H) Removing edges: an edge between (A and C) is removed if edges between (A and C), or (C and C), exist. (I) Velvet – removing tips: a tip is defined as a chain of nodes that is disconnected at one end. Tips are removed based upon length and minority count. If a tip is shorter than 2k, then it is removed. The minority count property concerns the point at which the tip connects to the graph (the parent from which the initial branching took place); if there is a longer path, or a more common path, then the tip is removed. In this case c is removed. Edena – dead-end path removal: similar to removing tips, here each path starting from a branching node is checked to see whether its depth is greater than ϒ or not. Heuristically, ϒ = 10. If not, then all the nodes in the path, excluding the branching node, are removed. These short paths are normally caused by base-calling errors. (J) E1 − E2 detachment: edges E1 and E2 are replaced by a new edge E3 that directs all paths from V_in to V_out. (K) Removal of transitive edges: if E1 < E2 such that edge E1 is overlapped by E2, then E1 is a transitive edge and is removed.
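The node-removal step in panel (G) can be illustrated with a minimal sketch. It assumes the graph is given as a plain list of directed edges; the function name `collapse_linear_nodes` and the example graph are hypothetical and are not taken from any of the assemblers discussed. Only runs of nodes with indegree = outdegree = 1 are merged, as the caption states; production assemblers typically apply a broader merging rule.

```python
from collections import defaultdict


def collapse_linear_nodes(edges):
    """Group maximal runs of nodes with indegree == outdegree == 1.

    `edges` is an iterable of (source, target) pairs of a directed graph.
    Returns a list of chains; each chain is a list of node names that
    would be collapsed into a single unitig node.
    """
    succ, pred = defaultdict(list), defaultdict(list)
    nodes = set()
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)
        nodes.update((u, v))

    def linear(n):
        return len(pred[n]) == 1 and len(succ[n]) == 1

    unitigs, seen = [], set()
    for n in sorted(nodes):
        if n in seen or not linear(n):
            continue
        # Walk backwards to the first linear node of this run
        # (the second condition stops the walk on a pure cycle).
        start = n
        while linear(pred[start][0]) and pred[start][0] != n:
            start = pred[start][0]
        # Walk forwards, collecting consecutive linear nodes.
        chain, cur = [], start
        while linear(cur) and cur not in seen:
            chain.append(cur)
            seen.add(cur)
            cur = succ[cur][0]
        unitigs.append(chain)
    return unitigs


# Hypothetical graph: A -> B -> C -> D, with D branching to E and F.
example_edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("D", "F")]
print(collapse_linear_nodes(example_edges))
```

In this example only B and C have exactly one incoming and one outgoing edge, so they form the single collapsed run; A, D, E and F are left as separate nodes.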
Making the A-Bruijn graphs. (A) Using pair-wise alignments, an A-Bruijn graph G is built from the sequence. (B) Pair-wise alignments are calculated: ATG versus AT, while in (C), it is AT versus AAG and ATG versus AAG. (D) Final assembly of the A-Bruijn graph after collapsing all similar nodes. The resultant graph may contain whirls and bulges, which need to be rectified. Herein, G A → T ← A is a whirl. (E) In the event of a mismatch, two nodes are collapsed and both instances are kept. Figure was adapted from .
Multiple local alignments. (A) Spectrum: collection of all reads. (B) Collection of all k-mers derived from the reads. (C) The multiple alignment algorithm takes a collection of unique k-mers. Then, for each k-mer, it performs a local alignment against all the reads, identifying the reads in which the k-mer is present, along with the starting position of the alignment within each read and the orientation of the alignment. Using this, an overall alignment of the entire spectrum is obtained. (D) Unipath intervals: set of all k-mer path intervals. (E) Creating unipaths: take the first k-mer path interval and its predecessor and concatenate them. Similarly, take the last k-mer in the unipath and its successor and concatenate them. Repeat this process iteratively to create unipaths. The unipath interval that is obtained using several k-mer path intervals in this example is [C, H]. (F) Branches: a branch in a graph is the point in the genome where there is a k-mer that appears in two or more places for which the next (or previous) bases are different. Here ([A, Z], [A, H]) and ([Z, B], [H, B]) form a branch.
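Panels (B) and (C) can be sketched in a few lines, under the simplifying assumption that "local alignment" is reduced to exact k-mer matching. The function names and the two example reads are hypothetical; positions on the "-" strand are reported in the coordinates of the reverse-complemented read.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmer_spectrum(reads, k):
    """Collect the set of distinct k-mers occurring in the reads."""
    return {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}


def locate_kmers(reads, k):
    """Map each k-mer to its hits: (read index, start position, strand)."""
    hits = defaultdict(list)
    for idx, read in enumerate(reads):
        for strand, seq in (("+", read), ("-", reverse_complement(read))):
            for i in range(len(seq) - k + 1):
                hits[seq[i:i + k]].append((idx, i, strand))
    return hits


reads = ["AACG", "ACGG"]        # illustrative reads, not from the source
print(sorted(kmer_spectrum(reads, 3)))
print(locate_kmers(reads, 3)["ACG"])
```

An exact-match index of this kind is only a stand-in: an alignment that tolerates mismatches would need a proper local aligner.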
Velvet – making the database and the graph. Velvet uses two databases, (A) and (B), which are combined in a fashion somewhat similar to the database used by Allpaths in Figure 5. A hash table is used in (A) to store every k-mer, the ID of the first read encountered containing that k-mer, the start position of its occurrence within that read, and additionally its reverse complement. The second database (B) records, for each read, which of its original k-mers are overlapped by other reads. Using (A) and (B), a third list of ordered original (unique) k-mers is made. The list is compartmentalized each time an overlap with another start or end occurs. The continuous set of reads in each compartment forms the nodes of the graph in (C). Overlap between the last k-mer of one node and the first k-mer of the next node produces a directed edge (shown as a yellow line) (C1). The blue lines represent the overlap of k-1 nucleotides between k-mers in the same node (C2). Furthermore, whenever an edge exists between a single parent node N and its only child node, the two nodes are merged. This figure is adapted from .
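Database (A) can be approximated with an ordinary dictionary. The sketch below follows the caption literally (k-mer, first read ID, start position, reverse complement) and should not be read as Velvet's actual data layout; `build_kmer_table` and the example reads are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_kmer_table(reads, k):
    """Record, for each k-mer, where it was first seen.

    Returns a dict mapping k-mer -> (first read ID, start position in that
    read, reverse complement of the k-mer). Later occurrences of the same
    k-mer do not overwrite the first one.
    """
    table = {}
    for read_id, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            kmer = read[pos:pos + k]
            table.setdefault(kmer, (read_id, pos, reverse_complement(kmer)))
    return table


reads = ["ACGTA", "CGTAC"]      # illustrative reads, not from the source
for kmer, entry in build_kmer_table(reads, 3).items():
    print(kmer, entry)
```

Because `setdefault` keeps the first entry, a k-mer shared by both reads stays attributed to the earlier read, which is the property database (B) relies on when it records which k-mers of a read are overlapped by other reads.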
Velvet – removing bubbles. As we progress along graph simplification, we see that shorter paths are being merged with longer paths. From (A) to (B), C is merged with C′ to form C′, and B is merged with B′ to form B. A similar process is repeated by merging B and B″ to form the graph shown in panel (C) and finally panel (D). p-bubble fixing in EDENA is similar to that in Velvet. p-bubbles are branches caused by a single base substitution. Each branch is explored up to length δ. The length of the p-bubble is at most δ = 4 × (read length) − 2 × (minimum required overlap size) − 1. These are resolved by removing nodes on the less covered side of the bubble.
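As a worked instance of the δ bound quoted in this caption, the short sketch below evaluates the formula. The function name `p_bubble_bound`, the read length of 35 and the minimum overlap of 20 are illustrative assumptions, not values given in the source.

```python
def p_bubble_bound(read_length, min_overlap):
    """Evaluate delta = 4 * (read length) - 2 * (minimum overlap) - 1."""
    return 4 * read_length - 2 * min_overlap - 1


# Illustrative values: 4 * 35 - 2 * 20 - 1 = 140 - 40 - 1 = 99
print(p_bubble_bound(35, 20))
```

The bound grows with read length and shrinks as the required overlap increases, so stricter overlap settings limit how far each branch is explored.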
Assisted assembly. (A) The start/stop point of every read is inferred via local alignment with the reference genome. Reads are also allowed to be placed more than once, in order to allow for duplicated regions within the genome. (B) If the start/stop positions of the reads overlap other reads by a user-defined number X of bases, then the reads are placed in one group. All such reads that overlap based on their positions in the reference genome form groups. (C) All groups are used to enlarge pre-existing contigs. If the group belongs to one contig, then that contig is enlarged (C1). If the group belongs to two contigs, then the closest one is enlarged (C2). If a group does not belong to any contig, then the group itself becomes a new contig (C3). Once all the groups are dealt with, reads are taken from the groups one by one and are aligned with the contigs to extend them.
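The grouping rule in panel (B) can be sketched as an interval sweep. This is a minimal illustration that assumes each read placement is a (read ID, start, stop) triple on the reference with an exclusive stop coordinate; `group_reads`, the placements and the threshold of 3 bases are hypothetical.

```python
def group_reads(placements, min_overlap):
    """Group reads whose reference placements overlap by >= min_overlap bases.

    `placements` is a list of (read_id, start, stop) with stop exclusive.
    Returns a list of groups, each a list of read IDs in start order.
    """
    groups = []
    group_end = None  # furthest stop coordinate reached by the current group
    for read_id, start, stop in sorted(placements, key=lambda p: (p[1], p[2])):
        # Overlap with the read in the current group that reaches furthest.
        if groups and min(stop, group_end) - start >= min_overlap:
            groups[-1].append(read_id)
            group_end = max(group_end, stop)
        else:
            groups.append([read_id])
            group_end = stop
    return groups


placements = [("r1", 0, 10), ("r2", 6, 16), ("r3", 15, 25), ("r4", 40, 50)]
print(group_reads(placements, 3))
```

Here r1 and r2 share four reference bases and are grouped, while r3 overlaps r2 by only one base and therefore starts a new group. A read placed more than once would simply appear as several triples with the same ID.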
AMOS-Cmp with layout refinement and gene boosted assembly. Insertions in the reference (top left): (A) This is identified by reads aligned such that they span the inserted area of the reference genome and align perfectly to either side of it. (B) In such a case the ‘seeming gap’ is closed. (C) The genome surrounding the inserted area is considered as one contig. Insertions in the target sequence (top right): (D) This is identified by reads whose former portions align perfectly yet whose latter portions diverge from the reference sequence. (E) This is resolved by breaking up the target genome at the point of insertion, producing two contigs, which are then (F) connected using the singletons of any assembly algorithm. Rearrangement (bottom right): regions 2 and 3 differ in their order and orientation between the reference and the target. Reads (G) and (H) match disjoint locations of the reference genome, shown as being connected via dashed lines. Figure was adapted from . Gene boosted assembly (bottom left): contigs (A and B) form target 1, while contigs (C, D and E) form target 2. This method shows how two comparative assemblies can be used to close the gaps that occur in genome assembly. The contigs (A, B, C, D and E) are merged to obtain the target genome. The shaded regions in the target genome and their corresponding locations in the contigs show how this simple and elegant scheme works.
SHARCGS for contig extension. The shaded region in the contig to be extended is used as a prefix to gather all reads that share the same prefix. A prefix tree is employed to search efficiently for all plausible reads having the same prefix. The ‘Extension’ region of the read “R” is the plausible extension sequence of the contig. To determine which of the possible reads is to be used for extending the contig, a check sequence “M” is employed. This is made by combining the last ‘r’ bases of the extending contig and the extension region of R. Sub-strings of M are made and act as prefixes for searching other reads in the prefix tree. If all sub-strings retrieve one possible read whose prefix matches it, then the contig is extended; otherwise it is not.
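The prefix-based extension idea can be sketched in a heavily simplified form. The code below is not the SHARCGS procedure: it replaces the prefix tree with a linear scan, and it replaces the check sequence M with a plain requirement that all reads sharing the prefix agree on a single extension. The function name `extend_contig`, the overlap length and the example sequences are hypothetical.

```python
def extend_contig(contig, reads, overlap):
    """Extend a contig by one read if the extension is unambiguous.

    The last `overlap` bases of the contig serve as the search prefix.
    The contig is extended only when every read starting with that prefix
    proposes the same extension; otherwise it is returned unchanged.
    """
    seed = contig[-overlap:]
    extensions = {
        read[overlap:]
        for read in reads
        if len(read) > overlap and read.startswith(seed)
    }
    if len(extensions) != 1:
        return contig  # no candidate, or conflicting candidates
    return contig + extensions.pop()


print(extend_contig("ACGTAC", ["TACGG", "GGTTA", "TACGG"], 3))  # unambiguous
print(extend_contig("ACGTAC", ["TACGG", "TACCA"], 3))           # ambiguous
```

Refusing to extend on any disagreement is a conservative choice made for this sketch: it trades contig length for a lower risk of joining across a repeat.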