Nat Methods. 2011 Jan;8(1):61-5.
doi: 10.1038/nmeth.1527. Epub 2010 Nov 21.

Limitations of Next-Generation Genome Sequence Assembly

Free PMC article

Can Alkan et al. Nat Methods. 2011.


High-throughput sequencing technologies promise to transform the fields of genetics and comparative biology by delivering tens of thousands of genomes in the near future. Although it is feasible to construct de novo genome assemblies in a few months, there has been relatively little attention to what is lost by sole application of short sequence reads. We compared the recent de novo assemblies using the short oligonucleotide analysis package (SOAP), generated from the genomes of a Han Chinese individual and a Yoruban individual, to experimentally validated genomic features. We found that de novo assemblies were 16.2% shorter than the reference genome and that 420.2 megabase pairs of common repeats and 99.1% of validated duplicated sequences were missing from the genome. Consequently, over 2,377 coding exons were completely missing. We conclude that high-quality sequencing approaches must be considered in conjunction with high-throughput sequencing for comparative genomics analyses and studies of genome evolution.


Figure 1
Summary of de novo genome assembly and new sequence analysis. (a) Venn diagram comparing insertion sequences (total base pairs that do not exist in the reference genome build 36) detected by fosmid end sequencing and de novo assembly for the same genome (NA18507). The number of base pairs of Epstein-Barr virus contamination is also shown. Approximately 1.6 Mbp of new insertion sequence aligns with 1.42 Mbp detected by de novo assembly with NGS. (b) Average sequence identity of L1 common repeat sequences and depletion ratio in the YH genome assembly. (c) The pairwise sequence identity distribution of duplicated sequences in the YH genome compared to the human reference genome and a WGS assembly based on capillary sequence (Celera). (d) Number of base pairs in segmental duplications detected in the YH assembly (YH WGAC) compared with duplications common to NCBI build 36 WGAC analysis (≥94% sequence identity) and read-depth analyses of the capillary-based (Celera) and YH (intersection of three datasets).
