Comparative Study
Proc Natl Acad Sci U S A. 2004 Feb 17;101(7):1916-21.
doi: 10.1073/pnas.0307971100. Epub 2004 Feb 9.

Whole-genome Shotgun Assembly and Comparison of Human Genome Assemblies

Free PMC article

Sorin Istrail et al. Proc Natl Acad Sci U S A. 2004.


We report a whole-genome shotgun assembly (called WGSA) of the human genome generated at Celera in 2001. The Celera-generated shotgun data set consisted of 27 million sequencing reads organized in pairs by virtue of end-sequencing 2-kbp, 10-kbp, and 50-kbp inserts from shotgun clone libraries. The quality-trimmed reads covered the genome 5.3 times, and the inserts from which pairs of reads were obtained covered the genome 39 times. With the nearly complete human DNA sequence [National Center for Biotechnology Information (NCBI) Build 34] now available, it is possible to directly assess the quality, accuracy, and completeness of WGSA and of the first reconstructions of the human genome reported in two landmark papers in February 2001 [Venter, J. C., Adams, M. D., Myers, E. W., Li, P. W., Mural, R. J., Sutton, G. G., Smith, H. O., Yandell, M., Evans, C. A., Holt, R. A., et al. (2001) Science 291, 1304-1351; International Human Genome Sequencing Consortium (2001) Nature 409, 860-921]. The analysis of WGSA shows 97% order and orientation agreement with NCBI Build 34, where most of the 3% of sequence out of order is due to scaffold placement problems as opposed to assembly errors within the scaffolds themselves. In addition, WGSA fills some of the remaining gaps in NCBI Build 34. The early genome sequences all covered about the same amount of the genome, but they did so in different ways. The Celera results provide more order and orientation, and the consortium sequence provides better coverage of exact and nearly exact repeats.
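The coverage figures in the abstract can be sanity-checked with simple arithmetic: coverage is roughly (number of units × mean unit length) / genome size. The sketch below inverts that relation to estimate the mean trimmed read length and mean insert length implied by the reported numbers. The genome size of 2.9 Gbp and the assumption that all 27 million reads form 13.5 million pairs are illustrative assumptions, not values stated in the abstract.

```python
# Back-of-the-envelope check of the coverage figures quoted in the abstract.
# ASSUMPTIONS (not from the source): genome size of 2.9e9 bp, and every one
# of the 27 million reads belonging to a pair (13.5 million inserts).

ASSUMED_GENOME_SIZE_BP = 2.9e9  # hypothetical round figure for illustration


def implied_mean_length(unit_count, coverage, genome_size_bp):
    """Mean length implied by coverage = unit_count * mean_length / genome_size."""
    return coverage * genome_size_bp / unit_count


# 27 million reads covering the genome 5.3 times (sequence coverage).
mean_read_bp = implied_mean_length(27e6, 5.3, ASSUMED_GENOME_SIZE_BP)

# 13.5 million inserts (assumed) covering the genome 39 times (clone coverage).
mean_insert_bp = implied_mean_length(13.5e6, 39, ASSUMED_GENOME_SIZE_BP)

print(f"implied mean trimmed read length: {mean_read_bp:.0f} bp")
print(f"implied mean insert length: {mean_insert_bp:.0f} bp")
```

Under these assumptions the implied mean insert length falls between the 2-kbp and 10-kbp library sizes, which is consistent with a mixture of 2-kbp, 10-kbp, and 50-kbp inserts dominated by the shorter libraries.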


Fig. 1.
Dot-plot representation of sample assembly comparison results. Horizontal axes correspond to intervals along NCBI-34, and vertical axes correspond to intervals along various assemblies, with the sequences starting from the bottom left corner. Diagonal lines show the relative positions and orientations of matches. Identical sequences would yield one diagonal line. Vertical bars represent gaps between NCBI-34 contigs. Selected regions were chosen to represent general observations regarding the assemblies; related figures of entire chromosomes are provided for all chromosomes in Data Set 7. (a) Illustration of a region in which WGSA can augment NCBI-34. Shown are the first 6 Mbp of NCBI-34 human chromosome 1 versus part of a single scaffold of WGSA. The second NCBI-34 contig is inverted, and the third and fourth contigs are interchanged, compared with WGSA. We postulate that this is an NCBI-34 contig mapping problem. Alternative explanations, such as misassembly or polymorphisms within the WGSA scaffold that coincidentally occur at the boundaries of NCBI-34 contigs, are improbable. (b–f) Comparison of the NCBI-34 human chromosome 1 region from 34–40 Mbp against the primary matching regions of WGSA (b), WGA (c), CSA (d), HG06 (e), and NCBI-28 (f). (See main text for description of assemblies.) WGSA agrees closely with NCBI-34 and spans and largely fills two gaps between NCBI-34 contigs. All other assemblies have multiple order and orientation errors. For all but HG06, the misplaced segments correspond to entire scaffolds (data not shown). For HG06, errors are a mix of within-scaffold rearrangements and scaffold order and orientation. WGA and HG06 both have a relatively large number of small, misplaced scaffolds, whereas CSA and NCBI-28 have a few, larger scaffolds that are misplaced.
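To make the dot-plot idea concrete, the following is a minimal sketch of how match positions and orientations between a reference and an assembly could be collected using exact k-mer matches. It is not the comparison method used in the paper, which the caption does not describe; the function names `reverse_complement` and `kmer_matches`, the choice of k, and the toy sequence are all hypothetical.

```python
from collections import defaultdict

# Illustrative only: exact k-mer matching as a stand-in for whatever
# alignment procedure produced the published dot plots (an assumption).

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def kmer_matches(reference, assembly, k):
    """Return (ref_pos, asm_pos, orientation) for every shared k-mer.

    Orientation is "+" for a same-strand match and "-" when the assembly
    k-mer matches the reference only after reverse complementing.
    """
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)

    matches = []
    for j in range(len(assembly) - k + 1):
        kmer = assembly[j:j + k]
        for i in index.get(kmer, []):
            matches.append((i, j, "+"))
        for i in index.get(reverse_complement(kmer), []):
            matches.append((i, j, "-"))
    return matches


toy_reference = "ACGGTCATTG"  # hypothetical toy sequence
same = kmer_matches(toy_reference, toy_reference, 4)
inverted = kmer_matches(toy_reference, reverse_complement(toy_reference), 4)
print(same)
print(inverted)
```

Plotting the "+" points for identical sequences gives a single rising diagonal, as the caption notes, while an inverted segment produces points along a falling diagonal, which is how orientation errors show up visually.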
Fig. 2.
The proportion of the 19,667 RefSeq mRNA sequences that can be aligned to each of the genomes at various coverage thresholds and more than 95% sequence identity.
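The quantity plotted in Fig. 2 can be sketched as follows, assuming each mRNA's best alignment is summarized by the fraction of the mRNA covered and the sequence identity. The function name `proportion_aligned`, the tuple layout, and the toy values are hypothetical; only the 95% identity cutoff comes from the caption.

```python
# Illustrative sketch of the Fig. 2 metric. Alignment summaries here are
# made-up toy values, not data from the paper.


def proportion_aligned(alignments, total_mrnas, coverage_threshold, min_identity=0.95):
    """Fraction of mRNAs aligned with coverage >= threshold and identity > min_identity.

    alignments: list of (coverage_fraction, identity) pairs, one per mRNA
    that aligned at all; mRNAs with no alignment count only in total_mrnas.
    """
    hits = sum(
        1
        for coverage, identity in alignments
        if coverage >= coverage_threshold and identity > min_identity
    )
    return hits / total_mrnas


# Five toy mRNAs: four with an alignment, one with none.
toy_alignments = [(1.0, 0.99), (0.8, 0.97), (0.6, 0.999), (0.9, 0.94)]
for threshold in (0.5, 0.75, 0.95):
    print(threshold, proportion_aligned(toy_alignments, 5, threshold))
```

Raising the coverage threshold can only lower the proportion, so a curve of this kind is non-increasing along the threshold axis.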
