Nature, 456 (7218), 53-9

Accurate Whole Human Genome Sequencing Using Reversible Terminator Chemistry

David R Bentley, Shankar Balasubramanian, Harold P Swerdlow, Geoffrey P Smith, John Milton, Clive G Brown, Kevin P Hall, Dirk J Evers, Colin L Barnes, Helen R Bignell, Jonathan M Boutell, Jason Bryant, Richard J Carter, R Keira Cheetham, Anthony J Cox, Darren J Ellis, Michael R Flatbush, Niall A Gormley, Sean J Humphray, Leslie J Irving, Mirian S Karbelashvili, Scott M Kirk, Heng Li, Xiaohai Liu, Klaus S Maisinger, Lisa J Murray, Bojan Obradovic, Tobias Ost, Michael L Parkinson, Mark R Pratt, Isabelle M J Rasolonjatovo, Mark T Reed, Roberto Rigatti, Chiara Rodighiero, Mark T Ross, Andrea Sabot, Subramanian V Sankar, Aylwyn Scally, Gary P Schroth, Mark E Smith, Vincent P Smith, Anastassia Spiridou, Peta E Torrance, Svilen S Tzonev, Eric H Vermaas, Klaudia Walter, Xiaolin Wu, Lu Zhang, Mohammed D Alam, Carole Anastasi, Ify C Aniebo, David M D Bailey, Iain R Bancarz, Saibal Banerjee, Selena G Barbour, Primo A Baybayan, Vincent A Benoit, Kevin F Benson, Claire Bevis, Phillip J Black, Asha Boodhun, Joe S Brennan, John A Bridgham, Rob C Brown, Andrew A Brown, Dale H Buermann, Abass A Bundu, James C Burrows, Nigel P Carter, Nestor Castillo, Maria Chiara E Catenazzi, Simon Chang, R Neil Cooley, Natasha R Crake, Olubunmi O Dada, Konstantinos D Diakoumakos, Belen Dominguez-Fernandez, David J Earnshaw, Ugonna C Egbujor, David W Elmore, Sergey S Etchin, Mark R Ewan, Milan Fedurco, Louise J Fraser, Karin V Fuentes Fajardo, W Scott Furey, David George, Kimberley J Gietzen, Colin P Goddard, George S Golda, Philip A Granieri, David E Green, David L Gustafson, Nancy F Hansen, Kevin Harnish, Christian D Haudenschild, Narinder I Heyer, Matthew M Hims, Johnny T Ho, Adrian M Horgan, Katya Hoschler, Steve Hurwitz, Denis V Ivanov, Maria Q Johnson, Terena James, T A Huw Jones, Gyoung-Dong Kang, Tzvetana H Kerelska, Alan D Kersey, Irina Khrebtukova, Alex P Kindwall, Zoya Kingsbury, Paula I Kokko-Gonzales, Anil Kumar, Marc A Laurent, Cynthia T Lawley, Sarah E Lee, Xavier Lee, Arnold K Liao, Jennifer A Loch, Mitch Lok, Shujun Luo, Radhika M Mammen, John W Martin, Patrick G McCauley, Paul McNitt, Parul Mehta, Keith W Moon, Joe W Mullens, Taksina Newington, Zemin Ning, Bee Ling Ng, Sonia M Novo, Michael J O'Neill, Mark A Osborne, Andrew Osnowski, Omead Ostadan, Lambros L Paraschos, Lea Pickering, Andrew C Pike, Alger C Pike, D Chris Pinkard, Daniel P Pliskin, Joe Podhasky, Victor J Quijano, Come Raczy, Vicki H Rae, Stephen R Rawlings, Ana Chiva Rodriguez, Phyllida M Roe, John Rogers, Maria C Rogert Bacigalupo, Nikolai Romanov, Anthony Romieu, Rithy K Roth, Natalie J Rourke, Silke T Ruediger, Eli Rusman, Raquel M Sanches-Kuiper, Martin R Schenker, Josefina M Seoane, Richard J Shaw, Mitch K Shiver, Steven W Short, Ning L Sizto, Johannes P Sluis, Melanie A Smith, Jean Ernest Sohna Sohna, Eric J Spence, Kim Stevens, Neil Sutton, Lukasz Szajkowski, Carolyn L Tregidgo, Gerardo Turcatti, Stephanie Vandevondele, Yuli Verhovsky, Selene M Virk, Suzanne Wakelin, Gregory C Walcott, Jingwen Wang, Graham J Worsley, Juying Yan, Ling Yau, Mike Zuerlein, Jane Rogers, James C Mullikin, Matthew E Hurles, Nick J McCooke, John S West, Frank L Oaks, Peter L Lundberg, David Klenerman, Richard Durbin, Anthony J Smith

David R Bentley et al. Nature.

Abstract

DNA sequence information underpins genetic research, enabling discoveries of important biological or medical benefit. Sequencing projects have traditionally used long (400-800 base pair) reads, but the existence of reference sequences for the human and many other genomes makes it possible to develop new, fast approaches to re-sequencing, whereby shorter reads are compared to a reference to identify intraspecies genetic variation. Here we report an approach that generates several billion bases of accurate nucleotide sequence per experiment at low cost. Single molecules of DNA are attached to a flat surface, amplified in situ and used as templates for synthetic sequencing with fluorescent reversible terminator deoxyribonucleotides. Images of the surface are analysed to generate high-quality sequence. We demonstrate application of this approach to human genome sequencing on flow-sorted X chromosomes and then scale the approach to determine the genome sequence of a male Yoruba from Ibadan, Nigeria. We build an accurate consensus sequence from >30x average depth of paired 35-base reads. We characterize four million single-nucleotide polymorphisms and four hundred thousand structural variants, many of which were previously unknown. Our approach is effective for accurate, rapid and economical whole-genome re-sequencing and many other biomedical applications.
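The depth figure quoted in the abstract follows from simple arithmetic: average depth is the total number of sequenced bases divided by the genome length. The sketch below is illustrative only. The genome length of about 3.1 Gb and the function names are assumptions that do not come from the abstract, and the calculation ignores reads that fail to align.

```python
import math

# Assumed haploid human genome length in bases (illustrative; not from the abstract).
GENOME_LENGTH = 3_100_000_000


def average_depth(num_reads: int, read_length: int,
                  genome_length: int = GENOME_LENGTH) -> float:
    """Average depth = total sequenced bases / genome length."""
    return num_reads * read_length / genome_length


def reads_needed(target_depth: float, read_length: int,
                 genome_length: int = GENOME_LENGTH) -> int:
    """Smallest number of reads whose total bases reach the target depth."""
    return math.ceil(target_depth * genome_length / read_length)


if __name__ == "__main__":
    n = reads_needed(30, 35)  # 30x depth from 35-base reads
    print(f"{n:,} reads, i.e. {n // 2:,} read pairs")
    print(f"depth check: {average_depth(n, 35):.2f}x")
```

Under these assumptions, 30x depth from 35-base reads takes on the order of 2.7 billion reads, which is why per-experiment yields of several billion bases matter for whole-genome work.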

Figures

Figure 1. Sample preparation
a. DNA fragments are generated, e.g. by random shearing, and joined to a pair of oligonucleotides in a forked adapter configuration. The ligated products are amplified using two oligonucleotide primers, resulting in double-stranded blunt-ended material with a different adapter sequence on either end.
b. Formation of clonal single-molecule array. DNA fragments prepared as in a are denatured and single strands are annealed to complementary oligonucleotides on the flowcell surface (hatched in the figure). A new strand (dotted) is copied from the original strand in an extension reaction that is primed from the 3’ end of the surface-bound oligonucleotide, and the original strand is then removed by denaturation. The adapter sequence at the 3’ end of each copied strand is annealed to a new surface-bound complementary oligonucleotide, forming a bridge and generating a new site for synthesis of a second strand (shown dotted). Multiple cycles of annealing, extension and denaturation in isothermal conditions result in growth of clusters, each ~1 micron in physical diameter. This follows the basic method outlined in ref.
c. The DNA in each cluster is linearised by cleavage within one adapter sequence (gap marked by an asterisk) and denatured, generating single-stranded template for sequencing by synthesis to obtain a sequence read (read 1; the sequencing product is shown dotted). To perform paired-read sequencing, the products of read 1 are removed by denaturation, the template is used to generate a bridge, the second strand is re-synthesised (shown dotted), and the opposite strand is then cleaved (gap marked by an asterisk) to provide the template for the second read (read 2).
d. Long-range paired-end sample preparation. To sequence the ends of a long (e.g. >1 kb) DNA fragment, the ends of each fragment are tagged by incorporation of biotinylated (B) nucleotide and then circularised, forming a junction between the two ends. Circularised DNA is randomly fragmented and the biotinylated junction fragments are recovered and used as starting material in the standard sample preparation procedure illustrated in a above. The orientation of the sequence reads relative to the DNA fragment is tracked in the figure by magenta arrows. When aligned to the reference sequence, these reads are oriented with their 5’ ends towards each other (in contrast to the short insert paired reads produced as shown in a–c). See fig S17a for examples of both.
Turquoise and blue lines represent oligonucleotides and red lines represent genomic DNA. Note that all surface-bound oligonucleotides are attached to the flowcell by their 5’ ends. Dotted lines indicate newly synthesized strands during cluster formation or sequencing. See supplementary methods for details.
Figure 2. X chromosome data
a. Distribution of mapped read depth in the X chromosome dataset, sampled at every 50th position along the chromosome and displayed as a histogram (‘all’). An equivalent analysis of mapped read depth for the unique subset of these positions is also shown (‘unique only’). The solid line represents a Poisson distribution with the same mean. b. Distribution of X chromosome uniquely mapped reads as a function of GC content. Note that the x axis is % GC content and is scaled by percentile of unique sequence. The solid line is average mapped depth of unique sequence; the grey region is the central 80% of the data (10th to 90th centiles); the dashed lines are 10th and 90th centiles of a Poisson distribution with the same mean as the data.
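The comparison in panel a, an observed depth histogram set against a Poisson distribution with the same mean, can be sketched as follows. This is a minimal illustration, not the authors' analysis code. The function names are hypothetical, and the input is assumed to be a plain list of integer depths already sampled along the chromosome.

```python
import math
from collections import Counter


def poisson_pmf(k: int, mean: float) -> float:
    """Poisson probability of observing depth k, computed in log space."""
    if mean <= 0:
        raise ValueError("mean depth must be positive")
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))


def depth_histogram(depths):
    """Fraction of sampled positions at each observed depth."""
    counts = Counter(depths)
    total = len(depths)
    return {d: c / total for d, c in sorted(counts.items())}


def compare_to_poisson(depths):
    """Rows of (depth, observed fraction, Poisson expectation with the same mean)."""
    mean = sum(depths) / len(depths)
    return [(d, frac, poisson_pmf(d, mean))
            for d, frac in depth_histogram(depths).items()]


if __name__ == "__main__":
    sampled = [28, 31, 30, 29, 35, 30, 27, 33, 30, 31]  # made-up depths
    for depth, observed, expected in compare_to_poisson(sampled):
        print(f"{depth:3d}  observed={observed:.3f}  poisson={expected:.3f}")
```

Sampling every 50th position, as in the caption, would correspond to passing something like `all_depths[::50]` as the input. An observed histogram wider than the Poisson curve indicates variation in depth beyond random sampling alone.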
Figure 3. SNPs identified in the human genome sequence of NA18507
a. Number of SNPs detected by class and % in dbSNP (release 128). Results from ELAND and MAQ alignments are reported separately. b. Comparison of the SNPs detected in each analysis reveals extensive overlap. The % of NA18507 SNP calls that match previous entries in dbSNP is lower than that of our X chromosome study (see fig S6). We expect this because individual NA07340 (from the X study) was also previously used for discovery and submission of SNPs to dbSNP during the HapMap project, in contrast to NA18507.
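The two quantities in this caption, the overlap between call sets and the percentage of calls already present in a database, reduce to set operations. The sketch below assumes each SNP call is represented as a hypothetical `(chromosome, position, allele)` tuple. The function names and the toy data are illustrative and do not reflect the actual ELAND or MAQ output formats.

```python
def percent_known(calls: set, known: set) -> float:
    """Percentage of calls that match an entry in the known-variant set."""
    if not calls:
        return 0.0
    return 100.0 * len(calls & known) / len(calls)


def overlap_summary(calls_a: set, calls_b: set) -> dict:
    """Counts of calls shared by both analyses and unique to each."""
    return {
        "shared": len(calls_a & calls_b),
        "only_a": len(calls_a - calls_b),
        "only_b": len(calls_b - calls_a),
    }


if __name__ == "__main__":
    # Toy data: (chromosome, position, allele). Not real calls.
    aligner_a = {("chr1", 100, "A"), ("chr1", 250, "T"), ("chr2", 40, "G")}
    aligner_b = {("chr1", 100, "A"), ("chr2", 40, "G"), ("chr3", 7, "C")}
    database = {("chr1", 100, "A"), ("chr3", 7, "C")}

    print(overlap_summary(aligner_a, aligner_b))
    print(f"{percent_known(aligner_a, database):.1f}% of set A already known")
```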
Figure 4. Homozygous complex rearrangement detected by anomalous paired reads. The rearrangement involves an inversion of 369 bp (blue-turquoise bar in the schematic) flanked by deletions (red bars) of 1206 and 164 bp, respectively, at the left and right hand breakpoints.
a. Summary tracks in the Resembl browser, denoting scale, simulated alignability of reads to reference (blue plot), actual aligned depth of coverage by NA18507 reads (green plot), density of anomalous reads indicating structural variants (red plot; peaks denote ‘hotspots’), density of singleton reads (pink plot). b. Anomalous long insert read pairs (orange lines denote DNA fragment, blocks at either end denote each read); the data indicate loss of ~1.3 kb in NA18507 relative to the reference. c. Anomalous short insert pairs of two types (red and pink) indicate an inverted sequence flanked by two deletions. d. Normal short insert read pair alignments (each green line denotes the extent of the reference that is covered by the short fragment, including the two reads). e. The schematic depicts the arrangement of normal and anomalous read pairs relative to the rearrangement. Top line: structure of NA18507; second line: structure of reference sequence. Green bars denote sequence that is collinear in the reference and NA18507. The turquoise-blue bar illustrates the inverted segment. Red bars indicate the sequences present in the reference but absent in NA18507. Arrows denote orientation of reads when aligned to the reference. Note that the display in a–d is a composite of screen shots of the same window, overlapped for display purposes in this figure.
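The idea behind anomalous read pairs can be illustrated with a small classifier. A pair is flagged when its orientation, or the distance it spans on the reference, departs from what the library preparation should produce. This is a simplified sketch, not the detection method used in the paper. The `ReadPair` type, the labels, and the span thresholds are all hypothetical, and both reads are assumed to align to the same chromosome.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadPair:
    """Two aligned reads on one chromosome; left_pos <= right_pos is assumed."""
    left_pos: int       # leftmost reference coordinate of the left read
    left_strand: str    # '+' or '-'
    right_pos: int      # leftmost reference coordinate of the right read
    right_strand: str   # '+' or '-'


def orientation(pair: ReadPair) -> str:
    """'inward' for +/- pairs, 'outward' for -/+ pairs, else 'same_strand'."""
    if pair.left_strand == "+" and pair.right_strand == "-":
        return "inward"
    if pair.left_strand == "-" and pair.right_strand == "+":
        return "outward"
    return "same_strand"


def classify(pair: ReadPair, expected_orientation: str,
             min_span: int, max_span: int, read_length: int = 35) -> str:
    """Label a pair as normal or as one of three anomaly types."""
    if orientation(pair) != expected_orientation:
        return "anomalous_orientation"   # consistent with, e.g., an inversion
    span = pair.right_pos + read_length - pair.left_pos
    if span > max_span:
        return "span_too_long"           # consistent with a deletion in the sample
    if span < min_span:
        return "span_too_short"          # consistent with an insertion in the sample
    return "normal"


if __name__ == "__main__":
    # Hypothetical short-insert library: inward-facing reads, 150-250 bp span.
    pairs = [
        ReadPair(1000, "+", 1165, "-"),
        ReadPair(1000, "+", 2500, "-"),
        ReadPair(1000, "+", 1165, "+"),
    ]
    for p in pairs:
        print(p, classify(p, "inward", 150, 250))
```

For a long-insert library of the kind described in Figure 1d, the expected orientation would differ from that of a short-insert library, so `expected_orientation` and the span limits are left as parameters rather than fixed values.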
Figure 5. Effect of sequence depth on coverage and accuracy of human genome sequencing. ELAND alignments were used for this analysis.
a. Accumulation of sequence-based SNP calls, including all SNPs (squares), heterozygous SNPs (triangles) and homozygous SNPs (circles) with increasing input read depth. b. Decrease in genotype positions not covered by sequence (squares), heterozygote undercalls in sequence data relative to genotype data (triangles) and discordant SNP calls compared to genotypes (circles) with increasing input read depth. Vertical dotted lines indicate various input read depths (10x, 15x, 30x haploid genome).
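One way to see why heterozygote undercalls fall as depth increases is a toy binomial model: at a heterozygous site covered by n reads, each read samples either allele with probability 0.5, and the site can only be called heterozygous if both alleles are seen often enough. The sketch below is an illustrative model under those assumptions, not the calling procedure used in the paper. The function name and the `min_reads` threshold are hypothetical, and the model ignores sequencing errors and variation in depth.

```python
import math


def prob_both_alleles(depth: int, min_reads: int = 2) -> float:
    """Probability that each allele is seen at least min_reads times
    among `depth` reads, with each read sampling either allele at 0.5."""
    # Sum binomial terms where the first allele's read count i leaves
    # at least min_reads reads for the second allele as well.
    favourable = sum(math.comb(depth, i)
                     for i in range(min_reads, depth - min_reads + 1))
    return favourable / 2 ** depth


if __name__ == "__main__":
    for depth in (10, 15, 30):
        print(f"{depth:2d}x: P(both alleles observed) = {prob_both_alleles(depth):.6f}")
```

In this model the probability of missing one allele shrinks rapidly between 10x and 30x, which is qualitatively consistent with the trend described for panel b.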

Cited by 1,399 PubMed Central articles

References

    1. International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature. 2004;431:931–945.
    2. Levy S, et al. The diploid genome sequence of an individual human. PLoS Biol. 2007;5:e254.
    3. Margulies M, et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005;437:376–380.
    4. Shendure J, et al. Accurate multiplex polony sequencing of an evolved bacterial genome. Science. 2005;309:1728–1732.
    5. Harris TD, et al. Single-molecule DNA sequencing of a viral genome. Science. 2008;320:106–109.
