Nature, 526 (7571), 68-74

A Global Reference for Human Genetic Variation

1000 Genomes Project Consortium et al. Nature.

Abstract

The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations. Here we report completion of the project, having reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping. We characterized a broad spectrum of genetic variation, in total over 88 million variants (84.7 million single nucleotide polymorphisms (SNPs), 3.6 million short insertions/deletions (indels), and 60,000 structural variants), all phased onto high-quality haplotypes. This resource includes >99% of SNP variants with a frequency of >1% for a variety of ancestries. We describe the distribution of genetic variation across the global sample, and discuss the implications for common disease studies.

Figures

Extended Data Figure 1. Summary of the callset generation pipeline
Boxes indicate steps in the process and numbers indicate the corresponding section(s) within the Supplementary Information.
Extended Data Figure 2. Power of discovery and heterozygote genotype discordance
a, The power of discovery within the main data set for SNPs and indels identified within an overlapping sample of 284 genomes sequenced to high coverage by Complete Genomics (CG), and against a panel of >60,000 haplotypes constructed by the Haplotype Reference Consortium (HRC). To provide a measure of uncertainty, one curve is plotted for each chromosome. b, Improved power of discovery in phase 3 compared to phase 1, as assessed in a sample of 170 Complete Genomics genomes that are included in both phase 1 and phase 3. c, Heterozygote discordance in phase 3 for SNPs, indels, and SVs compared to 284 Complete Genomics genomes. d, Heterozygote discordance for phase 3 compared to phase 1 within the intersecting sample. e, Sensitivity to detect Complete Genomics SNPs as a function of sequencing depth. f, Heterozygote genotype discordance as a function of sequencing depth, as compared to Complete Genomics data.
Extended Data Figure 3. Variant counts
a, The number of variants within the phase 3 sample as a function of alternative allele frequency. b, The average number of detected variants per genome with whole-sample allele frequencies <0.5% (grey bars), with the average number of singletons indicated by colours.
Extended Data Figure 4. The standardized number of variant sites per genome, partitioned by population and variant category
For each category, z-scores were calculated by subtracting the mean number of sites per genome (calculated across the whole sample), and dividing by the standard deviation. From left: sites with a derived allele, synonymous sites with a derived allele, nonsynonymous sites with a derived allele, sites with a loss-of-function allele, sites with a HGMD disease mutation allele, sites with a ClinVar pathogenic variant, and sites carrying a GWAS risk allele.
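The standardization described in this caption can be sketched as follows. This is a minimal illustration, not the consortium's code: the function name `z_scores` and the example counts are hypothetical, and the use of the population standard deviation is an assumption, since the caption does not say which estimator was used.

```python
from statistics import mean, pstdev


def z_scores(site_counts):
    """Standardize per-genome site counts against the whole-sample mean and SD."""
    mu = mean(site_counts)
    # Assumption: population SD; the caption does not specify the estimator.
    sigma = pstdev(site_counts)
    if sigma == 0:
        raise ValueError("all counts are identical; z-scores are undefined")
    return [(count - mu) / sigma for count in site_counts]


# Hypothetical per-genome counts of sites in a single variant category.
counts = [100, 110, 120, 130, 140]
scores = z_scores(counts)
```

A genome with exactly the whole-sample mean number of sites gets a score of zero, so the sign of the score indicates whether a genome carries more or fewer sites than average in that category.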
Extended Data Figure 5. Population structure as inferred using the admixture program for K = 5 to 12
Extended Data Figure 6. Allelic sharing
a, Genotype covariance (above diagonal) and sharing of f2 variants (below diagonal) between pairs of individuals. b, Quantification of average f2 sharing between populations. Each row represents the distribution of f2 variants shared between individuals from the population indicated on the left to individuals from each of the sampled populations. c, The average number of f2 variants per haploid genome. d, The inferred age of f2 variants, as estimated from shared haplotype lengths, with black dots indicating the median value.
Extended Data Figure 7. Unsmoothed PSMC curves
a, The median PSMC curve for each population. b, PSMC curves estimated separately for all individuals within the 1000 Genomes sample. c, Unsmoothed PSMC curves comparing estimates from the low coverage data (dashed lines) to those obtained from high coverage PCR-free data (solid lines). Notable differences are confined to very recent time intervals, where the additional rare variants identified by deep sequencing suggest larger population sizes.
Extended Data Figure 8. Genes showing very strong patterns of differentiation between pairs of closely related populations within each continental group
Within each continental group, the maximum PBS statistic was selected from all pairwise population comparisons within the continental group against all possible out-of-continent populations. Note that the x axis shows the number of polymorphic sites within the maximal comparison.
Extended Data Figure 9. Performance of imputation
a, Performance of imputation in 6 populations using a subset of phase 3 as a reference panel (n = 2,445), phase 1 (n = 1,065), and the corresponding data within intersecting samples from both phases (n = 1,006). b, Performance of imputation from phase 3 by variant class.
Extended Data Figure 10. Decay of linkage disequilibrium as a function of physical distance
Linkage disequilibrium was calculated around 10,000 randomly selected polymorphic sites in each population, having first thinned each population down to the same sample size (61 individuals). The plotted line represents a 5 kb moving average.
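One way to read the "5 kb moving average" is as a sliding window over physical distance. The sketch below assumes that reading. The function name `ld_decay_curve`, the 1 kb step, the half-open windows, and the example values are all hypothetical choices for illustration; the caption specifies only the 5 kb window width.

```python
def ld_decay_curve(pairs, window_bp=5_000, step_bp=1_000):
    """Average LD within sliding distance windows.

    pairs: list of (distance_bp, r2) tuples for pairs of sites.
    Returns a list of (window_center_bp, mean_r2) tuples.
    """
    if not pairs:
        return []
    max_dist = max(dist for dist, _ in pairs)
    curve = []
    start = 0
    while start <= max_dist:
        end = start + window_bp
        # Half-open window [start, end): an assumption made for this sketch.
        in_window = [r2 for dist, r2 in pairs if start <= dist < end]
        if in_window:
            curve.append((start + window_bp / 2, sum(in_window) / len(in_window)))
        start += step_bp
    return curve


# Hypothetical (distance, r2) observations around one focal site.
pairs = [(500, 0.9), (2_500, 0.7), (6_000, 0.4), (9_000, 0.2)]
curve = ld_decay_curve(pairs)
```

Thinning every population to the same sample size before this step, as the caption describes, matters because estimates of LD depend on sample size, so curves from unequal samples would not be directly comparable.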
Figure 1. Population sampling
a, Polymorphic variants within sampled populations. The area of each pie is proportional to the number of polymorphisms within a population. Pies are divided into four slices, representing variants private to a population (darker colour unique to population), private to a continental area (lighter colour shared across continental group), shared across continental areas (light grey), and shared across all continents (dark grey). Dashed lines indicate populations sampled outside of their ancestral continental region. b, The number of variant sites per genome. c, The average number of singletons per genome.
Figure 2. Population structure and demography
a, Population structure inferred using a maximum likelihood approach with 8 clusters. b, Changes to effective population sizes over time, inferred using PSMC. Lines represent the within-population median PSMC estimate, smoothed by fitting a cubic spline passing through bin midpoints.
Figure 3. Population differentiation
a, Variants found to be rare (<0.5%) within the global sample, but common (>5%) within a population. b, Genes showing strong differentiation between pairs of closely related populations. The vertical axis gives the maximum obtained value of the FST-based population branch statistic (PBS), with selected genes coloured to indicate the population in which the maximum value was achieved.
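The caption does not spell out how PBS is computed. The sketch below uses the commonly used definition, in which each pairwise FST is converted to a branch length T = -ln(1 - FST) and the length of the branch leading to the target population is (T_target,sister + T_target,outgroup - T_sister,outgroup) / 2. The function names and the example FST values are hypothetical, and the per-gene FST estimates are assumed to have been computed elsewhere.

```python
import math


def branch_length(fst):
    """Convert a pairwise FST value into a divergence time T = -ln(1 - FST)."""
    if not 0.0 <= fst < 1.0:
        raise ValueError("FST must lie in [0, 1)")
    return -math.log(1.0 - fst)


def pbs(fst_target_sister, fst_target_outgroup, fst_sister_outgroup):
    """Population branch statistic for the target population.

    Assumed roles: 'sister' is a closely related population from the same
    continental group; 'outgroup' is an out-of-continent population.
    """
    t_ts = branch_length(fst_target_sister)
    t_to = branch_length(fst_target_outgroup)
    t_so = branch_length(fst_sister_outgroup)
    return (t_ts + t_to - t_so) / 2.0


# Hypothetical FST values for one gene.
example = pbs(0.2, 0.3, 0.1)
```

A large value means that allele frequencies changed mostly on the branch leading to the target population, which is why the statistic is useful for picking out differentiation between otherwise closely related populations.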
Figure 4. Imputation and eQTL discovery
a, Imputation accuracy as a function of allele frequency for six populations. The inset compares imputation accuracy between phase 3 and phase 1, using all samples (solid lines) and intersecting samples (dashed lines). b, The average number of tagging variants (r² > 0.8) as a function of physical distance for common (top), low frequency (middle), and rare (bottom) variants. c, The proportion of top eQTL variants that are SNPs and indels, as discovered in 69 samples from each population. d, The percentage of eQTLs in TFBS, having performed discovery in the first population, and fine mapped by including an additional 69 samples from a second population (*P < 0.01, **P < 0.001, ***P < 0.0001, McNemar’s test). The diagonal represents the percentage of eQTLs in TFBS using the original discovery sample.
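The tagging criterion in panel b can be illustrated with a small sketch that computes the squared correlation between two biallelic sites from phased haplotypes and counts the sites exceeding the 0.8 threshold. The function names `r_squared` and `count_tags` and the toy haplotypes are hypothetical; only the threshold comes from the caption.

```python
def r_squared(site_a, site_b):
    """Squared correlation between two biallelic sites.

    Each argument lists the allele (0 or 1) carried at that site by the
    same phased haplotypes, in the same order.
    """
    n = len(site_a)
    if n == 0 or n != len(site_b):
        raise ValueError("sites must cover the same non-empty set of haplotypes")
    p_a = sum(site_a) / n
    p_b = sum(site_b) / n
    p_ab = sum(a & b for a, b in zip(site_a, site_b)) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0  # a monomorphic site carries no LD information
    d = p_ab - p_a * p_b  # the LD coefficient D
    return d * d / denom


def count_tags(focal, others, threshold=0.8):
    """Count the sites whose r^2 with the focal site exceeds the threshold."""
    return sum(1 for other in others if r_squared(focal, other) > threshold)


# Toy data: eight haplotypes, one focal site, three candidate tagging sites.
focal = [1, 1, 0, 0, 1, 0, 0, 0]
identical = [1, 1, 0, 0, 1, 0, 0, 0]
inverted = [0, 0, 1, 1, 0, 1, 1, 1]
weak = [1, 0, 1, 0, 0, 1, 0, 0]
n_tags = count_tags(focal, [identical, inverted, weak])
```

Because r² is symmetric in the allele labels, a site with the alleles swapped tags the focal site just as well as an identical one, so two of the three toy candidates count as tags here.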
