Nature 491 (7422), 56-65

An Integrated Map of Genetic Variation From 1,092 Human Genomes

1000 Genomes Project Consortium et al. Nature.

Abstract

By characterizing the geographic and functional spectrum of human genetic variation, the 1000 Genomes Project aims to build a resource to help to understand the genetic contribution to disease. Here we describe the genomes of 1,092 individuals from 14 populations, constructed using a combination of low-coverage whole-genome and exome sequencing. By developing methods to integrate information across several algorithms and diverse data sources, we provide a validated haplotype map of 38 million single nucleotide polymorphisms, 1.4 million short insertions and deletions, and more than 14,000 larger deletions. We show that individuals from different populations carry different profiles of rare and common variants, and that low-frequency variants show substantial geographic differentiation, which is further increased by the action of purifying selection. We show that evolutionary conservation and coding consequence are key determinants of the strength of purifying selection, that rare-variant load varies substantially across biological pathways, and that each individual contains hundreds of rare non-coding variants at conserved sites, such as motif-disrupting changes in transcription-factor-binding sites. This resource, which captures up to 98% of accessible single nucleotide polymorphisms at a frequency of 1% in related populations, enables analysis of common and low-frequency variants in individuals from diverse, including admixed, populations.

Figures

Figure 1. Power and accuracy
a, Power to detect SNPs as a function of variant count (and proportion) across the entire set of samples, estimated by comparison to independent SNP array data in the exome (green) and whole genome (blue). b, Genotype accuracy compared to the same SNP array data as a function of variant frequency, summarised by the r² between true and inferred genotype (coded as 0, 1 and 2) within the exome (green), whole genome after haplotype integration (blue) and whole genome without haplotype integration (red).
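The accuracy measure in panel b can be illustrated with a minimal sketch, assuming genotypes are given as two equal-length lists of alternate-allele counts (0, 1 or 2). The function name `genotype_r2` and the toy data are hypothetical and are not taken from the paper's pipeline.

```python
def genotype_r2(true_genotypes, inferred_genotypes):
    """Squared Pearson correlation between two genotype vectors coded 0/1/2."""
    n = len(true_genotypes)
    if n == 0 or n != len(inferred_genotypes):
        raise ValueError("genotype vectors must be non-empty and of equal length")

    mean_true = sum(true_genotypes) / n
    mean_inferred = sum(inferred_genotypes) / n

    # Sums of cross-products and squared deviations (the 1/n factors cancel in r^2).
    cov = sum((t - mean_true) * (g - mean_inferred)
              for t, g in zip(true_genotypes, inferred_genotypes))
    var_true = sum((t - mean_true) ** 2 for t in true_genotypes)
    var_inferred = sum((g - mean_inferred) ** 2 for g in inferred_genotypes)

    if var_true == 0 or var_inferred == 0:
        return float("nan")  # r^2 is undefined when either vector is constant
    return cov * cov / (var_true * var_inferred)


# Toy example: array ("true") genotypes versus sequencing-based calls at one site.
array_calls = [0, 1, 2, 0]
sequence_calls = [0, 1, 1, 0]
print(genotype_r2(array_calls, sequence_calls))
```

In practice the statistic would be aggregated over many sites within a frequency bin; this sketch handles a single pair of vectors only.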
Figure 2. The distribution of rare and common variants
a, Summary of inferred haplotypes across a 100 kb region of chromosome 2 spanning the genes ALMS1 and NAT8, variation in which has been associated with kidney disease. Each row represents an estimated haplotype, with the population of origin indicated on the right. Reference alleles are indicated by the light blue background. Variants (non-reference alleles) above 0.5% frequency are indicated by pink (typed on the high-density SNP array), white (previously known) and dark blue (not previously known). Low-frequency variants (<0.5%) are indicated by blue crosses. Indels are indicated by green triangles and novel variants by dashes below. A large, low-frequency deletion (black line) spanning NAT8 is present in some populations. Multiple structural haplotypes mediated by segmental duplications are present at this locus, including copy number gains, which were not genotyped for this study. Within each population, haplotypes are ordered by total variant count across the region. b, The fraction of variants identified across the project that are found in only one population (white line), are restricted to a single ancestry-based group (defined as in part a, solid colour), are found in all groups (solid black line) and are found in all populations (dotted black line). c, The density of the expected number of variants per kb carried by a genome drawn from each population, as a function of variant frequency (see Supplementary Information). Colours as for part a. Under a model of constant population size, the expected density is constant across the frequency spectrum.
Figure 3. Allele sharing within and between populations
a, Sharing of f2 variants, those found exactly twice across the entire sample, within and between populations. Each row represents the distribution across populations for the origin of samples sharing an f2 variant with the target population (indicated by the left-hand side). The grey bar represents the average number of f2 variants carried by a randomly-chosen genome in each population. b, Median length of haplotype identity (excluding cryptically-related samples and singleton variants and allowing for up to two genotype errors) between two chromosomes that share variants of a given frequency in each population. Estimates are from 200 randomly-sampled regions of 1 Mb each and up to 15 pairs of individuals for each variant. c, The average proportion of variants that are novel (compared to the pilot phase of the project) among those found in regions inferred to have different ancestries within ASW, PUR, CLM and MXL. Error bars represent 95% bootstrap confidence intervals.
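The f2 tally in panel a can be sketched as follows, assuming a small in-memory genotype matrix (one list of alternate-allele counts per variant) and one population label per sample. The function name `f2_sharing` and the toy data are hypothetical. As a simplifying assumption, a variant whose two copies sit in a single homozygous individual is skipped, because it has no pair of carriers to tally.

```python
from collections import Counter


def f2_sharing(genotypes, sample_pops):
    """Count f2 variants by the (sorted) pair of populations of their two carriers.

    genotypes:   list of variants; each variant is a list of alt-allele counts (0/1/2),
                 one entry per sample.
    sample_pops: list of population labels, one per sample, in the same order.
    """
    sharing = Counter()
    for variant in genotypes:
        if sum(variant) != 2:
            continue  # not an f2 variant: the allele is not seen exactly twice
        carriers = [i for i, count in enumerate(variant) if count > 0]
        if len(carriers) != 2:
            continue  # assumption: skip a single homozygous carrier
        pair = tuple(sorted(sample_pops[i] for i in carriers))
        sharing[pair] += 1
    return sharing


# Toy data: four samples, five variants.
pops = ["YRI", "YRI", "LWK", "MXL"]
toy_genotypes = [
    [1, 1, 0, 0],  # f2, shared within YRI
    [1, 0, 1, 0],  # f2, shared between YRI and LWK
    [2, 0, 0, 0],  # two copies in one individual, skipped
    [1, 1, 1, 0],  # three copies, not f2
    [0, 0, 1, 1],  # f2, shared between LWK and MXL
]
print(f2_sharing(toy_genotypes, pops))
```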
Figure 4. Purifying selection within and between populations
a, The relationship between evolutionary conservation (measured by GERP score) and rare variant proportion (fraction of all variants with derived allele frequency < 0.5%) for variants occurring in different functional elements and with different coding consequences. Crosses indicate the average GERP score at variant sites (x-axis) and proportion of rare variants (y-axis) in each category. b, Levels of evolutionary conservation (mean GERP score, top) and genetic diversity (per nucleotide pairwise differences, bottom) for sequences matching the CTCF-binding motif within CTCF-binding peaks as experimentally identified by ChIP-Seq in the ENCODE project (blue) and in a matched set of motifs outside peaks (red). The logo plot shows the distribution of identified motifs within peaks. Error bars represent ± 2 s.e.m.
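The y-axis quantity in panel a, the proportion of variants in a category with derived allele frequency below 0.5%, can be sketched as below. The input layout (pairs of category label and derived allele frequency), the function name `rare_variant_proportion` and the toy values are assumptions for illustration, not data from the paper.

```python
from collections import defaultdict


def rare_variant_proportion(variants, threshold=0.005):
    """Fraction of variants per category with derived allele frequency < threshold.

    variants: iterable of (category, derived_allele_frequency) pairs.
    """
    totals = defaultdict(int)
    rare = defaultdict(int)
    for category, daf in variants:
        totals[category] += 1
        if daf < threshold:
            rare[category] += 1
    return {category: rare[category] / totals[category] for category in totals}


# Toy data only; the frequencies are made up to exercise the function.
toy_variants = [
    ("synonymous", 0.001),
    ("synonymous", 0.200),
    ("synonymous", 0.004),
    ("synonymous", 0.050),
    ("nonsense", 0.001),
    ("nonsense", 0.002),
]
print(rare_variant_proportion(toy_variants))
```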
Figure 5. Implications of Phase 1 1000 Genomes data for GWAS
a, Accuracy of imputation of genome-wide SNPs, exome SNPs and indels (using sites on the Illumina 1M array) into 10 individuals of African ancestry (3 LWK, 4 Maasai from Kenya - MKK, 2 YRI) sequenced to high coverage by an independent technology. Only indels in regions of high sequence complexity with frequency >1% are analysed. Deletion imputation accuracy is estimated by comparison to array data (for a different set of individuals of similar ancestry, but included on the same plot for clarity). Accuracy is measured by the squared Pearson correlation coefficient between imputed and true dosage across all sites in a frequency range estimated from the 1000 Genomes data. Lines represent whole genome SNPs (solid), exome SNPs (long dashes), short indels (dotted) and large deletions (short dashes). b, The average number of variants in linkage disequilibrium (r² > 0.5 among EUR) with focal SNPs identified in GWAS, as a function of distance from the index SNP. Lines indicate the number of HapMap, Pilot and Phase 1 variants.
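The linkage disequilibrium threshold in panel b uses the standard pairwise r² statistic. A minimal sketch for two biallelic sites, assuming phased haplotypes coded 0 (reference) and 1 (alternate), is given below; the function name `ld_r2` and the toy haplotypes are hypothetical.

```python
import math


def ld_r2(site_a, site_b):
    """Pairwise LD r^2 between two biallelic sites from phased haplotypes (0/1)."""
    n = len(site_a)
    if n == 0 or n != len(site_b):
        raise ValueError("haplotype vectors must be non-empty and of equal length")

    p_a = sum(site_a) / n                                  # alt frequency at site A
    p_b = sum(site_b) / n                                  # alt frequency at site B
    p_ab = sum(a * b for a, b in zip(site_a, site_b)) / n  # both alt on one haplotype

    d = p_ab - p_a * p_b                                   # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return math.nan  # undefined when either site is monomorphic
    return d * d / denom


# Toy haplotypes: the index SNP against a perfectly linked and an unlinked site.
index_snp = [0, 0, 1, 1]
linked = [0, 0, 1, 1]
unlinked = [0, 1, 0, 1]
print(ld_r2(index_snp, linked), ld_r2(index_snp, unlinked))
```

A site would count towards the curve in panel b when its r² with the index SNP exceeds 0.5.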
Figure 6

Comment in

  • A New Era of Human Population Genetics
    A Platt et al. Genome Biol 13 (12), 182. PMID 23268745.
    The 1000 Genomes Project Consortium has recently published an important early contribution to a new generation of systematic surveys of rare human genetic variation.

Similar articles

  • A Map of Human Genome Variation From Population-Scale Sequencing
    1000 Genomes Project Consortium et al. Nature 467 (7319), 1061-73. PMID 20981092.
    The 1000 Genomes Project aims to provide a deep characterization of human genome sequence variation as a foundation for investigating the relationship between genotype an …
  • A Global Reference for Human Genetic Variation
    1000 Genomes Project Consortium et al. Nature 526 (7571), 68-74. PMID 26432245.
    The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individu …
  • An Integrated Map of Structural Variation in 2,504 Human Genomes
    PH Sudmant et al. Nature 526 (7571), 75-81. PMID 26432246.
    Structural variants are implicated in numerous diseases and make up the majority of varying nucleotides among human genomes. Here we describe an integrated set of eight s …
  • Small Insertions and Deletions (INDELs) in Human Genomes
    JM Mullaney et al. Hum Mol Genet 19 (R2), R131-6. PMID 20858594. - Review
    In this review, we focus on progress that has been made with detecting small insertions and deletions (INDELs) in human genomes. Over the past decade, several million sma …
  • Genomic Analysis in the Age of Human Genome Sequencing
    T Lappalainen et al. Cell 177 (1), 70-84. PMID 30901550. - Review
    Affordable genome sequencing technologies promise to revolutionize the field of human genetics by enabling comprehensive studies that interrogate all classes of genome va …

Cited by 3,548 PubMed Central articles
