Nature, 538 (7624), 201-206

The Simons Genome Diversity Project: 300 Genomes From 142 Diverse Populations

Swapan Mallick, Heng Li, Mark Lipson, Iain Mathieson, Melissa Gymrek, Fernando Racimo, Mengyao Zhao, Niru Chennagiri, Susanne Nordenfelt, Arti Tandon, Pontus Skoglund, Iosif Lazaridis, Sriram Sankararaman, Qiaomei Fu, Nadin Rohland, Gabriel Renaud, Yaniv Erlich, Thomas Willems, Carla Gallo, Jeffrey P Spence, Yun S Song, Giovanni Poletti, Francois Balloux, George van Driem, Peter de Knijff, Irene Gallego Romero, Aashish R Jha, Doron M Behar, Claudio M Bravi, Cristian Capelli, Tor Hervig, Andres Moreno-Estrada, Olga L Posukh, Elena Balanovska, Oleg Balanovsky, Sena Karachanak-Yankova, Hovhannes Sahakyan, Draga Toncheva, Levon Yepiskoposyan, Chris Tyler-Smith, Yali Xue, M Syafiq Abdullah, Andres Ruiz-Linares, Cynthia M Beall, Anna Di Rienzo, Choongwon Jeong, Elena B Starikovskaya, Ene Metspalu, Jüri Parik, Richard Villems, Brenna M Henn, Ugur Hodoglugil, Robert Mahley, Antti Sajantila, George Stamatoyannopoulos, Joseph T S Wee, Rita Khusainova, Elza Khusnutdinova, Sergey Litvinov, George Ayodo, David Comas, Michael F Hammer, Toomas Kivisild, William Klitz, Cheryl A Winkler, Damian Labuda, Michael Bamshad, Lynn B Jorde, Sarah A Tishkoff, W Scott Watkins, Mait Metspalu, Stanislav Dryomov, Rem Sukernik, Lalji Singh, Kumarasamy Thangaraj, Svante Pääbo, Janet Kelso, Nick Patterson, David Reich

Swapan Mallick et al. Nature.


Here we report the Simons Genome Diversity Project data set: high quality genomes from 300 individuals from 142 diverse populations. These genomes include at least 5.8 million base pairs that are not present in the human reference genome. Our analysis reveals key features of the landscape of human genome variation, including that the rate of accumulation of mutations has accelerated by about 5% in non-Africans compared to Africans since divergence. We show that the ancestors of some pairs of present-day human populations were substantially separated by 100,000 years ago, well before the archaeologically attested onset of behavioural modernity. We also demonstrate that indigenous Australians, New Guineans and Andamanese do not derive substantial ancestry from an early dispersal of modern humans; instead, their modern human ancestry is consistent with coming from the same source as that of other non-Africans.


Extended Data Figure 1. Heatmap of fraction of heterozygous sites missed in the 1000 Genomes Project
For each sample, we examine all heterozygous sites passing filter level 1, and compute the fraction included as known polymorphisms in the 1000 Genomes Project.
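The per-sample quantity in this heatmap can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes each sample's heterozygous sites and the 1000 Genomes polymorphisms are available as sets of (chromosome, position) pairs, and the function name and example data are hypothetical.

```python
def fraction_missed(het_sites, known_sites):
    """Fraction of a sample's heterozygous sites absent from a catalogue.

    Both arguments are iterables of (chromosome, position) pairs; this
    representation is an assumption made for the sketch.
    """
    het_sites = set(het_sites)
    if not het_sites:
        raise ValueError("sample has no heterozygous sites")
    found = len(het_sites & set(known_sites))
    return 1.0 - found / len(het_sites)


# Toy data, invented for illustration only.
sample = {("1", 100), ("1", 250), ("2", 75), ("X", 900)}
catalogue = {("1", 100), ("2", 75), ("3", 5)}
print(fraction_missed(sample, catalogue))
```

The fraction included as known polymorphisms, as described in the caption, is one minus this value.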
Extended Data Figure 2. Worldwide variation in human short tandem repeats
A: Mean STR length, reported as the average length difference (in base pairs) from the GRCh37 reference for each genotype. Bubble area scales with the number of calls compared at each point. B and C: The first two principal components from principal component analysis of tetranucleotide and homopolymer genotypes, respectively. Colors represent the region of origin of each sample. D: Pairwise FST values between populations computed using only SNPs vs. using combined SNP+STR loci. E: Block jackknife standard errors for the SNP vs. SNP+STR FST analysis. The red dashed lines give the best-fit line, described by the formula in red. The black dashed line denotes the diagonal.
Extended Data Figure 3. ADMIXTURE analysis
We carried out unsupervised ADMIXTURE 1.23 analysis over the 300 SGDP individuals in 20 replicates with randomly chosen initial seeds, varying the number of ancestral populations between K=2 and K=12 and using default 5-fold cross-validation (--cv flag). We used genotypes of at least filter level 1, and restricted analysis to sites where at least two individuals carried the variant allele (as singleton variants are non-informative for population clustering). After further restricting to sites with at least 99% completeness and performing linkage-disequilibrium-based pruning in PLINK 1.9 with parameters (--indep-pairwise 1000 100 0.2), a total of 482,515 single nucleotide polymorphisms remained. This figure shows the highest-likelihood replicate for each value of K. We found that log likelihood increases monotonically with K, while the value K=5 minimizes cross-validation error (not shown). The solution at K=5 corresponds to major continental groups (Sub-Saharan Africans, Oceanians, East Asians, Native Americans, and West Eurasians), but we show the full range of K here because the other values illustrate finer-scale population structure that may be useful to users of the data.
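The two per-site filters and the choice of K described above can be sketched in plain Python. This is a hedged illustration only: the genotype encoding (alternate-allele counts 0, 1, 2, with None for a missing call), the function names, and the cross-validation error values are assumptions, and the LD pruning step performed in PLINK is not reproduced here.

```python
def keep_site(genotypes, min_carriers=2, min_completeness=0.99):
    """Apply the carrier-count and completeness filters to one site.

    genotypes: one entry per individual, holding the alternate-allele
    count (0, 1 or 2) or None when the call is missing (assumed encoding).
    """
    called = [g for g in genotypes if g is not None]
    completeness = len(called) / len(genotypes)
    carriers = sum(1 for g in called if g > 0)
    return carriers >= min_carriers and completeness >= min_completeness


def best_k(cv_errors):
    """Return the K whose cross-validation error is lowest."""
    return min(cv_errors, key=cv_errors.get)


# Toy sites for 300 individuals, invented for illustration.
doubleton = [1, 1] + [0] * 298            # two carriers, fully called
singleton = [1] + [0] * 299               # one carrier only
sparse = [1, 2] + [None] * 4 + [0] * 294  # 296/300 calls, below 99%

print(keep_site(doubleton), keep_site(singleton), keep_site(sparse))

# Hypothetical CV errors; the real values are not given in the caption.
print(best_k({4: 0.52, 5: 0.50, 6: 0.51}))
```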
Extended Data Figure 4. Principal component analysis and neighbor joining tree
A: Principal component analysis. B: Neighbor-joining tree based on FST values for all populations with at least two samples.
Extended Data Figure 5. Fewer accumulated mutations in Africans than in non-Africans confirmed by mapping to chimpanzee
We compute a statistic D(Population A, Population B, Chimp), measuring the difference in the rate of matching to chimpanzee in Population A compared to Population B. The evidence of mismatching to chimpanzee is seen when we restrict to the male X chromosome, to eliminate possible effects due to differences in heterozygosity across populations, and map to the chimpanzee genome, which is phylogenetically symmetrically related to all present-day humans. We find that in 78 randomly chosen pairs of males (Population A = African, Population B = non-African), transversion substitutions show no consistent skew from zero, but transition substitutions do.
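The caption does not give the formula for D, so the following is only one plausible form, assumed for illustration: among sites where two haploid sequences (such as male X chromosomes) differ, count how often each one matches chimpanzee, and normalize the difference. The sign convention, function name, and toy sequences are all hypothetical.

```python
# Unordered base pairs that constitute transitions (purine<->purine,
# pyrimidine<->pyrimidine); every other mismatch is a transversion.
TRANSITIONS = {frozenset("AG"), frozenset("CT")}


def d_statistic(pop_a, pop_b, chimp, kind=None):
    """Normalized excess of chimp-matching in A over B (assumed definition).

    pop_a, pop_b, chimp: aligned haploid sequences of equal length.
    kind: None for all sites, or "transition" / "transversion".
    Positive values mean A matches chimpanzee more often than B does.
    """
    n_a = n_b = 0
    for a, b, c in zip(pop_a, pop_b, chimp):
        if a == b:
            continue  # only sites where the two humans differ are informative
        is_transition = frozenset((a, b)) in TRANSITIONS
        if kind == "transition" and not is_transition:
            continue
        if kind == "transversion" and is_transition:
            continue
        if a == c:
            n_a += 1
        elif b == c:
            n_b += 1
    total = n_a + n_b
    return (n_a - n_b) / total if total else 0.0


# Toy aligned sequences, invented for illustration.
seq_a, seq_b, seq_chimp = "AAGCT", "GAGTA", "AAGCA"
print(d_statistic(seq_a, seq_b, seq_chimp))
print(d_statistic(seq_a, seq_b, seq_chimp, kind="transition"))
print(d_statistic(seq_a, seq_b, seq_chimp, kind="transversion"))
```

Splitting by substitution class mirrors the comparison in the caption, where transitions and transversions are examined separately.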
Extended Data Figure 6. 3P-CLR scan for positive selection
The red line denotes the 99.9% quantile cutoff. The genes in the top 5 regions are labeled. A: Scan for selection on the San terminal branch. B: Scan for selection on the non-San terminal branch. C: Scan for selection on the ancestral modern human branch.
Extended Data Figure 7. Scan for genomic locations where the great majority of present-day humans share a recent common ancestor
We carried out PSMC analysis on 40 pairs of haploid genomes chosen to sample some of the most deeply divergent present-day human lineages. We recorded the time since the most recent common ancestor (TMRCA) at each position, and rescaled to obtain an estimate of absolute time (Supplementary Information section 12). A: Distribution across the genome of the fraction of TMRCAs below specified date cutoffs. For the 100 kya cutoff, the maximum fraction observed anywhere in the genome is 68%. B: Distribution across the genome of the date T at which specified fractions of sample pairs are inferred to have a TMRCA less than T. C: Percentile points of the cumulative distribution function of B.
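The quantity in panel A can be sketched as follows, assuming the rescaled TMRCA estimates are held as one list per genome pair with one value per genomic position. The data layout, function name, and numbers are hypothetical and serve only to illustrate the calculation.

```python
def fraction_below(tmrca_by_pair, cutoff):
    """Per-position fraction of genome pairs whose TMRCA is below cutoff.

    tmrca_by_pair: list of equal-length lists, one per genome pair,
    giving the TMRCA at each position (assumed layout).
    """
    n_pairs = len(tmrca_by_pair)
    n_positions = len(tmrca_by_pair[0])
    return [
        sum(1 for pair in tmrca_by_pair if pair[i] < cutoff) / n_pairs
        for i in range(n_positions)
    ]


# Toy TMRCAs in kya for 4 pairs at 3 positions, invented for illustration.
pairs = [
    [80, 300, 900],
    [120, 90, 700],
    [60, 500, 1200],
    [95, 250, 50],
]
fractions = fraction_below(pairs, cutoff=100)
print(fractions)       # fraction of pairs below 100 kya at each position
print(max(fractions))  # genome-wide maximum, as quoted in the caption
```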
Figure 1. Genetic variation in the SGDP
A: Neighbor-joining tree of relationships based on pairwise divergence. B: Plot of autosomal heterozygosity against the X-to-autosome heterozygosity ratio, showing the reduction in this ratio in non-Africans and Pygmies. C: Estimate of Neanderthal ancestry with a heatmap scale of 0–3%. D: Estimate of Denisovan ancestry with a heatmap scale of 0–0.5% to bring out subtle differences in mainland Eurasia (Oceanian groups with as much as 5% Denisovan ancestry are saturated in bright red).
Figure 2. Cross-coalescence rates and effective population sizes for selected population pairs
A–C: Cross-coalescence rates as a function of time in thousands of years ago (kya) estimated using MSMC, with four haplotypes per pair. In each subfigure legend, we give the point estimate of the date at which 25%, 50% and 75% of lineages in the pair of populations have coalesced into a common ancestral population. We generated these plots using data phased with the 1000 Genomes reference panel (method PS1 described in Supplementary Information section 9), but only show pairs of populations for which the cross-coalescence rates are relatively insensitive to the phasing approach. A: Selected African cross-coalescence rates. B: Central African rainforest hunter-gatherer cross-coalescence rates. C: Ancient non-African cross-coalescence rates. D–F: Effective population sizes inferred using PSMC, using one diploid genome per population, for the same populations that we used in A–C.
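One way to read off dates such as the 25%, 50% and 75% points from a curve is linear interpolation between estimated time points. The sketch below assumes a curve that rises with time before present and is stored as parallel lists; how the authors actually derived their point estimates is not stated in the caption, and the function name and values are hypothetical.

```python
def crossing_time(times, fractions, level):
    """Interpolate the time at which a rising curve first reaches level.

    times: increasing time points (e.g. kya); fractions: curve values at
    those times. Returns None if the curve never reaches the level.
    """
    points = list(zip(times, fractions))
    for (t0, f0), (t1, f1) in zip(points, points[1:]):
        if f0 <= level <= f1 and f1 > f0:
            return t0 + (level - f0) * (t1 - t0) / (f1 - f0)
    return None


# Toy curve, invented for illustration.
times_kya = [0, 50, 100, 200]
coalesced = [0.0, 0.2, 0.6, 1.0]
for level in (0.25, 0.50, 0.75):
    print(level, crossing_time(times_kya, coalesced, level))
```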
Figure 3. Present-day populations have negligible ancestry from an early dispersal of modern humans out of Africa
Best-fitting admixture graph model of relationships among Australians, New Guineans, Andamanese and other diverse populations. Present-day populations are shown in blue, ancient samples in red, and select inferred ancestral nodes in green. Dotted lines indicate admixture events, all of which involve archaic humans. All f-statistic relationships are accurately fit to within 2.1 standard errors. (Inset) Results of adding putative early dispersal admixture to the graph model for different assumptions about when the early lineage split off. We specify the split time in terms of the genetic drift above the "Non-African" node, with 0.01 units of drift representing on the order of ten thousand years. The (approximate) model likelihood is maximized with zero early dispersal ancestry, and no more than a few percent is consistent with the data.


