PLoS Genet, 8 (11), e1002967

Inference of Population Splits and Mixtures From Genome-Wide Allele Frequency Data

Joseph K Pickrell et al. PLoS Genet.

Abstract

Many aspects of the historical relationships between populations in a species are reflected in genetic data. Inferring these relationships from genetic data, however, remains a challenging task. In this paper, we present a statistical model for inferring the patterns of population splits and mixtures in multiple populations. In our model, the sampled populations in a species are related to their common ancestor through a graph of ancestral populations. Using genome-wide allele frequency data and a Gaussian approximation to genetic drift, we infer the structure of this graph. We applied this method to a set of 55 human populations and a set of 82 dog breeds and wild canids. In both species, we show that a simple bifurcating tree does not fully describe the data; in contrast, we infer many migration events. While some of the migration events that we find have been detected previously, many have not. For example, in the human data, we infer that Cambodians trace approximately 16% of their ancestry to a population ancestral to other extant East Asian populations. In the dog data, we infer that both the boxer and basenji trace a considerable fraction of their ancestry (9% and 25%, respectively) to wolves subsequent to domestication and that East Asian toy breeds (the Shih Tzu and the Pekingese) result from admixture between modern toy breeds and "ancient" Asian breeds. Software implementing the model described here, called TreeMix, is available at http://treemix.googlecode.com.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Figure 1. Simple examples.
A. An example tree. B. The covariance matrix implied by the tree structure in A. Note that the covariance here is with respect to the allele frequency at the root, and that each entry has been divided by formula image to simplify the presentation. C. An example graph. The migration edge is colored red. Parental populations for population 3 are labeled formula image and formula image; see the main text for details. D. The covariance matrix implied by the graph in C; again, each entry has been divided by formula image. The migration terms are in red, and the non-migration terms are in blue.
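
The tree-implied covariance described in panels A and B can be sketched in a few lines. This is a minimal illustration, not the TreeMix implementation: the branch names, branch lengths, and the four-population topology are made up, and the helper `admixed_covariance` assumes the admixed population is a simple weighted mixture of its two parental populations, ignoring any drift after the mixture.

```python
# Hypothetical branch lengths (drift units); names and values are illustrative only.
branch_length = {"a": 0.10, "b": 0.20, "c": 0.05, "d": 0.03, "e": 0.04, "f": 0.06}

# Tree ((1,2),(3,4)): each population is listed with the branches on its path from the root.
path_from_root = {
    1: ["a", "c"],
    2: ["a", "d"],
    3: ["b", "e"],
    4: ["b", "f"],
}

def tree_covariance(p, q):
    """Covariance of p and q relative to the root: total length of the branches they share."""
    shared = set(path_from_root[p]) & set(path_from_root[q])
    return sum(branch_length[b] for b in shared)

def admixed_covariance(w, parent_a, parent_b, other):
    """Covariance between `other` and a population that draws a fraction w of its
    ancestry from parent_a and 1 - w from parent_b (assumed linear mixture;
    `other` must not be the admixed population itself)."""
    return w * tree_covariance(parent_a, other) + (1 - w) * tree_covariance(parent_b, other)

pops = [1, 2, 3, 4]
V = [[tree_covariance(p, q) for q in pops] for p in pops]
for row in V:
    print(["%.2f" % v for v in row])
```

In this sketch, populations on opposite sides of the root share no branches and so have zero covariance, while a migration edge adds weighted terms of the kind shown in red in panel D.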
Figure 2. Performance on simulated data.
A. The basic outline of the demographic model used. B. Trees inferred by TreeMix. We simulated 100 independent data sets, under the demographic model in A., and inferred the tree. All simulations gave the same topology; plotted are the mean branch lengths. C. Performance in the presence of migration. We added migration events to the tree in A. and inferred the structure of the graph. Each point represents the error rate over 100 independent simulations, defined as the fraction of simulations where the inferred graph topology does not perfectly match the simulated topology. On the x-axis we show the populations involved in the simulated migration event; e.g., if the source population is 1 and the destination population is 10, this is a migration event from population 1 to population 10, as labeled in A. D. Admixture weight estimation. We simulated admixture events with different weights from population 1 to population 10, and inferred the weight. Each point is the mean across 100 simulations, and the bar represents the range.
Figure 3. Inferred human tree.
A. Maximum likelihood tree. Plotted is the maximum-likelihood tree. Populations are colored according to geographic location (black: archaic humans, red: Africa, brown: Middle East, green: Europe, blue: Central Asia, purple: America, orange: East Asia). The scale bar shows ten times the average standard error of the entries in the sample covariance matrix (formula image). For analysis including Oceania, see Figures S11 and S12. B. Residual fit. Plotted is the residual fit from the maximum likelihood tree in A. We divided the residual covariance between each pair of populations i and j by the average standard error across all pairs. We then plot in each cell (i, j) this scaled residual. Colors are described in the palette on the right. Residuals above zero represent populations that are more closely related to each other in the data than in the best-fit tree, and thus are candidates for admixture events.
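
The residual scaling described in panel B can be illustrated as follows. This is a hedged sketch with made-up numbers: the function name `scaled_residuals` and the three input matrices are hypothetical, and it assumes that "all pairs" includes the diagonal entries when averaging the standard errors.

```python
def scaled_residuals(observed, fitted, std_err):
    """Residual covariance (observed minus fitted) for each pair of populations,
    divided by the mean standard error over all entries of `std_err`."""
    n = len(observed)
    mean_se = sum(std_err[i][j] for i in range(n) for j in range(n)) / (n * n)
    return [[(observed[i][j] - fitted[i][j]) / mean_se for j in range(n)]
            for i in range(n)]

# Illustrative two-population example (values are invented).
observed = [[0.15, 0.12], [0.12, 0.13]]   # sample covariance matrix
fitted = [[0.15, 0.10], [0.10, 0.13]]     # covariance implied by the best-fit tree
std_err = [[0.01, 0.01], [0.01, 0.01]]    # standard error of each entry

R = scaled_residuals(observed, fitted, std_err)
for row in R:
    print(["%.1f" % r for r in row])
```

A positive off-diagonal entry means the pair covaries more in the data than the tree allows, which is what marks the pair as a candidate for an admixture event.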
Figure 4. Inferred human tree with mixture events.
Plotted is the structure of the graph inferred by TreeMix for human populations, allowing ten migration events. Migration arrows are colored according to their weight. Horizontal branch lengths are proportional to the amount of genetic drift that has occurred on the branch. The scale bar shows ten times the average standard error of the entries in the sample covariance matrix (formula image). The residual fit from this graph is shown in Figure S9. Admixture from Neandertals to non-African populations is only apparent when considering subsets of the data (see Discussion and Figure S15).
Figure 5. Inferred dog tree.
A. Maximum likelihood tree. Populations are colored according to breed type. Dark blue: wild canids, grey: ancient breeds, brown: spitz breeds, black: toy dogs, red: spaniels, maroon: scent hounds, dark red: working dogs, light green: herding dogs, light blue: mastiff-like dogs, purple: small terriers, orange: retrievers, dark green: sight hounds. The scale bar shows ten times the average standard error of the entries in the sample covariance matrix (formula image). B. Residual fit. Plotted is the residual fit from the maximum likelihood tree in A. We divided the residual covariance between each pair of populations i and j by the average standard error across all pairs. We then plot in each cell (i, j) this scaled residual. Colors are described in the palette on the right.
Figure 6. Inferred dog graph.
Plotted is the structure of the graph inferred by TreeMix for dog populations, allowing ten migration events. Migration arrows are colored according to their weight. The scale bar shows ten times the average standard error of the entries in the sample covariance matrix (formula image). See the main text for discussion. The residual fit from this graph is presented in Figure S13.
