Genetics. 2014 Jun;197(2):573-89.
doi: 10.1534/genetics.114.164350. Epub 2014 Apr 2.

fastSTRUCTURE: variational inference of population structure in large SNP data sets

Anil Raj et al. Genetics. 2014 Jun.

Abstract

Tools for estimating population structure from genetic data are now used in a wide variety of applications in population genetics. However, inferring population structure in large modern data sets imposes severe computational challenges. Here, we develop efficient algorithms for approximate inference of the model underlying the STRUCTURE program using a variational Bayesian framework. Variational methods pose the problem of computing relevant posterior distributions as an optimization problem, allowing us to build on recent advances in optimization theory to develop fast inference tools. In addition, we propose useful heuristic scores to identify the number of populations represented in a data set and a new hierarchical prior to detect weak population structure in the data. We test the variational algorithms on simulated data and illustrate using genotype data from the CEPH-Human Genome Diversity Panel. The variational algorithms are almost two orders of magnitude faster than STRUCTURE and achieve accuracies comparable to those of ADMIXTURE. Furthermore, our results show that the heuristic scores for choosing model complexity provide a reasonable range of values for the number of populations represented in the data, with minimal bias toward detecting structure when it is very weak. Our algorithm, fastSTRUCTURE, is freely available online at http://pritchardlab.stanford.edu/structure.html.
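As a rough illustration of the model the abstract refers to, the sketch below evaluates the genotype log-likelihood of the admixture model for biallelic SNPs: each of an individual's two allele copies is treated as a draw from a frequency obtained by mixing population allele frequencies with that individual's admixture proportions. The function name, argument layout, and clipping constant are assumptions made for this example. It covers only the likelihood term, not the priors or the variational updates that fastSTRUCTURE uses.

```python
import math


def admixture_log_likelihood(genotypes, Q, P):
    """Log-likelihood of biallelic genotypes under the admixture model.

    genotypes[n][l]: count (0, 1, or 2) of one allele for individual n at
                     locus l, or None if the entry is missing.
    Q[n][k]: admixture proportion of individual n from population k
             (each row sums to 1).
    P[l][k]: frequency of the counted allele at locus l in population k.
    """
    total = 0.0
    for n, row in enumerate(genotypes):
        for l, g in enumerate(row):
            if g is None:
                continue  # missing genotypes contribute nothing
            # Individual-specific allele frequency: mixture over populations.
            f = sum(q * p for q, p in zip(Q[n], P[l]))
            # Clip to avoid log(0); the constant is an arbitrary choice.
            f = min(max(f, 1e-12), 1.0 - 1e-12)
            # Binomial(2, f) log-probability of observing g copies.
            total += (math.log(math.comb(2, g))
                      + g * math.log(f)
                      + (2 - g) * math.log(1.0 - f))
    return total
```

A Bayesian treatment such as the one in the abstract places priors on Q and P and approximates their posterior, so this quantity is one ingredient of the objective rather than the objective itself.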

Keywords: population structure; variational inference.

Figures

Figure 1
Accuracy of different algorithms as a function of resolvability of population structure. (A) Demographic model underlying the three populations represented in the simulated data sets. The edge weights quantify the amount of drift from the ancestral population. (B and C) Resolvability is a scalar by which the population-specific drifts in the demographic model are multiplied, with higher values of resolvability corresponding to stronger structure. (B) Compares the optimal model complexity given the data, averaged over 50 replicates, inferred by ADMIXTURE (K_cv), fastSTRUCTURE with simple prior (K_cv, K*_ℰ, K_∅^C), and fastSTRUCTURE with logistic prior (K_cv). (C) Compares the accuracy of admixture proportions, averaged over replicates, estimated by each algorithm at the optimal value of K in each replicate.
Figure 2
Accuracy of different algorithms as a function of the true number of populations. The demographic model is a star-shaped genealogy with populations having undergone equal amounts of drift. Subfigures A and C correspond to strong structure (F = 0.04) and B and D to weak structure (F = 0.01). (A and B) Compare the optimal model complexity estimated by the different algorithms using various metrics, averaged over 50 replicates, to the true number of populations represented in the data. Notably, when population structure is weak, both ADMIXTURE and fastSTRUCTURE fail to detect structure when the number of populations is too large. (C and D) Compare the accuracy of admixture proportions estimated by each algorithm at the optimal model complexity for each replicate.
Figure 3
Accuracy of different algorithms as a function of model complexity (K) on two simulated data sets, one in which ancestry is easy to resolve (A; r = 1) and one in which ancestry is difficult to resolve (B; r = 0.5). Solid lines correspond to parameter estimates computed with a convergence criterion of |Δℰ| < 10^−8, while the dashed lines correspond to a weaker criterion of |Δℰ| < 10^−6. (Left) Mean admixture divergence between the true and inferred admixture proportions; (middle) mean binomial deviance of held-out genotype entries. Note that for values of K greater than the optimal value, any change in prediction error lies within the standard error of estimates of prediction error, suggesting that we should choose the smallest value of model complexity above which a decrease in prediction error is statistically insignificant. (Right) Approximations to the marginal likelihood of the data computed by STRUCTURE and fastSTRUCTURE.
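The model-choice guideline in the Figure 3 caption can be made concrete in several ways. The sketch below uses a one-standard-error rule: it finds the K with the lowest mean held-out error, then returns the smallest K whose mean error is within one standard error of that minimum. This rule is an assumption standing in for "statistically insignificant"; the caption does not say which test the authors apply, and the function name and inputs are hypothetical.

```python
def choose_model_complexity(ks, mean_error, std_error):
    """Pick the smallest K whose error is within one SE of the best K.

    ks:         candidate values of K
    mean_error: mean held-out prediction error for each K (same order)
    std_error:  standard error of that estimate for each K (same order)
    """
    if not (len(ks) == len(mean_error) == len(std_error)) or not ks:
        raise ValueError("inputs must be non-empty and of equal length")
    # Index of the K with the lowest mean prediction error.
    best = min(range(len(ks)), key=lambda i: mean_error[i])
    # Errors at or below this threshold count as "no worse than the best".
    threshold = mean_error[best] + std_error[best]
    return min(k for k, err in zip(ks, mean_error) if err <= threshold)


# Illustrative numbers only, not taken from the paper.
ks = [1, 2, 3, 4, 5]
mean_error = [0.90, 0.70, 0.50, 0.495, 0.49]
std_error = [0.02, 0.02, 0.02, 0.02, 0.02]
chosen = choose_model_complexity(ks, mean_error, std_error)
```

With these made-up values the errors at K = 4 and K = 5 improve on K = 3 by less than one standard error, so the rule settles on K = 3.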
Figure 4
Visualizing ancestry proportions estimated by different algorithms on two simulated data sets, one with strong structure (top, r = 1) and one with weak structure (bottom, r = 0.5). (Left and middle) Ancestry estimated at model complexity of K = 3 and K = 5, respectively. Insets illustrate the true ancestry and the ancestry inferred by each algorithm. Each color represents a population and each individual is represented by a vertical line partitioned into colored segments whose lengths represent the admixture proportions from K populations. (Right) Mean ancestry contributions of the model components, when the model complexity K = 5.
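The bar representation described in the Figure 4 caption amounts to stacking each individual's admixture proportions end to end. The sketch below computes the lower and upper bound of every colored segment for one individual, which is what a stacked-bar plotting routine would need. The function name is hypothetical and the rescaling step is an assumption added so that rounding error in the input does not leave a gap at the top of the bar.

```python
from itertools import accumulate


def bar_segments(proportions):
    """Return (bottom, top) bounds of each colored segment for one individual.

    proportions: admixture proportions from K populations, in plotting order.
    """
    total = sum(proportions)
    if total <= 0:
        raise ValueError("proportions must have a positive sum")
    # Rescale so the segments fill the bar exactly from 0 to 1.
    normalized = [p / total for p in proportions]
    tops = list(accumulate(normalized))
    bottoms = [0.0] + tops[:-1]
    return list(zip(bottoms, tops))


segments = bar_segments([0.5, 0.25, 0.25])
```

Each tuple gives the vertical extent of one population's color, so an individual with proportions 0.5, 0.25, 0.25 gets a bar whose lower half is the first color.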
Figure 5
Runtimes of different algorithms on simulated data sets with different numbers of loci and samples; the square root of runtime (in minutes) is plotted as a function of the square root of problem size (defined as N × L × K). Similar to Figure 3, dashed lines correspond to a weaker convergence criterion than solid lines.
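The axis transformation in the Figure 5 caption is simple arithmetic, sketched below with a hypothetical helper. Problem size is the product of the number of samples N, loci L, and populations K, and both axes use a square root. If runtime grew linearly with problem size, the points would still fall on a straight line through the origin after this transformation.

```python
import math


def runtime_plot_point(n_samples, n_loci, k, runtime_minutes):
    """Return (x, y) coordinates for one run on the square-root axes."""
    problem_size = n_samples * n_loci * k  # N x L x K, as in the caption
    return math.sqrt(problem_size), math.sqrt(runtime_minutes)


# Illustrative values only, not measurements from the paper.
x, y = runtime_plot_point(n_samples=100, n_loci=10_000, k=4, runtime_minutes=9.0)
```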
Figure 6
Ancestry proportions inferred by ADMIXTURE and fastSTRUCTURE (with the simple prior) on the HGDP data at K = 7 (Li et al. 2008). Notably, ADMIXTURE splits the Central and South American populations into two groups while fastSTRUCTURE assigns higher approximate marginal likelihood to a split of sub-Saharan African populations into two groups.
Figure 7
Model choice of ADMIXTURE and fastSTRUCTURE (with the simple prior) on the HGDP data. The optimal value of K identified by ADMIXTURE using deviance residuals, and by fastSTRUCTURE using deviance, K_∅^C, and LLBO, is shown by a dashed line.
Figure 8
Ancestry proportions inferred by ADMIXTURE and fastSTRUCTURE (with the simple prior) at the optimal choice of K identified by relevant metrics for each algorithm. Notably, the admixture proportions at K = K*_ℰ and K = K_∅^C are quite similar, with estimates in the latter case identifying the Kalash and Karitiana as additional separate groups that share very little ancestry with the remaining populations.
