fastSTRUCTURE: variational inference of population structure in large SNP data sets
- PMID: 24700103
- PMCID: PMC4063916
- DOI: 10.1534/genetics.114.164350
Abstract
Tools for estimating population structure from genetic data are now used in a wide variety of applications in population genetics. However, inferring population structure in large modern data sets imposes severe computational challenges. Here, we develop efficient algorithms for approximate inference of the model underlying the STRUCTURE program using a variational Bayesian framework. Variational methods pose the problem of computing relevant posterior distributions as an optimization problem, allowing us to build on recent advances in optimization theory to develop fast inference tools. In addition, we propose useful heuristic scores to identify the number of populations represented in a data set and a new hierarchical prior to detect weak population structure in the data. We test the variational algorithms on simulated data and illustrate using genotype data from the CEPH-Human Genome Diversity Panel. The variational algorithms are almost two orders of magnitude faster than STRUCTURE and achieve accuracies comparable to those of ADMIXTURE. Furthermore, our results show that the heuristic scores for choosing model complexity provide a reasonable range of values for the number of populations represented in the data, with minimal bias toward detecting structure when it is very weak. Our algorithm, fastSTRUCTURE, is freely available online at http://pritchardlab.stanford.edu/structure.html.
Keywords: population structure; variational inference.
Copyright © 2014 by the Genetics Society of America.
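The abstract's central idea, that a posterior distribution can be computed by solving an optimization problem, can be illustrated with a deliberately tiny example. The sketch below is not the fastSTRUCTURE algorithm: it assumes a single haploid individual drawn from one of two populations with known allele frequencies, and every name and number in it (`freqs`, `prior`, `alleles`, `fit_variational`) is hypothetical. It maximizes the evidence lower bound (ELBO) over a categorical distribution `phi` by plain gradient ascent and compares the result with the exact posterior, which is available in closed form for a model this small.

```python
import math

def softmax(logits):
    """Map unconstrained logits to a probability vector."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def log_joint(alleles, freqs, prior):
    """log p(alleles, z = k) for each population k, assuming independent loci."""
    scores = []
    for pi_k, pop_freqs in zip(prior, freqs):
        s = math.log(pi_k)
        for x, p in zip(alleles, pop_freqs):
            s += math.log(p if x == 1 else 1.0 - p)
        scores.append(s)
    return scores

def elbo(phi, scores):
    """ELBO(phi) = E_q[log p(x, z)] - E_q[log q(z)] with q(z = k) = phi[k]."""
    return sum(f * (s - math.log(f)) for f, s in zip(phi, scores) if f > 0.0)

def fit_variational(scores, steps=2000, lr=1.0):
    """Maximize the ELBO by gradient ascent on the logits of phi."""
    logits = [0.0] * len(scores)
    for _ in range(steps):
        phi = softmax(logits)
        a = [s - math.log(f) for s, f in zip(scores, phi)]
        mean_a = sum(f * v for f, v in zip(phi, a))
        # Gradient of the ELBO with respect to logit j is phi_j * (a_j - mean_a).
        logits = [eta + lr * f * (v - mean_a)
                  for eta, f, v in zip(logits, phi, a)]
    return softmax(logits)

# Hypothetical toy data: 2 populations, 4 biallelic loci, one haploid sample.
freqs = [[0.8, 0.3, 0.6, 0.4],   # allele-1 frequencies in population 0
         [0.4, 0.6, 0.5, 0.7]]   # allele-1 frequencies in population 1
prior = [0.5, 0.5]
alleles = [1, 0, 1, 0]

scores = log_joint(alleles, freqs, prior)
phi = fit_variational(scores)

# Exact answer for comparison (only feasible because the model is tiny).
log_evidence = math.log(sum(math.exp(s) for s in scores))
exact_posterior = [math.exp(s - log_evidence) for s in scores]

print("variational:", phi)
print("exact:      ", exact_posterior)
print("ELBO at optimum:", elbo(phi, scores), "log evidence:", log_evidence)
```

Because `phi` is unrestricted here, the optimum coincides with the exact posterior and the ELBO reaches the log evidence. In a model of realistic size the variational family is restricted, so the optimum is only an approximation to the posterior and the ELBO stays below the log evidence.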
Comment in
- Variations on a common STRUCTURE: new algorithms for a valuable model. Genetics. 2014 Jul;197(3):809-11. doi: 10.1534/genetics.114.166264. PMID: 25024035
