Genetics. 2018 Oct;210(2):719-731.
doi: 10.1534/genetics.118.301336. Epub 2018 Aug 21.

Inferring Population Structure and Admixture Proportions in Low-Depth NGS Data

Jonas Meisner et al. Genetics. 2018 Oct.

Abstract

We here present two methods for inferring population structure and admixture proportions in low-depth next-generation sequencing (NGS) data. Inference of population structure is essential in both population genetics and association studies, and is often performed using principal component analysis (PCA) or clustering-based approaches. NGS methods provide large amounts of genetic data but are associated with statistical uncertainty, especially for low-depth sequencing data. Models can account for this uncertainty by working directly on genotype likelihoods of the unobserved genotypes. We propose a method for inferring population structure through PCA in an iterative heuristic approach of estimating individual allele frequencies, where we demonstrate improved accuracy in samples with low and variable sequencing depth for both simulated and real datasets. We also use the estimated individual allele frequencies in a fast non-negative matrix factorization method to estimate admixture proportions. Both methods have been implemented in the PCAngsd framework available at http://www.popgen.dk/software/.

Keywords: PCA; Population structure; admixture; ancestry; genotype likelihoods; low depth; next-generation sequencing.
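
The abstract describes working directly on genotype likelihoods and iterating between PCA and individual allele frequencies. The block below is a minimal sketch of that idea, not the PCAngsd implementation: it assumes a Hardy-Weinberg genotype prior, a plain EM estimate of population allele frequencies, a truncated SVD to obtain individual allele frequencies, and a fixed number of iterations. All function names, the clipping bounds, and the iteration counts are hypothetical choices made for illustration.

```python
import numpy as np

def expected_genotypes(gl, freq):
    """Posterior expected genotype (dosage) for each individual and site.

    gl   : (n_individuals, n_sites, 3) likelihoods P(data | G = 0, 1, 2)
    freq : (n_sites,) or (n_individuals, n_sites) allele frequencies used
           for a Hardy-Weinberg genotype prior (an assumption of this sketch)
    """
    f = np.broadcast_to(freq, gl.shape[:2])
    prior = np.stack([(1 - f) ** 2, 2 * f * (1 - f), f ** 2], axis=-1)
    post = gl * prior
    post /= post.sum(axis=-1, keepdims=True)
    return post @ np.array([0.0, 1.0, 2.0])

def em_frequencies(gl, n_iter=50):
    """EM estimate of population allele frequencies from genotype likelihoods."""
    f = np.full(gl.shape[1], 0.25)
    for _ in range(n_iter):
        f = expected_genotypes(gl, f).mean(axis=0) / 2
    return np.clip(f, 1e-4, 1 - 1e-4)

def individual_frequencies(dosage, freq, n_components):
    """Individual allele frequencies from a low-rank fit of centered dosages."""
    u, s, vt = np.linalg.svd(dosage - 2 * freq, full_matrices=False)
    k = n_components
    low_rank = (u[:, :k] * s[:k]) @ vt[:k]
    return np.clip((low_rank + 2 * freq) / 2, 1e-4, 1 - 1e-4)

def pca_low_depth(gl, n_components=2, n_iter=10):
    """Alternate between dosages and individual frequencies, then run PCA."""
    freq = em_frequencies(gl)
    dosage = expected_genotypes(gl, freq)
    pi = np.broadcast_to(freq, dosage.shape)
    for _ in range(n_iter):
        pi = individual_frequencies(dosage, freq, n_components)
        dosage = expected_genotypes(gl, pi)  # prior now differs per individual
    # Standardize each site by its population frequency before the final SVD.
    z = (dosage - 2 * freq) / np.sqrt(2 * freq * (1 - freq))
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    return u[:, :n_components] * s[:n_components], pi
```

Calling `pca_low_depth(gl, n_components=2)` on an array of genotype likelihoods returns one row of principal component scores per individual, together with the individual allele frequencies from the last iteration. Replacing the population-wide prior with an individual-specific one is what lets samples with very different sequencing depths be treated consistently in this sketch.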

Figures

Figure 1
PCA plots of the top two principal components in the simulated dataset consisting of 380 individuals and 0.4 million variable sites. The left-hand plot shows the PCA performed on the known genotypes using Equation 2. The middle plot shows the PCA performed by PCAngsd, and the right-hand plot displays the PCA performed by the ngsTools model (Equation 3).
Figure 2
Admixture plots for K=3 of the simulated dataset, where each bar represents a single individual and the different colors reflect each of the K components. The first plot shows the admixture proportions estimated in ADMIXTURE using the known genotypes, which we use as the ground truth in our simulation studies. The second plot shows the admixture proportions estimated using PCAngsd with parameter α=0, and the bottom plot shows those estimated using NGSadmix.
Figure 3
PCA plots of the top two principal components for the 1000 Genomes dataset with 193 individuals and 8 million variable sites. The left-hand plot is based on the reliable genotypes of the overlapping variable sites in the low-depth NGS data, the middle plot shows the PCA performed by PCAngsd, and the right-hand plot shows the PCA performed by the ngsTools model.
Figure 4
Admixture plots for K=4 of the 1000 Genomes dataset, where each bar represents a single individual and the different colors reflect each of the K components. The first plot shows the admixture proportions estimated in ADMIXTURE using the reliable genotypes, the second plot shows the admixture proportions estimated in PCAngsd with parameter α=1500, and the last plot shows the admixture proportions estimated in NGSadmix.
Figure 5
PCA plots of the top four principal components for the waterbuck dataset with 73 individuals and 9.4 million variable sites. The first row displays the plots of the first and second principal components for PCAngsd and the ngsTools model, respectively, while the second row displays the plots of the third and fourth principal components.
Figure 6
Admixture plots for K=5 of the waterbuck dataset, where each bar represents a single individual and the different colors reflect each of the K components. The first plot shows the admixture proportions estimated in PCAngsd with parameter α=5000, and the second plot shows the admixture proportions estimated in NGSadmix.
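
In the admixture plots, each individual is described by K proportions that sum to one. As a rough illustration of how such proportions could be obtained from individual allele frequencies by non-negative matrix factorization, the sketch below factorizes a matrix of individual allele frequencies into admixture proportions `q` and ancestral frequencies `f`. It uses generic multiplicative updates followed by a heuristic row normalization and clipping, which is an assumption of this example and not necessarily the update rule used in PCAngsd. The regularization parameter α from the captions is not modeled, and the function name and defaults are hypothetical.

```python
import numpy as np

def nmf_admixture(pi, k, n_iter=500, seed=0):
    """Factorize individual allele frequencies pi (n x m) as q @ f.

    q : (n, k) admixture proportions, non-negative with rows summing to one
    f : (k, m) ancestral allele frequencies, kept strictly inside (0, 1)
    """
    rng = np.random.default_rng(seed)
    n, m = pi.shape
    q = rng.random((n, k)) + 0.1
    q /= q.sum(axis=1, keepdims=True)
    f = rng.random((k, m)) * 0.8 + 0.1
    eps = 1e-12  # guards against division by zero
    for _ in range(n_iter):
        # Multiplicative update for f, then clip to valid frequencies.
        f *= (q.T @ pi) / (q.T @ q @ f + eps)
        f = np.clip(f, 1e-6, 1 - 1e-6)
        # Multiplicative update for q, then project rows back to sum to one.
        q *= (pi @ f.T) / (q @ f @ f.T + eps)
        q /= q.sum(axis=1, keepdims=True)
    return q, f
```

Each row of the returned `q` corresponds to one bar in an admixture plot, with the k entries giving the heights of the colored segments.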
