PLoS One. 2013 Jul 4;8(7):e65834. doi: 10.1371/journal.pone.0065834. Print 2013.

Fine-scale Patterns of Population Stratification Confound Rare Variant Association Tests

Free PMC article

Timothy D O'Connor et al. PLoS One.

Abstract

Advances in next-generation sequencing technology have enabled systematic exploration of the contribution of rare variation to Mendelian and complex diseases. Although it is well known that population stratification can generate spurious associations with common alleles, its impact on rare variant association methods remains poorly understood. Here, we performed exhaustive coalescent simulations with demographic parameters calibrated from exome sequence data to evaluate the performance of nine rare variant association methods in the presence of fine-scale population structure. We find that all methods have an inflated spurious association rate for parameter values that are consistent with levels of differentiation typical of European populations. For example, at a nominal significance level of 5%, some test statistics have a spurious association rate as high as 40%. Finally, we empirically assess the impact of population stratification in a large data set of 4,298 European American exomes. Our results have important implications for the design, analysis, and interpretation of rare variant genome-wide association studies.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Schematic of demographic model used in the simulations.
Parameter values were inferred by calibrating to patterns of variation in exome data from 316 European Americans.
Figure 2. Rare variant association methods exhibit higher than expected rates of spurious associations.
Each square represents a confounding scenario set by different values of disease risks, parameterized by Y, and the proportions of each sampled subpopulation, parameterized by X as presented in the text. A value of 0.0 for X indicates an equal proportion of each subpopulation in the study pool and 0.00 for Y indicates an equal disease risk. Spurious association rates (SAR) lower than 5% are represented as white, with other levels signified by sequential coloration with red the lowest and blue the highest. Actual values of the SAR can be found in Figure S2 in File S1.
Figure 3. The effects of PCA correction on logistic CMC.
The top figure has the spurious association rate (SAR) of CMC without correcting for population structure. The middle figure shows the SAR of CMC when a single PC is included as a covariate. The bottom figure shows the SAR of CMC when 10 PCs are included as covariates. Each square represents a confounding scenario parameterized by X and Y as presented in the text. SAR lower than 5% are represented as white, with other levels signified by sequential coloration with red the lowest and blue the highest. Actual values of the SAR can be found in Figure S4 in File S1.
Figure 4. SAR of rare variant association methods as a function of F_ST.
We tested for spurious association rates at various divergence times, presented as F_ST estimates for comparison with European populations in HGDP (light blue shading). The various lines represent differences in disease risk according to the equations P(d = c | i = 1) = 0.02 + X and P(d = c | i = 2) = 0.02 − X. The dashed black line represents the α = 0.05 value used to determine significance and the dotted lines represent the 95% confidence intervals calculated by bootstrapping.
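To illustrate why the two risk equations in this caption can produce a spurious signal, here is a minimal sketch, not the authors' simulation. It computes the expected odds ratio between carrier status and disease in a pooled sample when a variant has no causal effect but differs in frequency between the two subpopulations. The carrier frequencies `q1` and `q2`, the mixing fraction `w1`, and the function name are illustrative assumptions; only the 0.02 ± X risk model comes from the caption.

```python
def marginal_odds_ratio(x, q1, q2, w1=0.5, base_risk=0.02):
    """Expected carrier-vs-disease odds ratio in a pooled two-population sample.

    x         -- risk offset X from the caption: disease risk is
                 base_risk + x in subpopulation 1, base_risk - x in subpopulation 2
    q1, q2    -- hypothetical carrier frequencies of a NON-causal rare variant
                 in subpopulations 1 and 2 (assumed values, not from the paper)
    w1        -- assumed fraction of the pooled sample from subpopulation 1
    """
    w2 = 1.0 - w1
    r1, r2 = base_risk + x, base_risk - x

    # Joint probabilities of the 2x2 table, summed over both subpopulations.
    # Within each subpopulation, carrier status and disease are independent.
    case_carrier = w1 * q1 * r1 + w2 * q2 * r2
    case_noncarrier = w1 * (1 - q1) * r1 + w2 * (1 - q2) * r2
    ctrl_carrier = w1 * q1 * (1 - r1) + w2 * q2 * (1 - r2)
    ctrl_noncarrier = w1 * (1 - q1) * (1 - r1) + w2 * (1 - q2) * (1 - r2)

    return (case_carrier * ctrl_noncarrier) / (case_noncarrier * ctrl_carrier)


if __name__ == "__main__":
    # Equal risk (X = 0): no confounding even if frequencies differ.
    print(marginal_odds_ratio(0.0, q1=0.03, q2=0.01))
    # Unequal risk and unequal frequency: odds ratio departs from 1.
    print(marginal_odds_ratio(0.01, q1=0.03, q2=0.01))
```

In this sketch the odds ratio equals 1 whenever either the risks or the carrier frequencies match across subpopulations, and departs from 1 only when both differ, which is the confounding pattern the caption describes.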
Figure 5. Correcting for population structure reduces the power of rare variant association methods.
The figure shows the power of logistic regression methods when including ten PC covariates. The x-axis shows the odds ratio (OR), where 1.0 is the null model. “No Structure” indicates simulations where power was estimated from sampling cases and controls from a single panmictic population, but still corrected for structure. The dashed black line represents α = 0.05 and the dotted lines represent the 95% bootstrap confidence intervals.
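The SAR in Figure 4 and the power in Figure 5 are both rejection rates at α = 0.05 with 95% bootstrap confidence intervals. The sketch below shows one way such an estimate could be computed from a list of per-replicate p-values. The record does not say which bootstrap variant the authors used, so the percentile bootstrap, the function names, and the replicate count are assumptions.

```python
import random


def rejection_rate(p_values, alpha=0.05):
    """Fraction of replicates whose p-value falls below alpha."""
    return sum(p < alpha for p in p_values) / len(p_values)


def bootstrap_ci(p_values, alpha=0.05, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the rejection rate (assumed method)."""
    rng = random.Random(seed)
    n = len(p_values)
    # Resample replicates with replacement and recompute the rate each time.
    rates = sorted(
        rejection_rate(rng.choices(p_values, k=n), alpha)
        for _ in range(n_boot)
    )
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return rates[lo_idx], rates[hi_idx]


if __name__ == "__main__":
    # Made-up p-values for illustration only: 40 rejections out of 100.
    demo = [0.01] * 40 + [0.5] * 60
    print(rejection_rate(demo), bootstrap_ci(demo))
```

Under the null model the rate is a spurious association rate; under a model with a true effect the same quantity is power.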
Figure 6. Probability of being a case as a function of PC1 and PC2.
Individuals (dots) are colored according to a logistic regression with β values scaled so that, in this example, the odds ratio (OR) is 5 over a distance of one fourth of the range between the minimal and maximal values on each axis. In other words, individuals separated by a fourth of the PC distance will have an OR of 5 compared to each other. The probability of being a case is thus indicated by the color of each dot on a scale from 0.06 to 1, as indicated by the gradient (lower right corner).
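The scaling described in this caption can be sketched as follows: choose β so that moving a quarter of the PC range multiplies the odds by 5, then map each individual's PC coordinates to a case probability through the logistic function. This is an illustrative reading of the caption, not the authors' code. The function names, the PC range used in the example, and the intercept are assumptions; the caption does not report the intercept.

```python
import math


def scaled_beta(pc_min, pc_max, odds_ratio=5.0, fraction=0.25):
    """Coefficient giving `odds_ratio` across `fraction` of the PC range."""
    return math.log(odds_ratio) / (fraction * (pc_max - pc_min))


def case_probability(pc1, pc2, beta1, beta2, intercept=0.0):
    """Logistic model: P(case) = 1 / (1 + exp(-(b0 + b1*PC1 + b2*PC2)))."""
    z = intercept + beta1 * pc1 + beta2 * pc2
    return 1.0 / (1.0 + math.exp(-z))


def odds(p):
    return p / (1.0 - p)


if __name__ == "__main__":
    # Hypothetical PC1 range of [-0.04, 0.04]; a quarter of it is 0.02.
    b1 = scaled_beta(-0.04, 0.04)
    p_a = case_probability(0.00, 0.0, b1, 0.0, intercept=-1.0)
    p_b = case_probability(0.02, 0.0, b1, 0.0, intercept=-1.0)
    # Ratio of odds for two individuals a quarter-range apart on PC1.
    print(odds(p_b) / odds(p_a))
```

Because the odds ratio depends only on β and the distance between individuals, the intercept shifts the overall probability scale without changing the ratio.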
