PLoS One. 2008 Jul 2;3(7):e2551.
doi: 10.1371/journal.pone.0002551.

Population substructure and control selection in genome-wide association studies

Free PMC article

Kai Yu et al. PLoS One. 2008.
Abstract

Determination of the relevance of both demanding classical epidemiologic criteria for control selection and robust handling of population stratification (PS) represents a major challenge in the design and analysis of genome-wide association studies (GWAS). Empirical data from two GWAS in European Americans of the Cancer Genetic Markers of Susceptibility (CGEMS) project were used to evaluate the impact of PS in studies with different control selection strategies. In each of the two original case-control studies nested in corresponding prospective cohorts, a minor confounding effect due to PS (inflation factor lambda of 1.025 and 1.005) was observed. In contrast, when the control groups were exchanged to mimic a cost-effective but theoretically less desirable control selection strategy, the confounding effects were larger (lambda of 1.090 and 1.062). A panel of 12,898 autosomal SNPs common to both the Illumina and Affymetrix commercial platforms and with low local background linkage disequilibrium (pair-wise r² < 0.004) was selected to infer population substructure with principal component analysis. A novel permutation procedure was developed for the correction of PS that identified a smaller set of principal components and achieved a better control of type I error (to lambda of 1.032 and 1.006, respectively) than currently used methods. The overlap between sets of SNPs in the bottom 5% of p-values based on the new test and the test without PS correction was about 80%, with the majority of discordant SNPs having both ranks close to the threshold. Thus, for the CGEMS GWAS of prostate and breast cancer conducted in European Americans, PS does not appear to be a major problem in well-designed studies. A study using suboptimal controls can have acceptable type I error when an effective strategy for the correction of PS is employed.
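
The inflation factor lambda quoted above can be illustrated with a short sketch. The abstract does not say which estimator the authors used, so the code below assumes the common median-based genomic-control definition: the median of the observed 1 d.f. chi-square statistics divided by the theoretical median (about 0.455). The function names are hypothetical, and P-values are assumed to be two-sided and strictly greater than 0.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 d.f. (about 0.4549),
# computed as the squared 75th percentile of the standard normal.
CHI2_1DF_MEDIAN = NormalDist().inv_cdf(0.75) ** 2


def p_to_chi2(p):
    """Convert a two-sided P-value (0 < p <= 1) to a 1 d.f. chi-square statistic."""
    z = NormalDist().inv_cdf(1.0 - p / 2.0)
    return z * z


def inflation_factor(p_values):
    """Median-based genomic-control estimate of lambda (an assumed estimator)."""
    stats = [p_to_chi2(p) for p in p_values]
    return median(stats) / CHI2_1DF_MEDIAN


if __name__ == "__main__":
    # Evenly spaced P-values imitate a null scan with no inflation.
    null_p = [(i + 0.5) / 1001 for i in range(1001)]
    print(round(inflation_factor(null_p), 3))
```

Under this definition, a value near 1 indicates little over-dispersion, and values above 1 indicate inflated test statistics, which is the pattern the abstract reports when the control groups were exchanged.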

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. A diagram for the three main sets of SNPs used in the text.
The first set of PCA SNPs is used to identify hidden population substructure. The set of genomic control SNPs is used to evaluate the over-dispersion factor in a given study, as well as in the proposed permutation procedure to select relevant PCs for the correction of PS. The second set of PCA SNPs is used to validate findings from the first set of PCA SNPs. In applications, only the first set of PCA SNPs is recommended.
Figure 2. Samples represented by their first two principal components.
Principal components (PCs; the 1st along the horizontal direction, the 2nd along the vertical direction) were obtained by applying PCA to the joint sample of the PLCO prostate cancer and NHS breast cancer studies. A) First two PCs for subjects from the PLCO prostate cancer study. B) First two PCs for subjects from the NHS breast cancer study.
Figure 3. Q-Q plot based on the test without PC adjustment.
For each of the four analyses, the Q-Q plot is based on P-values (in log10 scale) that correspond to the 1 d.f. Wald test on 475,116 testing autosomal SNPs by assuming an additive risk model (in logit scale) and without PC adjustment. A) Results for the original prostate cancer study (prostate cancer cases and controls from PLCO). B) Results for the reconstructed prostate cancer study using external controls (prostate cancer cases from PLCO, and external controls from NHS). C) Results for the original breast cancer study (breast cancer cases and controls from NHS). D) Results for the reconstructed breast cancer study using external controls (breast cancer cases from NHS, and external controls from PLCO).
Figure 4. Q-Q plot based on the test with PC adjustment.
For each of the four analyses, the Q-Q plot is based on P-values (in log10 scale) that correspond to the 1 d.f. Wald test on 475,116 testing autosomal SNPs by assuming an additive risk model (in logit scale) and with PC adjustment. The PCs used in adjustment are selected by the proposed permutation procedure. A) Results for the original prostate cancer study (prostate cancer cases and controls from PLCO). B) Results for the reconstructed prostate cancer study using external controls (prostate cancer cases from PLCO, and external controls from NHS). C) Results for the original breast cancer study (breast cancer cases and controls from NHS). D) Results for the reconstructed breast cancer study using external controls (breast cancer cases from NHS, and external controls from PLCO).
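
The coordinates behind a Q-Q plot such as those in Figures 3 and 4 can be sketched as follows. This is a generic illustration, not the authors' plotting code: the function name is hypothetical, and the expected quantiles assume the plotting position (i + 0.5)/n, which is one of several common conventions.

```python
import math


def qq_points(p_values):
    """Return (expected, observed) pairs of -log10 P-values, in ascending order.

    Expected values assume uniform P-values under the null, using the
    plotting position (i + 0.5) / n (an assumed convention).
    P-values must be strictly greater than 0.
    """
    n = len(p_values)
    observed = sorted(-math.log10(p) for p in p_values)
    expected = sorted(-math.log10((i + 0.5) / n) for i in range(n))
    return list(zip(expected, observed))


if __name__ == "__main__":
    # Toy P-values; a real scan would supply one value per tested SNP.
    for exp, obs in qq_points([0.9, 0.5, 0.2, 0.04, 0.001]):
        print(f"{exp:.3f}\t{obs:.3f}")
```

Points lying along the diagonal indicate agreement with the null distribution; a systematic upward departure across the whole range is the signature of inflation from population stratification.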
Figure 5. SNP ranking correlation in prostate cancer studies.
In each plot, SNPs' rankings based on the 1 d.f. Wald test on 475,116 testing autosomal SNPs without PC adjustment are compared with their rankings based on the 1 d.f. Wald test with adjustment for PCs chosen by the permutation procedure. The SNPs in blue are ranked among the top 5% by tests both with and without PC adjustment. The SNPs in green and orange are ranked among the top 5% by only one of the tests. A) Results based on the original prostate cancer study (prostate cancer cases and controls from PLCO). The 1st PC was chosen for PS correction. B) Results based on the reconstructed prostate cancer study using external controls (prostate cancer cases from PLCO, and external controls from NHS). The 1st, 2nd and 4th PCs were chosen for PS correction.
Figure 6. The conditional ranking distribution for the original PLCO prostate cancer study.
Each plot shows the histogram of ranks according to the test without PC adjustment for SNPs ranked within a given range by the test with the adjustment for the 1st PC (chosen by the proposed permutation procedure). The ranking ranges (%) are shown on the horizontal axis. The frequencies (%) are shown on the vertical axis. A) The histogram of ranks for SNPs ranked in the top 0–1% by the test with PC adjustment. B) The histogram of ranks for SNPs ranked in the top 1–2% by the test with PC adjustment. C) The histogram of ranks for SNPs ranked in the top 2–3% by the test with PC adjustment. D) The histogram of ranks for SNPs ranked in the top 3–4% by the test with PC adjustment. E) The histogram of ranks for SNPs ranked in the top 4–5% by the test with PC adjustment.
Figure 7. The conditional ranking distribution for the reconstructed prostate cancer study using external controls.
Each plot shows the histogram of ranks according to the test without PC adjustment for SNPs ranked within a given range by the test with the adjustment for the 1st, 2nd, and 4th PCs (chosen by the proposed permutation procedure). The ranking ranges (%) are shown on the horizontal axis. The frequencies (%) are shown on the vertical axis. A) The histogram of ranks for SNPs ranked in the top 0–1% by the test with PC adjustment. B) The histogram of ranks for SNPs ranked in the top 1–2% by the test with PC adjustment. C) The histogram of ranks for SNPs ranked in the top 2–3% by the test with PC adjustment. D) The histogram of ranks for SNPs ranked in the top 3–4% by the test with PC adjustment. E) The histogram of ranks for SNPs ranked in the top 4–5% by the test with PC adjustment.
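
The ranking comparisons in Figures 5 to 7, and the roughly 80% overlap reported in the abstract, rest on comparing the sets of top-ranked SNPs from two tests. A minimal sketch of that overlap calculation is given below. The function names and the handling of ties (stable sort order) are assumptions made for illustration, not details taken from the paper.

```python
def top_fraction(p_values, fraction=0.05):
    """Return the indices of the SNPs with the smallest P-values."""
    k = max(1, int(len(p_values) * fraction))
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    return set(order[:k])


def top_overlap(p_unadjusted, p_adjusted, fraction=0.05):
    """Share of top-ranked SNPs (unadjusted test) also top-ranked after adjustment."""
    top_a = top_fraction(p_unadjusted, fraction)
    top_b = top_fraction(p_adjusted, fraction)
    return len(top_a & top_b) / len(top_a)


if __name__ == "__main__":
    # Toy data: 100 SNPs, with one top-5% SNP displaced by the adjustment.
    unadjusted = [(i + 1) / 100 for i in range(100)]
    adjusted = list(unadjusted)
    adjusted[4], adjusted[50] = adjusted[50], adjusted[4]
    print(top_overlap(unadjusted, adjusted))
```

Both lists must index the same SNPs in the same order. The same index sets could be binned by rank range to build conditional histograms like those in Figures 6 and 7.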

Cited by 73 articles

  • Polygenic risk score for the prediction of breast cancer is related to lesser terminal duct lobular unit involution of the breast.
    Bodelon C, Oh H, Derkach A, Sampson JN, Sprague BL, Vacek P, Weaver DL, Fan S, Palakal M, Papathomas D, Xiang J, Patel DA, Linville L, Clare SE, Visscher DW, Mies C, Hewitt SM, Brinton LA, Storniolo AMV, He C, Chanock SJ, Garcia-Closas M, Gierach GL, Figueroa JD. NPJ Breast Cancer. 2020 Sep 7;6:41. doi: 10.1038/s41523-020-00184-7. eCollection 2020. PMID: 32964115. Free PMC article.
  • Low-frequency variation near common germline susceptibility loci are associated with risk of Ewing sarcoma.
    Lin SH, Sampson JN, Grünewald TGP, Surdez D, Reynaud S, Mirabeau O, Karlins E, Rubio RA, Zaidi S, Grossetête-Lalami S, Ballet S, Lapouble E, Laurence V, Michon J, Pierron G, Kovar H, Kontny U, González-Neira A, Alonso J, Patino-Garcia A, Corradini N, Bérard PM, Miller J, Freedman ND, Rothman N, Carter BD, Dagnall CL, Burdett L, Jones K, Manning M, Wyatt K, Zhou W, Yeager M, Cox DG, Hoover RN, Khan J, Armstrong GT, Leisenring WM, Bhatia S, Robison LL, Kulozik AE, Kriebel J, Meitinger T, Metzler M, Krumbholz M, Hartmann W, Strauch K, Kirchner T, Dirksen U, Mirabello L, Tucker MA, Tirode F, Morton LM, Chanock SJ, Delattre O, Machiela MJ. PLoS One. 2020 Sep 3;15(9):e0237792. doi: 10.1371/journal.pone.0237792. eCollection 2020. PMID: 32881892. Free PMC article.
  • Inherited genetic susceptibility to acute lymphoblastic leukemia in Down syndrome.
    Brown AL, de Smith AJ, Gant VU, Yang W, Scheurer ME, Walsh KM, Chernus JM, Kallsen NA, Peyton SA, Davies GE, Ehli EA, Winick N, Heerema NA, Carroll AJ, Borowitz MJ, Wood BL, Carroll WL, Raetz EA, Feingold E, Devidas M, Barcellos LF, Hansen HM, Morimoto L, Kang AY, Smirnov I, Healy J, Laverdière C, Sinnett D, Taub JW, Birch JM, Thompson P, Spector LG, Pombo-de-Oliveira MS, DeWan AT, Mullighan CG, Hunger SP, Pui CH, Loh ML, Zwick ME, Metayer C, Ma X, Mueller BA, Sherman SL, Wiemels JL, Relling MV, Yang JJ, Lupo PJ, Rabin KR. Blood. 2019 Oct 10;134(15):1227-1237. doi: 10.1182/blood.2018890764. PMID: 31350265. Free PMC article.
  • A Powerful Method To Test Associations Between Ordinal Traits and Genotypes.
    Wang J, Ding J, Huang S, Li Q, Pan D. G3 (Bethesda). 2019 Aug 8;9(8):2573-2579. doi: 10.1534/g3.119.400293. PMID: 31167832. Free PMC article.
  • Childhood asthma is associated with COPD and known asthma variants in COPDGene: a genome-wide association study.
    Hayden LP, Cho MH, Raby BA, Beaty TH, Silverman EK, Hersh CP; COPDGene Investigators. Respir Res. 2018 Oct 29;19(1):209. doi: 10.1186/s12931-018-0890-0. PMID: 30373671. Free PMC article.