Nat Commun. 2019 Jan 18;10(1):333.
doi: 10.1038/s41467-018-08219-1.

Apparent latent structure within the UK Biobank sample has implications for epidemiological analysis

Simon Haworth et al. Nat Commun.

Abstract

Large studies use genotype data to discover genetic contributions to complex traits and infer relationships between those traits. Coincident geographical variation in genotypes and health traits can bias these analyses. Here we show that single genetic variants and genetic scores composed of multiple variants are associated with birth location within UK Biobank and that geographic structure in genotype data cannot be accounted for using routine adjustment for study centre and principal components derived from genotype data. We find that major health outcomes appear geographically structured and that coincident structure in health outcomes and genotype data can yield biased associations. Understanding and accounting for this phenomenon will be important when making inference from genotype data in large studies.
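The bias described in the abstract can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the authors' analysis: the single latent coordinate `north`, the effect sizes of 0.5, and the helper names `ols_slope`, `residualise` and `simulate` are all hypothetical. The simulated score has no causal effect on the outcome, yet the unadjusted regression slope is non-zero because both variables vary along the same geographical axis. Adjusting for the true coordinate removes the association here only because the simulation makes that coordinate directly observable, which is an idealisation.

```python
import random


def ols_slope(x, y):
    """Slope from a simple least-squares regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def residualise(values, covariate):
    """Remove the linear effect of a single covariate from values."""
    slope = ols_slope(covariate, values)
    mean_v = sum(values) / len(values)
    mean_c = sum(covariate) / len(covariate)
    return [v - mean_v - slope * (c - mean_c) for v, c in zip(values, covariate)]


def simulate(n=20000, seed=1):
    """Return (unadjusted slope, slope adjusted for the latent coordinate)."""
    rng = random.Random(seed)
    # Hypothetical latent geographical coordinate (assumption for illustration).
    north = [rng.gauss(0, 1) for _ in range(n)]
    # Score and outcome both depend on geography; the score does NOT cause the outcome.
    score = [0.5 * z + rng.gauss(0, 1) for z in north]
    outcome = [0.5 * z + rng.gauss(0, 1) for z in north]
    naive = ols_slope(score, outcome)
    # Regressing residuals on residuals gives the covariate-adjusted slope.
    adjusted = ols_slope(residualise(score, north), residualise(outcome, north))
    return naive, adjusted


if __name__ == "__main__":
    naive, adjusted = simulate()
    print(f"unadjusted slope: {naive:.3f}")
    print(f"adjusted slope:   {adjusted:.3f}")
```

With these assumed parameters the unadjusted slope should be close to 0.2 (covariance 0.25 divided by variance 1.25), while the adjusted slope should be close to zero.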

Conflict of interest statement

D.J.L. is a director of and shareholder in GENSCI LTD. The remaining authors declare no competing interests.

Figures

Fig. 1
Within-UK ancestry predicts migration that confounds education: estimated educational attainment of the United Kingdom, when seen only through the ALSPAC cohort based in Bristol. Scores are 1: vocational, 2: CSEs, 3: O-levels, 4: A-levels, 5: degree. CSE Certificate of Secondary Education. The predicted mean education for each region is given, along with 95% confidence intervals estimated by bootstrap resampling of individuals. Each region is coloured by predicted mean education, where predicted mean = 2 is shaded in red and predicted mean = 5 is shaded in white. See Methods for details. ALSPAC Avon Longitudinal Study of Parents and Children
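The bootstrap confidence intervals mentioned in the Fig. 1 caption can be sketched as follows. This is an illustrative example only: the caption states that individuals were resampled, but the percentile method, the number of resamples, and the function name `bootstrap_mean_ci` are assumptions, and the scores in the usage example are made up.

```python
import random


def bootstrap_mean_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Return (mean, lower, upper) using a percentile bootstrap over individuals."""
    rng = random.Random(seed)
    n = len(values)
    means = []
    for _ in range(n_resamples):
        # Resample individuals with replacement, keeping the sample size fixed.
        resample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    # Percentile interval: take the alpha/2 and 1 - alpha/2 quantiles of the resampled means.
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(values) / n, lower, upper


if __name__ == "__main__":
    # Hypothetical education scores (1-5 scale) for one region.
    scores = [1, 2, 3, 4, 5, 3, 4, 2, 5, 4]
    mean, lower, upper = bootstrap_mean_ci(scores)
    print(f"mean = {mean:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```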
Fig. 2
The relationship between polygenic scores (PS; right-hand label) and geographical terms (left-hand label) within the UK Biobank sample. Tiles are shaded by p value testing the null hypothesis of no association between PS and geographical term, where p = 0 is shaded in black and p = 2e−16 is shaded in red. Statistical adjustment was performed as follows: model 1: no adjustment; model 2: adjustment for genotyping array only; model 3: adjustment for genotyping array, 10 principal components (PCs) and study participation centre; model 4: adjustment for genotyping array, 40 PCs and study participation centre
Fig. 3
Fitted spline regression plots showing the non-linear distribution of polygenic scores (PS) for educational attainment (weighted version, including variants with p < 1.0e−05) in the unadjusted model (left) and in the model after adjustment for 40 principal components and study centre (right). The centres of major population centres are marked for reference. The shaded area represents 95% confidence intervals
Fig. 4
Attenuation in the linear relationship between polygenic scores (PS) and complex traits in the UK Biobank sample at varying degrees of statistical adjustment. N sibs refers to number of siblings. For each PS, the relationship with four traits was estimated using an unadjusted model (plotted as circles) and this estimate and its corresponding 95% confidence intervals were rescaled to a value of 1. Error bars represent 95% confidence intervals for the rescaled estimate. Adjustment was then performed for genotyping array only (triangles); genotyping array, 40 principal components (PCs) and study participation centre (crosses); and 40 PCs, study participation centre and non-linear regression terms for North and East axes of birth location (squares). A value of 0.5 on the y-axis would mean that 50% of the unadjusted effect estimate remained after adjustment. Lines are drawn at x = 1 (red) and y = 0 (black) for reference
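The rescaling described in the Fig. 4 caption amounts to expressing each estimate as a fraction of the unadjusted estimate. The sketch below assumes that every estimate and its confidence bounds are simply divided by the unadjusted point estimate; the caption does not spell out the exact procedure, and the function name `rescale_to_unadjusted` and the numbers in the usage example are hypothetical.

```python
def rescale_to_unadjusted(estimate, ci_low, ci_high, unadjusted_estimate):
    """Express an estimate and its confidence bounds as a fraction of the unadjusted estimate."""
    if unadjusted_estimate == 0:
        raise ValueError("unadjusted estimate must be non-zero to rescale")
    scaled_estimate = estimate / unadjusted_estimate
    bound_a = ci_low / unadjusted_estimate
    bound_b = ci_high / unadjusted_estimate
    # Dividing by a negative number swaps the bounds, so reorder them.
    return scaled_estimate, min(bound_a, bound_b), max(bound_a, bound_b)


if __name__ == "__main__":
    # Hypothetical values: unadjusted effect 0.2, adjusted effect 0.1 (95% CI 0.08 to 0.12).
    print(rescale_to_unadjusted(0.2, 0.16, 0.24, 0.2))  # unadjusted model maps to about 1
    print(rescale_to_unadjusted(0.1, 0.08, 0.12, 0.2))  # about half the effect remains
```

Under this assumption, the unadjusted model always maps to 1 and an adjusted estimate of 0.5 means that half of the unadjusted effect remains, matching the caption's reading of the y-axis.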
