Nature, 562 (7726), 203-209

The UK Biobank Resource With Deep Phenotyping and Genomic Data

Clare Bycroft et al. Nature.

Abstract

The UK Biobank project is a prospective cohort study with deep genetic and phenotypic data collected on approximately 500,000 individuals from across the United Kingdom, aged between 40 and 69 at recruitment. The open resource is unique in its size and scope. A rich variety of phenotypic and health-related information is available on each participant, including biological measurements, lifestyle indicators, biomarkers in blood and urine, and imaging of the body and brain. Follow-up information is provided by linking health and medical records. Genome-wide genotype data have been collected on all participants, providing many opportunities for the discovery of new genetic associations and the genetic bases of complex traits. Here we describe the centralized analysis of the genetic data, including genotype quality, properties of population structure and relatedness of the genetic data, and efficient phasing and genotype imputation that increases the number of testable variants to around 96 million. Classical allelic variation at 11 human leukocyte antigen genes was imputed, resulting in the recovery of signals with known associations between human leukocyte antigen alleles and many diseases.

Conflict of interest statement

J.M. is a founder and director of Gensci Ltd. P.D., G.M. and S.L. are partners in Peptide Groove LLP. G.M. and P.D. are founders and directors of Genomics Plc. The remaining authors declare no competing financial interests.

Figures

Fig. 1. Summary of the UK Biobank resource and genotyping array content.
Summary of the major components of the UK Biobank resource. See Extended Data Table 1 for more details. The figure also shows a schematic representation of the different categories of content on the UK Biobank Axiom genotype array. Numbers indicate the approximate count of markers within each category, ignoring any overlap. A more detailed description of the array content is available in the UK Biobank Axiom Array Content Summary.
Fig. 2. Summary of genotype data quality and content.
All plots show properties of the UK Biobank genotype data after applying quality control. a, MAF distribution based on all samples (805,426 markers). The inset shows rare markers only (MAF < 0.01). b, The distribution of the number of batch-level quality control (QC) tests that a marker fails (see Methods). For each of four MAF ranges, we show the fraction of markers that fail the specified number of batches. c, Comparison of MAF in UK Biobank with the frequency of the same allele in ExAC, among the European-ancestry participants within each study (Supplementary Information). This analysis used 91,298 overlapping markers. Each hexagonal bin is coloured according to the number of markers falling in that bin (log10 scale). The dashed red line shows x = y. The markers with very different allele frequencies seen on the top, bottom and left-hand sides of the plot comprise approximately 300 markers. This is 0.3% of all markers in the comparison (see Supplementary Information for discussion). d, Mean log2 ratios (L2R) on X and Y chromosomes for each sample, indicating probable sex chromosome aneuploidy (see Methods). There are 652 samples with a probable sex chromosome aneuploidy (indicated by crosses). Locations of clusters of individuals with different putative karyotypes are indicated by Greek symbols: λ = X0 (or mosaic XX/X0), θ = XXX, α = XXY, and π = XYY. Counts of individuals in these regions are given in Supplementary Table 2. The colours indicate different combinations of self-reported sex, and sex inferred by Affymetrix (from the genetic data). For almost all samples (99.9%), the self-reported and the inferred sex are the same, but for a small number of samples (378) they do not match (see Supplementary Information for discussion).
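As a rough illustration of the quantity plotted in panel a, minor allele frequency (MAF) at a biallelic marker can be computed from genotype counts. This is only a sketch: the function name and the example counts are hypothetical and are not taken from the UK Biobank pipeline.

```python
def minor_allele_frequency(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Return the MAF at a biallelic marker from AA, AB and BB genotype counts.

    Samples with missing calls are assumed to have been excluded already.
    """
    n_alleles = 2 * (n_aa + n_ab + n_bb)
    if n_alleles == 0:
        raise ValueError("no called genotypes at this marker")
    # Frequency of allele B: one copy per heterozygote, two per BB homozygote.
    freq_b = (n_ab + 2 * n_bb) / n_alleles
    # The minor allele is whichever of A and B is rarer.
    return min(freq_b, 1.0 - freq_b)


if __name__ == "__main__":
    # Hypothetical counts: 90 AA, 10 AB, 0 BB.
    print(minor_allele_frequency(90, 10, 0))
```

A marker would fall into the inset of panel a when this value is below 0.01.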
Fig. 3. Ancestral diversity and familial relatedness.
a, Each point represents a UK Biobank participant (n = 488,377 samples) and is placed according to their principal component (PC) scores in each of the top four principal components. Colours and shapes indicate the self-reported ethnic background of each individual. See Extended Data Table 3 for proportions in each category. b, Distribution of the number of relatives that participants have in the UK Biobank cohort. The height of each bar shows the count of participants (log10 scale) with the stated number of relatives. The colours indicate the proportions of each relatedness class within a bar. c, Examples of family groups within the UK Biobank cohort. Points represent participants, and coloured lines between points indicate their inferred relationship (for example, blue lines join full siblings). The integers show the total number of family networks in the cohort (if more than one) with that same configuration, ignoring third-degree pairs.
Fig. 4. Association statistics for human height.
Results (P values) of association tests between human height and genotypes using three different sets of data for chromosome 2. In a–c, P values are shown on the −log10 scale, capped at 50 for visual clarity and uncorrected for multiple comparisons. Markers with −log10(P) > 50 are plotted at 50 on the y axis and shown as triangles rather than dots. Horizontal red lines denote P = 5 × 10^−8. a, Results for published meta-analysis by GIANT (n = 253,288), with NCBI GWAS catalogue markers superimposed in red (plotted at the reported P values). b, Association statistics (from linear mixed model, see Methods) for UK Biobank markers in the genotype data (n = 343,321). c, Association statistics (from linear mixed model, see Methods) for UK Biobank markers in the imputed data (n = 343,321). Points coloured pink indicate genotyped markers that were used in pre-phasing and imputation. This means that most of the data at each of these markers comes from the genotyping assay. Black points (the vast majority, ~8 million) indicate fully imputed markers. d, Venn diagram of the results of counting the number of 1-Mb windows with at least one locus with P < 5 × 10^−8 in the GIANT, UK Biobank genotyped and UK Biobank imputed datasets (see Methods). Percentages in brackets are the proportion of the union of such windows across all three data sources (1,215). There were only three windows contained in UK Biobank genotyped data and not the imputed data. e, Comparison of Z-scores in UK Biobank (y axis) and GIANT (x axis). Z-scores were calculated as effect size divided by standard error, but only for markers with P < 5 × 10^−8 in GIANT, for a set of 575 associated regions, which we also used for the credible set analysis (see Methods). The marker with the smallest P value (in GIANT) within each region is highlighted with blue circles. The black dotted line shows x = y, and the red solid line shows the linear regression line estimated on these data. The standard error of the regression coefficient is shown in brackets. Pearson’s correlation was used to calculate the r^2 value.
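The quantities in this caption can be sketched in a few lines. The code below is a minimal illustration, not the authors' implementation: it assumes non-overlapping 1-Mb windows aligned to multiples of 1,000,000 bp (the Methods may define windows differently), and all function names and the marker tuple layout are hypothetical.

```python
import math
from typing import Iterable, Set, Tuple

GENOME_WIDE_P = 5e-8      # significance threshold drawn as the red line
WINDOW_BP = 1_000_000     # assumed window size and alignment


def z_score(effect_size: float, standard_error: float) -> float:
    """Z-score as described in panel e: effect size divided by standard error."""
    return effect_size / standard_error


def capped_neg_log10_p(p: float, cap: float = 50.0) -> float:
    """-log10(P), capped at `cap` as in panels a-c."""
    if p <= 0.0:
        return cap  # P values that underflow to zero are plotted at the cap
    return min(-math.log10(p), cap)


def significant_windows(
    markers: Iterable[Tuple[str, int, float]]
) -> Set[Tuple[str, int]]:
    """Return (chromosome, window index) for windows with at least one
    marker with P below the genome-wide threshold.

    Each marker is a (chromosome, position in bp, P value) tuple.
    """
    return {
        (chrom, pos // WINDOW_BP)
        for chrom, pos, p in markers
        if p < GENOME_WIDE_P
    }


if __name__ == "__main__":
    demo = [("2", 1_500_000, 1e-9), ("2", 1_900_000, 1e-20), ("2", 2_100_000, 0.5)]
    print(significant_windows(demo))
```

Computing `significant_windows` separately for each data source and comparing the resulting sets with set operations would give counts of the kind shown in the Venn diagram of panel d.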
Extended Data Fig. 1. Summary of sample-based quality control.
a–c, The three plots show heterozygosity and missing rates, which we used to flag poor quality samples (n = 488,377 samples). Panels a and b show heterozygosity for each sample before and after, respectively, correcting for ancestral background using principal components. The symbols (shapes and colours) indicate the self-reported ethnic background of each participant. Panel c shows the set of 968 samples we flagged as outliers (in red), and all other samples (in black), with shapes the same as for the other two plots. The vertical line shows the threshold we used to call samples as outliers on missing rate. In all plots missing rate data are transformed to the logit scale, but with the axis annotated with the original values.
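The logit transformation mentioned at the end of this caption can be sketched as follows. The function names are hypothetical, and the tick values in the example are illustrative rather than the ones used in the figure.

```python
import math


def logit(rate: float) -> float:
    """Map a rate in (0, 1) to the logit scale: log(rate / (1 - rate))."""
    if not 0.0 < rate < 1.0:
        raise ValueError("rate must lie strictly between 0 and 1")
    return math.log(rate / (1.0 - rate))


def inverse_logit(value: float) -> float:
    """Map a logit-scale value back to the original rate."""
    return 1.0 / (1.0 + math.exp(-value))


if __name__ == "__main__":
    # Place ticks at logit positions but label them with the original rates.
    for rate in (0.001, 0.01, 0.05):
        print(f"label {rate} at axis position {logit(rate):.3f}")
```

Plotting on this scale spreads out the very small missing rates that would otherwise be compressed near zero, while the axis labels remain readable as raw rates.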
Extended Data Fig. 2. Examples of intensity data and genotype calls for markers of different allele frequencies.
Each sub-figure shows intensity data for a single marker within six different batches. Batches labelled with the prefix ‘UKBiLEVEAX’ contain only samples typed using the UK BiLEVE Axiom array, and those with the prefix ‘batch’ contain only samples typed using the UK Biobank Axiom array. Each point represents one sample and is coloured according to the inferred genotype at the marker. The x and y axes are transformations of the intensities for probe sets targeting each of the alleles ‘A’ and ‘B’ (see Supplementary Information for definition of probe set). The ellipses indicate the location and shape of the posterior probability distribution (two-dimensional multivariate normal) of the transformed intensities for the three genotypes in the stated batch. That is, each ellipse is drawn such that it contains 85% of the probability density. See Affymetrix Axiom Genotyping Solution Data Analysis Guide for more details of Affymetrix genotype calling. The MAF of each of the markers is computed using all samples in the released UK Biobank genotype data. a, A marker with a MAF of 0.077 with well-separated genotype clusters. b, Intensities for a marker with a MAF of 0.00092 with well-separated genotype clusters. As would be expected under Hardy–Weinberg equilibrium, there are no instances of samples with the minor homozygote genotype. c, Intensities for a marker with a MAF of 0.00066, and in which the heterozygote cluster is not well separated from the large major homozygote cluster in some batches, making it more difficult to call the heterozygous genotypes confidently.
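Two statements in this caption lend themselves to a quick numerical check. The sketch below assumes a sample size of 488,377 (the figure quoted for the genotype data in other captions) and uses the standard result that, for a two-dimensional normal distribution, the squared Mahalanobis distance d^2 satisfies P(d^2 ≤ c) = 1 − exp(−c/2). The function names are hypothetical.

```python
import math


def expected_minor_homozygotes(maf: float, n_samples: int) -> float:
    """Expected count of minor homozygotes under Hardy-Weinberg equilibrium,
    where the minor homozygote genotype has frequency maf squared."""
    return n_samples * maf ** 2


def ellipse_mahalanobis_radius(mass: float) -> float:
    """Mahalanobis radius of the ellipse containing `mass` of the probability
    density of a bivariate normal distribution."""
    if not 0.0 < mass < 1.0:
        raise ValueError("mass must lie strictly between 0 and 1")
    return math.sqrt(-2.0 * math.log(1.0 - mass))


if __name__ == "__main__":
    # Marker in panel b: MAF 0.00092; sample size is an assumption (see above).
    print(expected_minor_homozygotes(0.00092, 488_377))  # about 0.41
    # Ellipses in the figure contain 85% of the probability density.
    print(ellipse_mahalanobis_radius(0.85))              # about 1.95
```

An expected count of well under one minor homozygote is consistent with the caption's remark that none are observed for the marker in panel b.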
Extended Data Fig. 3. Mean principal component scores for each self-reported country of birth.
Each column shows one principal component and each element is the mean principal component score for individuals born in the labelled country, scaled by the standard deviation of the scores for that principal component. Elements in each column are only coloured if the country has a non-zero coefficient (P < 10^−5; two-sided t-test) in a linear model with country of birth as predictor and principal component scores as outcome (n = 487,848 samples). Countries (rows) have been ordered using hierarchical clustering (‘hclust’ function in R). The symbols next to each country label indicate the most common ethnic background category among the participants born in that country. For example, the most common self-reported ethnic background of participants born in Sri Lanka is ‘Any other Asian background’. Countries with fewer than 20 individuals born there were excluded from this analysis.
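The scaled mean described at the start of this caption can be sketched for a single principal component as below. This is an illustration under assumptions: the function name is hypothetical, and the use of the sample standard deviation over all individuals is a guess, since the caption does not say which estimator was used.

```python
from collections import defaultdict
from statistics import mean, stdev
from typing import Dict, List, Sequence


def scaled_country_means(
    scores: Sequence[float],
    countries: Sequence[str],
    min_count: int = 20,
) -> Dict[str, float]:
    """Mean score per country of birth for one principal component, divided
    by the standard deviation of that component's scores.

    Countries with fewer than `min_count` individuals are excluded, matching
    the threshold of 20 stated in the caption.
    """
    sd = stdev(scores)  # assumption: sample SD over all individuals
    by_country: Dict[str, List[float]] = defaultdict(list)
    for score, country in zip(scores, countries):
        by_country[country].append(score)
    return {
        country: mean(values) / sd
        for country, values in by_country.items()
        if len(values) >= min_count
    }


if __name__ == "__main__":
    demo_scores = [1.0] * 20 + [-1.0] * 20 + [5.0]
    demo_countries = ["A"] * 20 + ["B"] * 20 + ["C"]
    print(scaled_country_means(demo_scores, demo_countries))
```

Repeating this for each principal component would give one column of the figure per component.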
Extended Data Fig. 4. Distribution of information scores at autosomal markers in the imputed dataset.
The top left graph shows the full distribution of the information scores. The remaining panels show distributions in tranches of MAF; MAF > 5%, 1% ≤ MAF < 5%, 0.1% ≤ MAF < 1%, 0.01% ≤ MAF < 0.1% and 0.001% ≤ MAF < 0.01%.
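The MAF tranches listed above can be expressed as a small lookup. The sketch follows the bounds exactly as written in the caption, so a MAF of exactly 5%, or one below 0.001%, falls in no tranche and yields `None`. The function name is hypothetical.

```python
from typing import Optional

# (lower bound inclusive, upper bound exclusive, label), as listed in the caption.
BOUNDED_TRANCHES = [
    (0.01, 0.05, "1% ≤ MAF < 5%"),
    (0.001, 0.01, "0.1% ≤ MAF < 1%"),
    (0.0001, 0.001, "0.01% ≤ MAF < 0.1%"),
    (0.00001, 0.0001, "0.001% ≤ MAF < 0.01%"),
]


def maf_tranche(maf: float) -> Optional[str]:
    """Return the caption's tranche label for a MAF given as a proportion."""
    if maf > 0.05:
        return "MAF > 5%"
    for lower, upper, label in BOUNDED_TRANCHES:
        if lower <= maf < upper:
            return label
    return None  # not covered by any tranche in the caption


if __name__ == "__main__":
    for value in (0.2, 0.02, 0.00005):
        print(value, "->", maf_tranche(value))
```

Grouping information scores by this label would reproduce the per-tranche panels.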
Extended Data Fig. 5. Example region of association in standing height GWAS.
GWAS association statistics (P values) for standing height focusing on a ~3-Mb region of chromosome 2 that did not reach genome-wide significance in the GIANT (2014) meta-analysis, but did in UK Biobank (linear mixed model; see Methods). The P values shown are not adjusted for multiple testing. Markers genotyped in the UK Biobank are shown as diamonds, and imputed markers as circles. The two markers with the smallest P value for each of the genotyped data and imputed data are enlarged and highlighted with black outlines, and other UK Biobank markers are coloured according to their correlation (r^2) with one of these two. That is, genotyped markers with the leading genotyped marker (rs17713396), and imputed markers with the leading imputed marker (rs12714401). Markers with r^2 values of less than 0.1 are shown as black or green.
Extended Data Fig. 6. Comparison of fine-mapping in GIANT (2014) and UK Biobank imputed data.
Here we summarize results of our credible set analysis in GIANT (2014) and UK Biobank for 575 genomic regions associated with standing height in both studies (see Methods). A red solid line on a plot indicates where x = y. a, Both plots compare the number of markers in the 95% credible sets in which the size is less than 18 markers in both studies (363 regions in the left-hand plot; 445 in the right-hand plot). b, c, Both plots are from the analysis considering all markers in each study. In b we show, for each region, the proportion of markers used in the analysis for a given study that are in the 95% credible set for that study. The plot contains the same 363 regions as shown in the left-hand plot in a. In c we summarize, for all 575 regions, how much weight our UK Biobank analysis placed on markers that our analysis of GIANT (2014) indicated were important.
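For readers unfamiliar with credible sets, the sketch below shows one commonly used construction: rank the markers in a region by posterior probability and keep the smallest leading group whose cumulative probability reaches 95%. This is an assumption for illustration only. The caption does not describe how posteriors were obtained or how the sets were built (see Methods), and the function name and example values are hypothetical.

```python
from typing import Dict, List


def credible_set(posteriors: Dict[str, float], level: float = 0.95) -> List[str]:
    """Return marker IDs forming a credible set at the given level.

    `posteriors` maps marker ID to a non-negative weight for one region;
    weights are normalised here so that they sum to one.
    """
    total = sum(posteriors.values())
    if total <= 0.0:
        raise ValueError("posterior weights must sum to a positive value")
    # Highest posterior first, so the resulting set is as small as possible.
    ranked = sorted(posteriors.items(), key=lambda item: item[1], reverse=True)
    chosen: List[str] = []
    cumulative = 0.0
    for marker, weight in ranked:
        chosen.append(marker)
        cumulative += weight / total
        if cumulative >= level:
            break
    return chosen


if __name__ == "__main__":
    demo = {"m1": 0.6, "m2": 0.3, "m3": 0.06, "m4": 0.04}
    print(credible_set(demo))
```

The set size compared in panel a would then be `len(credible_set(...))`, and the proportion shown in panel b would be that size divided by the number of markers analysed in the region.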

References

    1. Plenge RM, Scolnick EM, Altshuler D. Validating therapeutic targets through human genetics. Nat. Rev. Drug Discov. 2013;12:581–594. doi: 10.1038/nrd4051.
    2. The UK Biobank. UK Biobank Axiom Array Content Summary. http://www.ukbiobank.ac.uk/wp-content/uploads/2014/04/UK-Biobank-Axiom-Array-Content-Summary-2014.pdf (2014).
    3. The UK Biobank. Genotyping and Quality Control of UK Biobank, a Large-Scale, Extensively Phenotyped Prospective Resource. http://biobank.ctsu.ox.ac.uk/crystal/docs/genotyping_qc.pdf (2015).
    4. Young AI, Wauthier F, Donnelly P. Multiple novel gene-by-environment interactions modify the effect of FTO variants on body mass index. Nat. Commun. 2016;7:12724. doi: 10.1038/ncomms12724.
    5. Astle WJ, et al. The allelic landscape of human blood cell trait variation and links to common complex disease. Cell. 2016;167:1415–1429.e19. doi: 10.1016/j.cell.2016.10.042.
