Nature. 2017 Jun 15;546(7658):370-375.
doi: 10.1038/nature22403. Epub 2017 May 10.

Common Genetic Variation Drives Molecular Heterogeneity in Human iPSCs

Free PMC article

Helena Kilpinen et al. Nature.
Erratum in

  • Corrigendum: Common genetic variation drives molecular heterogeneity in human iPSCs.
    Kilpinen H, Goncalves A, Leha A, Afzal V, Alasoo K, Ashford S, Bala S, Bensaddek D, Casale FP, Culley OJ, Danecek P, Faulconbridge A, Harrison PW, Kathuria A, McCarthy D, McCarthy SA, Meleckyte R, Memari Y, Moens N, Soares F, Mann A, Streeter I, Agu CA, Alderton A, Nelson R, Harper S, Patel M, White A, Patel SR, Clarke L, Halai R, Kirton CM, Kolb-Kokocinski A, Beales P, Birney E, Danovi D, Lamond AI, Ouwehand WH, Vallier L, Watt FM, Durbin R, Stegle O, Gaffney DJ. Nature. 2017 Jun 29;546(7660):686. doi: 10.1038/nature23012. Epub 2017 Jun 14. PMID: 28614302


Technology utilizing human induced pluripotent stem cells (iPS cells) has enormous potential to provide improved cellular models of human disease. However, variable genetic and phenotypic characterization of many existing iPS cell lines limits their potential use for research and therapy. Here we describe the systematic generation, genotyping and phenotyping of 711 iPS cell lines derived from 301 healthy individuals by the Human Induced Pluripotent Stem Cells Initiative. Our study outlines the major sources of genetic and phenotypic variation in iPS cells and establishes their suitability as models of complex human traits and cancer. Through genome-wide profiling we find that 5–46% of the variation in different iPS cell phenotypes, including differentiation capacity and cellular morphology, arises from differences between individuals. Additionally, we assess the phenotypic consequences of genomic copy-number alterations that are repeatedly observed in iPS cells. We also present a comprehensive map of common regulatory variants affecting the transcriptome of human pluripotent cells.

Conflict of interest statement

Details of the data generated during the project, including archive accession identifiers for obtaining the data, are described in the Supplementary Information. The HipSci website also has full details of all publicly available data and instructions for researchers to apply for access to data in the European Genome-phenome Archive (EGA). The authors declare no competing financial interests.


Extended Data Figure 1. Overview of the Cellomics assay.
(a) Example plate layout for the cellular differentiation assay. Images are shown for the pluripotency markers (Oct4, Sox2, and Nanog) as they are measured in the Cellomics imaging device. Each line is measured in two rows of the same plate as technical replicates. The secondary antibody used for each marker is shown in parentheses. Each plate also has measurements for staining with the secondary antibody only, which serves as a means to assess background fluorescence. The red channel shows the signal from the DAPI staining, the green channel the marker signal. As expected, there is little signal from the green channel in the wells stained only with the secondary antibody. Image acquisition stops as soon as 10,000 cells have been detected. (b) Detailed variance components of the Cellomics markers (Methods). Substantial proportions of the marker variance could be attributed to batch factors, including staining, technician effects and antibody lots. These effects mean that the fraction of cells expressing particular markers needs to be interpreted with caution (Fig. 1c,d). (c) Pairwise correlation between quantitative expression scores derived from immunostaining for pluripotency and differentiation and the PluriTest score.
Extended Data Figure 2. PluriTest scores in the two culture conditions.
(a-c) Comparison of PluriTest novelty score versus pluripotency score for the 711 lines generated. Lines grown under feeder-free conditions (E8 media) scored systematically lower than feeder-dependent lines (P = 1.62 × 10⁻⁴³, t-test, for pluripotency score). We note that, while we cannot rule out that feeder-free lines are less pluripotent, feeder-free conditions are not well represented in the PluriTest training dataset, which may explain this result (of the 204 ESC/iPSC lines in the PluriTest paper that have media metadata available, none were on E8 and only 37 were on a variety of other feeder-free formulations such as mTeSR). (d) Despite lower pluripotency scores, lines grown under feeder-free conditions have higher fractions of cells expressing canonical protein markers of pluripotency.
Extended Data Figure 3. Extended CNA analysis.
Relationship between the number of CNAs and other experimental factors, using three minimum length thresholds for calling CNAs: 200 Kb, 500 Kb and 1,000 Kb. Values on the x-axis have been ‘jittered’ (i.e. small random ‘noise’ has been added to the true values) to enhance the visualisation. Data points underlying the boxplots are shown as semi-transparent blue dots. (a) Number of CNAs per line versus passage number. P-values shown are from a generalized linear mixed model (Poisson regression) with donor random effect. (b) Boxplot of the number of autosomal CNAs per line versus growth media. P-values are for a Poisson regression on culture condition. (c-d) Number of autosomal CNAs per line versus PluriTest pluripotency and novelty scores. P-values are for a linear mixed model on the number of autosomal CNAs per line with a donor random effect. (e-f) CNA counts per donor versus gender and donor age. CNA counts refer to the total number of unique CNAs across all lines derived from the same donor. CNAs that are shared between lines of the same donor (overlap by at least one base) are counted only once. P-values shown are for a Poisson regression on either gender or age.
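The per-donor counting rule in (e-f) can be illustrated with a minimal Python sketch. It is not the authors' code: the input layout (chromosome, start, end with inclusive coordinates) and the function name are hypothetical, and the sketch assumes that chains of overlapping calls are merged into a single CNA, which the caption does not specify.

```python
from collections import defaultdict

def count_unique_cnas(cnas):
    """Count CNAs for one donor, counting overlapping calls only once.

    `cnas` holds (chrom, start, end) tuples with inclusive coordinates,
    pooled over all lines derived from the donor (hypothetical layout).
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in cnas:
        by_chrom[chrom].append((start, end))

    total = 0
    for intervals in by_chrom.values():
        intervals.sort()
        current_end = None
        for start, end in intervals:
            if current_end is None or start > current_end:
                # No shared base with the previous cluster: a new CNA.
                total += 1
                current_end = end
            else:
                # Shares at least one base: extend the current cluster.
                current_end = max(current_end, end)
    return total

donor_calls = [
    ("chr20", 100, 500),    # line A
    ("chr20", 500, 900),    # line B, shares base 500 with the call above
    ("chr20", 2000, 2500),  # line B, separate event
    ("chr1", 100, 500),     # line A, different chromosome
]
print(count_unique_cnas(donor_calls))  # 3
```

Sorting by start position means each call only needs to be compared with the running end of the current cluster, so the count takes one pass per chromosome.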
Extended Data Figure 4. Location and consequence of the recurrent CNA on chr20 (related to Fig. 2).
Top panel shows genomic location versus number of lines with CN three (grey) and with a CNA (black). Bottom panel shows the NAV gene score from ref and log2 gene expression fold change between the iPSC lines with CN two and three (color scale), in the region highlighted in red in the top panel. Highlighted genes are up-regulated when copy number increases, known onco/tumour-suppressor genes and/or genes with NAV score in the top 2%.
Extended Data Figure 5. Functional assessment of CNAs using growth assays.
Cell growth rate (a), protein expression (b), proliferation (c) and apoptosis (d) in cell lines with copy number two (“wild type”, blue dots) or copy number three (“mutant”, red dots) in a recurrently duplicated region in iPSCs on chromosome 1, 17 or 20. Plot titles show the donor name and the genomic coordinates of the CNA. (a) Shown are cell counts taken on successive days in culture, for pairs of lines (one mutant, one wild type) grown on the same 24-well plates. Star symbols denote significance levels for statistical interactions between day and copy number in a linear mixed model, using fixed effects to fit day and copy number, and random effects to account for culture plate effects. “EIF4A3” denotes whether a copy number variant overlaps one of the suspected candidate genes on chromosome 17. * - P < 0.05; ** - P < 0.01; *** - P < 0.001. (b) Protein expression level measured using Tandem Mass Tag (TMT)-based quantitation on the Q Exactive Plus (labelled “QE Plus”) and Fusion (labelled “Fusion”) Orbitrap MS platforms. (c) Estimated fraction of fluorescing nuclei following EdU assay in mutant and wild type lines, following exposure to mitomycin (“Treated”), or in a control sample (“Untreated”). (d) Estimated fraction of fluorescing nuclei following Terminal deoxynucleotidyl transferase dUTP nick end labelling assay (TUNEL) in mutant and wild type lines, following exposure to mitomycin (“Treated”), or in a control sample (“Untreated”). Solid trend lines are least squares regression fits. P-values in c and d denote the significance of statistical interactions between copy number and mitomycin treatment condition (“Treated” or “Untreated”).
Extended Data Figure 6. Effect of passage on Tier 1 and Tier 2 data and overview of iPSC cis eQTLs mapped with ‘Tier 1’ gene expression array data.
(a,b) Passage number versus PluriTest pluripotency and novelty scores shows no significant association between passage number and pluripotency. Trend lines shown are fit using linear regression of PluriTest scores on passage number (score P = 0.66, novelty P = 0.21). Association was also not deemed significant when including gender and media as fixed effects and batch variables and donor as random effects (score P = 0.3, novelty P = 0.14). (c) Passage number versus log10 RNA-seq expression of pluripotency factors Nanog and Pou5f1 (Oct4) shows no significant association between passage number and pluripotency. Trend lines are fit using linear regression of log10 expression on passage number (Nanog P = 0.5, Pou5f1 P = 0.15). Association was also not deemed significant when considering the two genes together and when including gender and media as fixed effects and batch variables and donor as random effects (passage P = 0.28, passage-gene interaction P = 0.96). (d,e) Variance component analysis for Tier 2 assays, showing that for the majority of genes gender and passage explained little of the total variance. (f,g) Comparison of eQTL effect sizes (squared beta) at lead variants of the main gexarray eQTL map (derived using mean expression levels per donor). Plotted are the effect sizes for all tested genes (FDR < 5% eGenes indicated in blue) derived from (f) iPSC line replicate sets 1 and 2, one per donor, drawn randomly (rho = 0.47 genome-wide, rho = 0.80, FDR < 5% eGenes, P < 2.2e-16; Spearman rank correlation) and (g) replicate set 1 and the main map (rho = 0.57 genome-wide, rho = 0.88, FDR < 5% eGenes, P < 2.2e-16). Panel (g) shows that the effect sizes obtained using the mean expression values per donor are higher than when using individual lines. (h) Pairwise correlation between gene expression levels in iPSCs measured with RNA-seq and gexarray. Plotted are the Spearman rank correlation coefficients of either gene (pink) or gexarray probe (blue) region based read counts, demonstrating higher correlation of probe-based counts.
Extended Data Figure 7. Properties of iPSC cis eQTLs in comparison to somatic eQTLs.
(a,b) Plotted is the power to detect eQTLs, comparing 44 somatic tissues from GTEx (V6p) and the HipSci RNA-seq-based eQTL map (purple triangle), considering either the absolute (a) or relative (b) number of eQTLs identified (eGenes, FDR < 5%). The major determinant of eQTL detection power is sample size. (c) Cumulative fraction of RNA-seq reads relative to the number of protein coding genes expressed. Plotted is the mean read count derived from 20 iPSC lines (10 donors, two lines each), five fibroblast lines, and two embryonic stem cell (ESC) lines. In iPSCs, half of the reads are explained by the expression of 1,071 genes, while 75% and 90% of the reads are explained by the expression of 3,159 and 5,814 genes, respectively (total protein coding genes with non-zero counts N = 17,332). (d) Distribution of iPSC eQTLs around the annotated gene start position. Plotted is the -log10 (eQTL P-value) against the distance (bp) from the gene start for lead eQTL variants genome-wide, highlighting significant eQTLs (FDR < 5%) in orange. (e) Comparison of the magnitude of eQTL effect size (absolute beta; left panel) and minor allele frequency (MAF; right panel) between iPSC-specific (N = 2,131; labelled as ‘S’) and non-specific eQTLs (N = 4,500; labelled as ‘NS’), demonstrating that overall, iPSC-specific eQTLs have smaller effects on the transcriptome than eQTLs shared among multiple tissues (P = 9.97 × 10⁻¹⁶¹; Wilcoxon test) and have a lower minor allele frequency (P = 1.08 × 10⁻³⁵; Wilcoxon test).
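The statistic in panel (c), the number of genes needed to account for a given share of reads, can be sketched as follows. This is an illustrative Python example with made-up counts, not the authors' pipeline; the function name and the choice to rank genes from most to least expressed are assumptions.

```python
def genes_needed(read_counts, fraction):
    """Smallest number of genes, taken from most to least expressed,
    whose reads add up to at least `fraction` of all reads."""
    total = sum(read_counts)
    running = 0
    for n, count in enumerate(sorted(read_counts, reverse=True), start=1):
        running += count
        if running >= fraction * total:
            return n
    return len(read_counts)

# Made-up mean read counts per gene, for illustration only.
toy_counts = [40, 30, 15, 8, 5, 2]
for frac in (0.5, 0.75, 0.9):
    print(frac, genes_needed(toy_counts, frac))
# 0.5 2
# 0.75 3
# 0.9 4
```

Applied to real per-gene mean read counts, the same calculation would yield figures of the kind quoted in the caption (for example, the number of genes explaining half of the reads).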
Extended Data Figure 8. Comparison of eQTL mapping pipelines between HipSci and GTEx (V6p).
(a) Proportion of tissue-specific eQTLs as a function of the discovery sample size. For iPSC, shown are the two sets of tissue-specific eQTLs obtained with the two different mapping pipelines (Methods), namely the standard HipSci pipeline (‘iPSC’; purple triangle) and the alternative ‘GTEx-like’ pipeline (‘iPSC2’; purple triangle). Points other than iPSC are from the GTEx Consortium (44 somatic tissues and cell lines). (b) Heatmap of pairwise π1 values (π1 = 1 - π0) between iPSCs and GTEx tissues, with rows representing the discovery tissue and columns the replication tissue. Clustering of tissues is based on Euclidean distance (R hclust, method=average). (c) Effect of eQTL replication threshold on the definition of tissue-specific effects. Shown is the replication profile of iPSC eQTLs across GTEx tissues relative to discovery sample size in each replication tissue. Plotted is the proportion of iPSC lead eQTLs that replicate in each tissue, with replication defined using two different replication thresholds (TH1: nominal eQTL P < 0.01/N_tissues; TH5: P < 0.05/N_tissues; plotted as dots and triangles, respectively). (d) Enrichment of alternative iPSC eQTLs (‘GTEx-like’) at promoter proximal and distal (defined as less than or greater than 2 Kb from the transcription start site) transcription factor binding sites (TFBS) in H1-hES cells from the ENCODE Project. Fold enrichments per factor are shown for iPSC-specific and non-specific eQTLs (minimum 10 observed overlaps) (Methods). Pluripotency-associated factors are indicated with an asterisk. The profile of enrichments is comparable to that obtained with the standard HipSci pipeline (Fig. 4d).
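The two replication thresholds in panel (c) amount to a Bonferroni-style cut-off on the nominal P-value. The short Python sketch below shows the arithmetic on invented P-values; the function name is hypothetical, and N_tissues = 44 is assumed here only because the caption mentions 44 GTEx tissues and cell lines.

```python
def replication_rate(p_values, n_tissues, alpha):
    """Fraction of lead eQTLs whose nominal P in the replication tissue
    falls below alpha / n_tissues."""
    threshold = alpha / n_tissues
    return sum(p < threshold for p in p_values) / len(p_values)

# Invented nominal P-values for four iPSC lead eQTLs in one tissue.
toy_p = [1e-6, 5e-4, 2e-3, 0.03]
N_TISSUES = 44  # assumed value, see lead-in

th1 = replication_rate(toy_p, N_TISSUES, alpha=0.01)  # TH1: P < 0.01/N_tissues
th5 = replication_rate(toy_p, N_TISSUES, alpha=0.05)  # TH5: P < 0.05/N_tissues
print(th1, th5)  # 0.25 0.5
```

The stricter threshold (TH1) classifies fewer eQTLs as replicated, so more of them would be labelled tissue-specific, which is the sensitivity the panel examines.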
Extended Data Figure 9. iPSC eQTLs and disease.
(a) Cumulative number of cancer genes (COSMIC cancer census 27/04/2016; Ngenes = 571; ref. 20) regulated by eQTLs in iPSCs, somatic tissues (GTEx V6p), and three different cancers (ER positive and negative breast cancer, colorectal cancer). (b) Enrichment of iPSC and somatic eQTLs (lead variants and their high-LD proxies) at disease-associated variants in the NHGRI-EBI GWAS catalogue (2016-04-10). Plotted is the fold enrichment of eQTLs over 100 random sets of matched variants for each tissue relative to eQTL discovery sample size. The tissues showing the highest fold enrichment are liver and brain (cerebellar hemisphere; ‘BrainCH’). (c) Somatic eQTL signal for PTPN2 (Protein Tyrosine Phosphatase, Non-Receptor Type 2) locus on chromosome 18. This locus contains a colocalising association signal for PTPN2 gene expression in iPSCs and five immunological disease phenotypes (Fig. 5a). (d) Somatic eQTL signal for TERT (Telomerase Reverse Transcriptase) locus on chromosome 5 (Fig. 5b). In both (c) and (d), the lead eQTL variant locations are indicated with red and orange vertical lines for iPSC and somatic tissues, respectively. The focal gene regions are indicated in solid grey and gene start positions of other protein-coding genes on the same strand with vertical grey lines.
Extended Data Figure 10. Tissue expression and alternative splicing results at the TERT locus.
(a,b) Normalised RNA-seq per-base coverage across the TERT locus stratified by rs10069690 genotype. Panel (a) shows the full locus, while (b) shows a zoomed view of the region around the lead eQTL and cancer risk variant rs10069690, indicated with a dotted line on each plot. Grey regions indicate annotated exons from Ensembl v75. Coverage was computed from indexed BAM files using the coverageBed function from bedtools (v2.25.0). Raw coverage was divided by total library size in millions (total number of mapped reads) per sample to obtain normalised coverage, which was then averaged over samples with the same rs10069690 genotype to obtain mean normalised coverage for each genotype group. (c) Profile of TERT expression in iPSCs and across somatic tissues from GTEx. Shown are gene FPKM values obtained with RNA-SeQC (GTEx V6p). (d) Splicing-QTL of TERT. We quantified TERT intron retention rates using Leafcutter (Li, 2016) and identified one alternative splicing event associated with rs10069690, the lead iPSC eQTL variant for TERT (Fig. 5b). Shown is TERT intron 4 retention ratio (PSI, percent spliced in) in iPSC lines of all individual donors stratified by their genotype at rs10069690. This variant affects the splicing of the intron where it is located, with the minor allele (T) increasing the fraction of TERT transcripts in which intron 4 is retained (P = 1.7 × 10⁻⁹, Bonferroni adjusted linear regression).
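The coverage normalisation described for panels (a,b) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' script: per-base coverage is assumed to be already extracted into lists of equal length, and the dictionary keys, genotype labels and numbers are hypothetical.

```python
from collections import defaultdict

def mean_normalised_coverage(samples):
    """Divide each sample's per-base coverage by its library size in
    millions of mapped reads, then average within each genotype group."""
    sums = {}
    counts = defaultdict(int)
    for sample in samples:
        scale = sample["mapped_reads"] / 1e6
        normalised = [depth / scale for depth in sample["coverage"]]
        genotype = sample["genotype"]
        if genotype in sums:
            sums[genotype] = [a + b for a, b in zip(sums[genotype], normalised)]
        else:
            sums[genotype] = normalised
        counts[genotype] += 1
    return {g: [v / counts[g] for v in vals] for g, vals in sums.items()}

# Toy input: two positions per sample, illustrative genotype labels.
samples = [
    {"genotype": "CC", "mapped_reads": 20_000_000, "coverage": [40, 60]},
    {"genotype": "CC", "mapped_reads": 10_000_000, "coverage": [10, 50]},
    {"genotype": "CT", "mapped_reads": 40_000_000, "coverage": [80, 120]},
]
result = mean_normalised_coverage(samples)
print(result)  # {'CC': [1.5, 4.0], 'CT': [2.0, 3.0]}
```

Scaling by library size before averaging keeps deeply sequenced samples from dominating the genotype-group mean.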
Figure 1. iPSC line generation and quality control.
Throughout, light blue = not selected, dark blue = selected lines. (a) hDF: human dermal fibroblasts; dEN: differentiated endoderm; dME: differentiated mesoderm; dEC: differentiated neuroectoderm. The x-axis shows the median number of days, including freeze/thaw cycles (snowflakes), at each pipeline stage, with stage-specific success rates. (b) PluriTest pluripotency versus novelty score. (c,d) Percentage of cells expressing pluripotency and differentiation markers. (e) Cumulative distribution of number of CNAs, with the fraction of trisomies per chromosome shown in the inset. (f) Relationship between CNA counts and line passage number.
Figure 2. Locations and consequences of recurrent CNA regions.
(a) Genomic locations of CNAs. Colours denote the significance level of recurrence. (b) Genes differentially expressed between lines with CN 2 and 3 for the recurrent chr17 CNA. Horizontal bar denotes 1% FDR threshold (Benjamini-Hochberg). (c) Top panel shows genomic location versus number of lines with CN 3 (grey) and with a CNA (black). Bottom panel shows the NAV gene score from ref and log2 gene expression fold change between the iPSC lines with CN 2 and 3 (color scale), in the region highlighted in red in the top panel. Highlighted genes are up-regulated when copy number increases, known onco/tumour-suppressor genes and/or genes with NAV score in the top 2%.
Figure 3. Variance component analysis of HipSci assays.
(a-c) Partitioning of variance in genomic and proteomic assays (a), differentiation and pluripotency markers (b) and cell morphology (c). Panels show total variance (left) and proportion of variance explained by donor, accounting for technical covariates (right), with numbers of lines and donors in parentheses. For genomic assays, genes are divided into low (L), medium (M) and high (H) expression. (d) Partitioning of variance in microarray gene expression into donor, media, CNA, gender or passage number at the time of the expression assay. Left: the distribution of variance components. Middle: the number of genes where each factor explains the most variance. Right: mean expression of genes with most variance explained by a factor. (e) Donor variance component versus expression array eQTL effect sizes. Numbers denote the number of array probes in each bin.
Figure 4. Comparison of iPSC and somatic tissue eQTLs.
(a) Proportion of tissue-specific eQTLs in iPSCs and 44 GTEx tissues. (b) Most likely source of tissue-specific eQTLs in iPSCs (lead and secondary), testis and somatic tissues in GTEx (averaged; including cell lines, excluding testis). Breakdown: gene not expressed (red); gene expressed but no eQTL (blue); eQTL effect is driven by distinct lead variants (r2 < 0.8; green). (c) Heatmap of the fold enrichment (FE) difference between iPSC-specific and non-specific eQTLs at chromatin states from the Roadmap Epigenomics Project, shown for five aggregated clusters representing 127 cell types (SOM, somatic; PSCd, PSC-derived). Colouring: enriched for iPSC-specific eQTLs (blue), enriched for non-specific eQTLs (red). (d) Enrichment of iPSC eQTLs at promoter proximal and distal transcription factor binding sites in H1-hES cells from the ENCODE Project. Fold enrichments per factor are shown for iPSC-specific and non-specific eQTLs. Pluripotency-associated factors are indicated with an asterisk.
Figure 5. iPSC eQTLs tag disease-associated variation.
(a) Colocalised association signal for iPSC expression of PTPN2 (top) and five common diseases (bottom; inflammatory bowel disease, IBD; rheumatoid arthritis, RA; Crohn’s disease, CD; celiac disease, CEL; and type 1 diabetes, T1D). PP4 is the posterior probability that the disease and gene expression associations are driven by the same causal variant. (b) An iPSC-specific eQTL for TERT (rs10069690) that is associated with risk for breast, ovarian and other cancers. The lead variant is indicated with a red triangle, the focal gene region in solid grey, and other protein-coding gene start positions by vertical grey lines.

Cited by 99 articles

  • Using human pluripotent stem cell models to study autism in the era of big data.
    Nehme R, Barrett LE. Nehme R, et al. Mol Autism. 2020 Mar 23;11(1):21. doi: 10.1186/s13229-020-00322-9. Mol Autism. 2020. PMID: 32293529 Free PMC article. Review.
  • Contribution of unfixed transposable element insertions to human regulatory variation.
    Goubert C, Zevallos NA, Feschotte C. Goubert C, et al. Philos Trans R Soc Lond B Biol Sci. 2020 Mar 30;375(1795):20190331. doi: 10.1098/rstb.2019.0331. Epub 2020 Feb 10. Philos Trans R Soc Lond B Biol Sci. 2020. PMID: 32075552
  • Single-cell RNA-sequencing of differentiating iPS cells reveals dynamic genetic effects on gene expression.
    Cuomo ASE, Seaton DD, McCarthy DJ, Martinez I, Bonder MJ, Garcia-Bernardo J, Amatya S, Madrigal P, Isaacson A, Buettner F, Knights A, Natarajan KN; HipSci Consortium, Vallier L, Marioni JC, Chhatriwala M, Stegle O. Cuomo ASE, et al. Nat Commun. 2020 Feb 10;11(1):810. doi: 10.1038/s41467-020-14457-z. Nat Commun. 2020. PMID: 32041960 Free PMC article.
  • Genomic basis for RNA alterations in cancer.
    PCAWG Transcriptome Core Group, Calabrese C, Davidson NR, Demircioğlu D, Fonseca NA, He Y, Kahles A, Lehmann KV, Liu F, Shiraishi Y, Soulette CM, Urban L, Greger L, Li S, Liu D, Perry MD, Xiang Q, Zhang F, Zhang J, Bailey P, Erkek S, Hoadley KA, Hou Y, Huska MR, Kilpinen H, Korbel JO, Marin MG, Markowski J, Nandi T, Pan-Hammarström Q, Pedamallu CS, Siebert R, Stark SG, Su H, Tan P, Waszak SM, Yung C, Zhu S, Awadalla P, Creighton CJ, Meyerson M, Ouellette BFF, Wu K, Yang H; PCAWG Transcriptome Working Group, Brazma A, Brooks AN, Göke J, Rätsch G, Schwarz RF, Stegle O, Zhang Z; PCAWG Consortium. PCAWG Transcriptome Core Group, et al. Nature. 2020 Feb;578(7793):129-136. doi: 10.1038/s41586-020-1970-0. Epub 2020 Feb 5. Nature. 2020. PMID: 32025019 Free PMC article.
  • Integrating CRISPR Engineering and hiPSC-Derived 2D Disease Modeling Systems.
    Rehbach K, Fernando MB, Brennand KJ. Rehbach K, et al. J Neurosci. 2020 Feb 5;40(6):1176-1185. doi: 10.1523/JNEUROSCI.0518-19.2019. J Neurosci. 2020. PMID: 32024766


    1. Sterneckert JL, Reinhardt P, Scholer HR. Investigating human disease using stem cell models. Nat Rev Genet. 2014;15:625–639. doi: 10.1038/nrg3764.
    2. Kim K, et al. Epigenetic memory in induced pluripotent stem cells. Nature. 2010;467:285–290. doi: 10.1038/nature09342.
    3. Kim K, et al. Donor cell type can influence the epigenome and differentiation potential of human induced pluripotent stem cells. Nat Biotechnol. 2011;29:1117–1119. doi: 10.1038/nbt.2052.
    4. Lister R, et al. Hotspots of aberrant epigenomic reprogramming in human induced pluripotent stem cells. Nature. 2011;471:68–73. doi: 10.1038/nature09798.
    5. Nazor KL, et al. Recurrent variations in DNA methylation in human pluripotent stem cells and their differentiated derivatives. Cell Stem Cell. 2012;10:620–634. doi: 10.1016/j.stem.2012.02.013.