, 7, 34

On the Identification of Potential Regulatory Variants Within Genome Wide Association Candidate SNP Sets

Chih-Yu Chen et al. BMC Med Genomics.

Abstract

Background: Genome wide association studies (GWAS) are a population-scale approach to the identification of segments of the genome in which genetic variations may contribute to disease risk. Current methods focus on the discovery of single nucleotide polymorphisms (SNPs) associated with disease traits. As there are many SNPs within identified risk loci, and the majority of these are situated within non-coding regions, a key challenge is to identify and prioritize variants affecting regulatory sequences that are likely to contribute to the phenotype assessed.

Methods: We investigated SNPs within lung and breast cancer GWAS loci that reached genome-wide significance, assessing their potential roles in gene regulation with a specific focus on SNPs likely to disrupt transcription factor binding sites. Within risk loci, the regulatory potential of sub-regions was classified using relevant open chromatin and epigenetic high-throughput sequencing data sets from the ENCODE project in available cancer and normal cell lines. Furthermore, variants that alter transcription factor affinity were predicted by comparing position weight matrix scores between disease and reference alleles. Lastly, ChIP-seq data for transcription associated factors were included as binding evidence, and topological domains were used to infer potential gene targets.
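
To make the allele comparison concrete, the following is a minimal sketch of scoring a position weight matrix against the reference and disease alleles of a SNP. It is an illustration under stated assumptions, not the authors' pipeline: the toy count matrix, the pseudocount, the uniform background and all function names are hypothetical, only the forward strand is scanned, and the empirical p-values described in the paper are not computed here.

```python
import math

BASES = "ACGT"

def to_log_odds(counts, background=0.25, pseudocount=1.0):
    """Convert a position count matrix (rows = positions, columns = A, C, G, T)
    into log2-odds scores against a uniform background (an assumption)."""
    pwm = []
    for row in counts:
        total = sum(row) + pseudocount
        pwm.append([math.log2(((c + pseudocount / 4) / total) / background)
                    for c in row])
    return pwm

def score_window(pwm, window):
    """Sum the per-position scores of one sequence window."""
    return sum(pwm[i][BASES.index(base)] for i, base in enumerate(window))

def best_score_over_snp(pwm, left, allele, right):
    """Best score over all forward-strand windows that cover the SNP.
    Assumes the flanks are long enough to fit at least one window."""
    width = len(pwm)
    seq = left + allele + right
    snp = len(left)
    first = max(0, snp - width + 1)
    last = min(snp, len(seq) - width)
    return max(score_window(pwm, seq[s:s + width])
               for s in range(first, last + 1))

def allelic_score_change(pwm, left, ref, alt, right):
    """Positive values mean the alternate allele scores higher than the reference."""
    return (best_score_over_snp(pwm, left, alt, right)
            - best_score_over_snp(pwm, left, ref, right))

# Hypothetical toy motif with consensus CACG; not a real transcription factor matrix.
TOY_COUNTS = [[0, 10, 0, 0], [10, 0, 0, 0], [0, 10, 0, 0], [0, 0, 10, 0]]
pwm = to_log_odds(TOY_COUNTS)

# The alternate allele A completes the CACG consensus, so the score increases.
delta = allelic_score_change(pwm, "TTC", "T", "A", "CGTT")
print(delta > 0)
```

Taking the best window per allele, rather than scoring a single fixed window, allows the predicted site to shift between alleles, which is one plausible way to handle cases where the two alleles favor different binding positions.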

Results: The sets of SNPs, including both the disease-associated markers and those in high linkage disequilibrium with them, were significantly over-represented in regulatory sequences of cancer and/or normal cells; however, over-representation was generally not restricted to disease-relevant tissue-specific regions. The calculated regulatory potential, allelic binding affinity scores and ChIP-seq binding evidence were the three criteria used to prioritize candidates. We highlighted breast cancer susceptibility SNPs and a borderline lung cancer relevant SNP that fit all three criteria; these SNPs are located in cancer-specific enhancers overlapping multiple distinct transcription associated factor ChIP-seq binding sites.

Conclusion: Incorporating high throughput sequencing epigenetic and transcription factor data sets from both cancer and normal cells into cancer genetic studies reveals potential functional SNPs and informs subsequent characterization efforts.

Figures

Figure 1
Overview of regulatory variant discovery workflow. The analysis workflow takes as input a list of SNPs identified in genome wide association studies, diverse high-throughput sequencing data related to the delineation of cis-regulatory sequences, and position weight matrices (PWMs). The input SNP lists are extended to SNPs in high linkage disequilibrium (LD). Functionality of each SNP is evaluated through the three criteria (regulatory potential, TF binding affinity and binding evidence). The output is a set of candidate variants that display characteristics consistent with a cis-regulatory role in the disease process.
Figure 2
Distributions of genomic functional categories of cancer GWAS LD80 SNP sets. For each SNP in the corresponding GWAS LD80 SNP set, the genomic functional category was determined from genomic annotation, and the overall proportions are shown in the plot. Categories include coding, 5′ untranslated and 3′ untranslated portions of exons, as well as intronic, intergenic, and upstream or downstream proximal regions (within 10 kb of the TSSs or TTSs). The distribution for the Illumina 660K SNP array is presented as a background. Numbers above the chart show the total SNP count of each LD80 SNP set.
Figure 3
Heatmap illustration of enrichment of LD80 SNPs in regulatory sequences. The figure displays the significance of enrichment in regulatory sequences for GWAS SNPs extended to SNPs with r² ≥ 0.80 (LD80). The evaluated LD80 SNP sets are indicated across the horizontal axis. The y-axis indicates the cells of origin and the feature data sets that reflect regulatory sequences (all from the ENCODE consortium). Vertical and horizontal side bars are colored according to tissue type and whether the data come from a cancer or normal cell line. Enrichment was tested by comparing the observed overlap count of each SNP set with each feature data set against the distribution of overlap counts from 1000 randomly selected SNP sets matched for minor allele frequency, GC content (±500 bp) and distance to the nearest TSS. Multiple hypothesis-adjusted q-values were computed. The enrichment of each SNP list within each feature is colored by a transformed q-value: -log10(q-value + 0.0001). Highly enriched feature and SNP list pairs are colored in yellow, and non-enriched pairs are colored in red.
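
The enrichment statistic and color transform in this caption can be sketched as follows. This is a hedged illustration only: the caption does not name the multiple-testing procedure, so Benjamini-Hochberg is used here as an assumption, the add-one empirical p-value is one common convention rather than the confirmed one, and the matched sampling of background SNP sets is omitted. All function names are hypothetical.

```python
import math

def empirical_p(observed, null_counts):
    """One-sided empirical p-value: the fraction of random SNP sets whose
    overlap count is at least the observed count (add-one convention)."""
    exceed = sum(1 for c in null_counts if c >= observed)
    return (exceed + 1) / (len(null_counts) + 1)

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (an assumed choice of procedure)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def heat_value(q, offset=1e-4):
    """Transform from the caption: -log10(q + 0.0001)."""
    return -math.log10(q + offset)

# Hypothetical example: 1000 random sets, only one of which reaches the observed overlap.
p = empirical_p(10, [1] * 999 + [10])
q_values = benjamini_hochberg([0.01, 0.04, 0.03])
top_of_scale = heat_value(0.0)
```

The 0.0001 offset bounds the transformed value at 4, so a q-value of zero does not produce an infinite color value.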
Figure 4
Differences in regulatory potential and allelic TF binding affinity for Lung.cancer and Breast.cancer LD80 SNPs. The plots present potentially affected TFBS, with the upper panels (A & C) displaying SNPs whose minor allele confers stronger TFBS patterns in cancer patients, and the lower panels (B & D) displaying SNPs with a decrease in TF binding affinity. The x-axis represents the relative regulatory potential, defined as the log2 ratio of the regulatory potential index between cancer and normal cells plus 1. The relative regulatory potential is positive for higher regulatory potential in cancer cells (A549 for A and B; MCF-7 for C and D) and negative for higher regulatory potential in the corresponding normal cells (NHLF normal lung fibroblasts for A and B; HMEC normal breast cells for C and D). The y-axis shows the -log2 transformation of empirical p-values for motif affinity score changes. The data shown are restricted to PWMs with p-values < 0.05 from the two-tailed test, and for visualization purposes, only PWMs with scores > 85 in at least one allele are shown. TFs with an increase or decrease in binding affinity, where the SNP has non-zero regulatory potential in either cancer or normal cells, are labeled along with the corresponding SNP. SNPs with a regulatory potential index of zero in both cells are represented by gray dots, whereas those with regulatory potential indices > 0 in both cells are colored in blue. SNPs with a non-zero regulatory potential index only in cancer cells or only in normal cells are colored in red and green, respectively. In plot C, a red arrow indicates SNP rs1391720, which is discussed in the text. The vertical bar illustrates the degree of difference in TF affinity.
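
The x-axis value and the dot colors described in this caption can be sketched as below. The caption's wording ("log2 ratio ... plus 1") is ambiguous, so this sketch assumes that 1 is added to each index before taking the ratio, which keeps the value defined when an index is zero; that reading is an assumption, and the function names are hypothetical.

```python
import math

def relative_regulatory_potential(rp_cancer, rp_normal):
    """Assumed reading of the caption: log2((cancer + 1) / (normal + 1)).
    Positive means higher regulatory potential in the cancer cell line."""
    return math.log2((rp_cancer + 1) / (rp_normal + 1))

def dot_color(rp_cancer, rp_normal):
    """Color scheme from the caption, keyed on which cells show non-zero potential."""
    if rp_cancer == 0 and rp_normal == 0:
        return "gray"
    if rp_cancer > 0 and rp_normal > 0:
        return "blue"
    return "red" if rp_cancer > 0 else "green"

def y_axis_value(p_value):
    """-log2 transform of the empirical p-value, as on the y-axis."""
    return -math.log2(p_value)

# Hypothetical index values for illustration only.
cancer_only = relative_regulatory_potential(3, 0)
normal_only = relative_regulatory_potential(0, 3)
```

Under this reading, a SNP with equal indices in both cell types sits at zero on the x-axis, and swapping the two cell types flips the sign.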
Figure 5
Visualizing Lung.cancer and Breast.cancer LD80 SNPs with TAF ChIP-seq binding data. The relative regulatory potential is plotted along the x-axis, as in Figure 4. The y-axis displays the number of TAF ChIP-seq data sets reporting binding across the cells examined in this study: A549, H1 embryonic stem, HCT-116 and MCF-7 cells. Each dot represents a SNP within the Lung.cancer (A) and Breast.cancer (B) LD80 lists. SNPs with a regulatory potential index of zero in both cells are represented by gray dots, whereas those with regulatory potential in both cancer and normal cells are labeled and colored in blue. SNPs with regulatory potential observed only in cancer cells or only in normal cells are colored in red and green, respectively. The red arrows in B highlight a set of correlated SNPs, rs1391720, rs1391721 and rs1292011, that overlap 15 to 16 TAF ChIP-seq peaks. The ChIP-seq data sets used are detailed in the supplementary information (Additional file 2).
Figure 6
Annotation features proximal to the rs12087869 SNP location from the Lung.Meta case study. Part (A) depicts annotation related to genetics, epigenetics, and TAF ChIP-seq peaks in proximity to the rs12087869 SNP in A549 lung cancer, NHLF normal and H1 embryonic stem cell lines, using the UCSC Genome Browser. The red vertical line highlights the location of the SNP. From the top of the figure, the genetic information includes the locations of the SNP and proximal genes, and copy number status in A549 cells. The chromatin information shows the DNase I hypersensitive sites and occupancy sites of active histone modification marks (H3K4me1, H3K4me3, H3K27ac) in the cell lines. The ChIP-seq section shows the TAF-associated regions in A549 cells where data are available. Peaks in the chromatin information and ChIP-seq sections were reported by the ENCODE project, with the gray scale reflecting the magnitude of open chromatin and binding. Part (B) illustrates both strands of the reference sequence within 15 base pairs of rs12087869, and the locations of predicted TF binding sites for the reference and risk alleles in solid and dotted lines, respectively. Motif logos for TLX1::NFIC, MAX and Myc, all of which show increased binding affinity for the rs12087869 risk allele, are also depicted. The variant within each binding sequence below each logo is underlined. The predicted Myc binding locations differ between the reference and risk alleles, whereas those of TLX1::NFIC and MAX are the same.
Figure 7
Two-dimensional heatmap of chromatin interaction in the neighbourhood of the rs12087869 SNP. The figure shows Hi-C chromatin interaction data sets in H1 human ES cells (upper panel) and IMR90 fibroblast cells (lower panel), obtained from Dixon et al.[31], in the neighbourhood of the rs12087869 SNP. The topological domains (TADs) from both cell types are shown to indicate genomic neighbourhoods of stronger within-domain interactions. The heatmap values, indicated by a color scale, correspond to the number of times that reads in two 20 kb bins were sequenced as a pair, with red indicating stronger interaction and white indicating little or no interaction. The 85th percentile read counts (29 for H1 and 21 for IMR90 cells) were used as the upper limit of the heatmap to prevent extremely interactive regions from dominating the color scale. This plot was generated using the ‘HiTC’ R package, and the dotted lines were drawn to aid in visualizing the interactive domain in which the SNP is located. The TAD region (from H1 cells) containing the SNP is highlighted in a light pink box.
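
The percentile cap described in this caption can be illustrated with a short sketch. The figure itself was produced with the HiTC R package; the Python below is a stand-in for illustration, not that package's code. The nearest-rank percentile definition, the choice to take the percentile over all bins, and the toy matrix are assumptions.

```python
import math

def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile (one of several definitions; an assumption)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def cap_contact_matrix(matrix, pct=85):
    """Clip every bin-pair read count at the given percentile of all counts,
    so a few very strong contacts do not saturate the color scale."""
    flat = [count for row in matrix for count in row]
    cap = percentile_nearest_rank(flat, pct)
    capped = [[min(count, cap) for count in row] for row in matrix]
    return capped, cap

# Hypothetical read counts with one extreme bin pair.
toy = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 100]]
capped, cap = cap_contact_matrix(toy)
```

After capping, the extreme bin pair shares the top color with every count at or above the cap, while counts below the cap keep their relative ordering.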
