BMC Genomics. 2016 Aug 30;17(1):695.
doi: 10.1186/s12864-016-2871-3.

Genotype Distribution-Based Inference of Collective Effects in Genome-Wide Association Studies: Insights to Age-Related Macular Degeneration Disease Mechanism

Free PMC article

Hyung Jun Woo et al. BMC Genomics. 2016.
Erratum in

Abstract

Background: Genome-wide association studies provide important insights into the genetic component of disease risks. However, an existing challenge is how to incorporate collective effects of interactions beyond the level of independent single nucleotide polymorphism (SNP) tests. While methods considering each SNP pair separately have provided insights, a large portion of expected heritability may reside in higher-order interaction effects.

Results: We describe an inference approach (discrete discriminant analysis; DDA) designed to probe collective interactions while treating both genotypes and phenotypes as random variables. The genotype distributions in case and control groups are modeled separately based on empirical allele frequency and covariance data, whose differences yield disease risk parameters. We compared pairwise tests and collective inference methods, the latter based both on DDA and logistic regression. Analyses using simulated data demonstrated that significantly higher sensitivity and specificity can be achieved with collective inference in comparison to pairwise tests, and with DDA in comparison to logistic regression. Using age-related macular degeneration (AMD) data, we demonstrated two possible applications of DDA. In the first application, a genome-wide SNP set is reduced into a small number (∼100) of variants via filtering and SNP pairs with significant interactions are identified. We found that interactions between SNPs with highest AMD association were epigenetically active in the liver, adipocytes, and mesenchymal stem cells. In the other application, multiple groups of SNPs were formed from the genome-wide data and their relative strengths of association were compared using cross-validation. This analysis allowed us to discover novel collections of loci for which interactions between SNPs play significant roles in their disease association. In particular, we considered pathway-based groups of SNPs containing up to ∼10,000 variants in each group. In addition to pathways related to complement activation, our collective inference pointed to pathway groups involved in phospholipid synthesis, oxidative stress, and apoptosis, consistent with the AMD pathogenesis mechanism where the dysfunction of retinal pigment epithelium cells plays central roles.

Conclusions: The simultaneous inference of collective interaction effects within a set of SNPs has the potential to reveal novel aspects of disease association.

Keywords: Age-related macular degeneration; Epistasis; Genome-wide association; Machine learning; Single-nucleotide polymorphism.
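
The Results paragraph above describes fitting the case and control genotype distributions separately from allele frequencies and covariances, then taking parameter differences as disease risk parameters. A minimal sketch of that idea follows, assuming the dominant model (genotypes coded 0/1) and a textbook naive mean-field inversion. The function names, the pseudocount smoothing, and the shrinkage regularizer `eps` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def fit_mean_field(genotypes, eps=0.5):
    """Fit h (single-site) and J (pairwise) parameters for one group.

    genotypes: (n_samples, m) array of 0/1 values (dominant model), m >= 2.
    eps: illustrative regularizer in [0, 1); eps = 0 drops all
         correlations and recovers the independent-SNP limit.
    """
    x = np.asarray(genotypes, dtype=float)
    n, m = x.shape
    # Pseudocount-smoothed allele frequencies keep the log-odds finite.
    f = (x.sum(axis=0) + 0.5) / (n + 1.0)
    # Empirical covariance with off-diagonal entries shrunk by eps.
    cov = eps * np.cov(x, rowvar=False, bias=True)
    cov[np.diag_indices(m)] = f * (1.0 - f)
    # Naive mean-field inversion: J_ij = -(C^-1)_ij for i != j.
    J = -np.linalg.inv(cov)
    np.fill_diagonal(J, 0.0)
    # Mean-field self-consistency: logit(f_i) = h_i + sum_j J_ij f_j.
    h = np.log(f / (1.0 - f)) - J @ f
    return h, J

def risk_parameters(cases, controls, eps=0.5):
    """Disease risk parameters as case-minus-control differences."""
    h1, J1 = fit_mean_field(cases, eps)
    h0, J0 = fit_mean_field(controls, eps)
    return h1 - h0, J1 - J0
```

Calling `risk_parameters(cases, controls)` on two 0/1 genotype matrices returns a vector of single-SNP differences and a symmetric matrix of interaction differences. In practice `eps` would be chosen by cross-validation rather than fixed.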

Figures

Fig. 1
Discrete discriminant analysis algorithm. Empirical characteristics (allele frequency and correlation) of case (y=1) and control (y=0) data are used to fit their genotype distributions with parameters h_i(y) and J_ij(y), which roughly determine the position and width of the distribution, respectively. Disease risk parameters are given by their differences, whereas the likelihood ratio (LR) statistic q is obtained from the difference between the sum of the two contributions and the corresponding pooled value.
Fig. 2
Inference accuracy, sensitivity, and specificity of pairwise and collective inference on simulated data. a-b Mean square error and AUC versus sample size using the pairwise test (PW), logistic regression (LR), and the three methods of DDA (MF, PL, and EE). Simulated genotypes were generated for 10 SNPs with parameters h̄_y = (1, 0.3), J̄ = (0, 0.1), σ_h = σ_J = 0.2 (see Methods). c-d Analogous results for 20 SNPs with h̄_y = (1, 1+Δh), J̄ = (0, ΔJ), and σ_h = σ_J = 0.2. We set Δh = 0.7, ΔJ = 0.5 for the first 4 SNPs and their interactions and Δh = ΔJ = 0 otherwise. e-f Sensitivity and specificity of disease-associated interaction pairs. Simulated data were generated with parameters h̄ = (1, 1), J̄ = (0.01, 0.01), σ_h = 0.1, σ_J = 0.05 for m = 10 SNPs, except the interaction between the first two SNPs, for which we set J̄ = (0.01, 0.11). Interaction p-values for all pairs were calculated either by PW or by regularization to determine λ followed by the construction of the null distribution under λ (Additional file 5: Figure S4) for LR, PL, and EE. The distributions of p-values for the true causal interaction pair and for the non-causal pairs (geometric mean) are shown in e and f, respectively. The dominant model was used in all cases.
Fig. 3
Collective inference applied to pre-selected m = 20 AMD SNPs. a-b Regularization via cross-validation. Dominant (DOM) and genotypic (GEN) models were used with logistic regression (LR), DDA PL (a), and MF (b). The independent-SNP limit is reached with λ, ε → 0. Because of the pre-selection of SNPs using phenotype information, the prediction score (pseudo-AUC; pAUC) derived from 5-fold cross-validation over-estimates the true AUC. The maxima in pAUC correspond to optimal regularization. c-d Single-SNP and interaction p-values of the optimized (genotypic) model under PL (λ = 0.01). The p-values from independent-SNP and pairwise tests are also shown for comparison in c and d, respectively. See Additional file 1: Table S1 for the independent-SNP results and SNP list.
Fig. 4
Quantile-quantile plot of interaction p-values. a Distributions for interactions among m = 20 SNPs randomly selected from genome-wide data. b Distributions for interactions among 20 SNPs with high association (Fig. 3d) and a larger set (m = 96; Fig. 6). See Additional file 8: Figure S7 for pairwise (PW) results for m = 96.
Fig. 5
Collective inference with SNP selection based on independent-SNP p-values. a AUC with varying penalizer λ under PL inference, where the indicated independent-SNP p-value cutoff p_c was used to filter SNPs from the full genome-wide set in each cross-validation run. The mean SNP number m̄ is the average over 5 runs. b AUC optimized over regularization (MF) with varying model sizes controlled by p_c. SNP selections were made from the full genome-wide data (r² < 1.0) and from subsets generated by pruning based on the LD thresholds indicated. Note that the maximum AUC position shifts to lower m̄ with increasing degree of pruning (fewer SNPs with LD needed to account for association) and that an optimal level of pruning (r² < 0.5) exists for highest performance. Vertical lines are 95% CI.
Fig. 6
Interaction and single-site p-values for m = 96 AMD SNPs. The bars (bottom) and the heat map (top) show the single-SNP and interaction p-values, respectively. Hollow and solid bars represent the independent-SNP and collective inference p-values, respectively. DDA PL was used for collective inference.
Fig. 7
Enrichment p-values of active epigenetic states among AMD-associated SNPs. The set of 96 SNPs in Fig. 6 was used. The reference epigenome labels are as defined in Fig. 2 of Ref. [50]. ES, embryonic stem cell; ES-deriv., ES cell-derived; HSC, hematopoietic stem cell; iPS, induced pluripotent stem cell; MSC, mesenchymal stem cell; Neurosph., neurosphere.
Fig. 8
Enrichment p-values of active epigenetic state pairs among AMD-associated SNP interactions. The SNP pairs with interaction p-value < 10⁻³ in Fig. 6 were tested for enrichment within each pair of reference epigenomes.
Fig. 9
AMD association of pathways under collective inference. a AUC score versus pathway size (number of SNPs in each pathway). Symbols show collective and independent-SNP inference AUCs under 5-fold cross-validation. Vertical lines are 95% CI. The horizontal line represents the Bonferroni-corrected nominal discovery threshold based on the p-value estimates. b Regression of AUC versus pathway p-values. The latter were obtained for a selection of pathways via phenotype-label reshuffling using AUC as the statistic. The dotted line is the linear fit for AUC > 0.52. c-e Pathways with association strength AUC > 0.55, grouped according to the top hierarchical classes they belong to. We excluded pathways in the Disease class. Dendrograms below the bars show their hierarchical relationships. Abl, Abl tyrosine kinase; activ., activation; assoc., association/associated; biosynth., biosynthesis; C3, complement component 3; C5, complement component 5; CCT, chaperonin-containing T-complex polypeptide 1; cell., cellular; ChREBP, carbohydrate response element-binding protein; ECM, extracellular matrix; EHMT2, euchromatic histone-lysine-methyltransferase 2; elong., elongation; ER, endoplasmic reticulum; ERCC6, excision repair cross-complementation group 6; expr., expression; form., formation; HSF, heat shock factor; IFN, interferon; indep., independent; Lys, lysine; MDA5, melanoma differentiation-associated gene 5; med., mediated; metab., metabolism; MYD88, myeloid differentiation primary response 88; NFKB, nuclear factor-κB; PA, phosphatidic acid; PKMT, protein lysine methyltransferase; pol, polymerase; proc., processing; reg., regulate/regulation/regulated; RIG-I, retinoic acid-inducible gene-I; RIP, receptor-interaction protein; Robo, roundabout; SASP, senescence-associated secretory phenotype; sig., signaling; SMAC, second mitochondrial activator of caspases; synth., synthesis; sys., system; thru, through; TP53, tumor protein p53; transc., transcription/transcriptional; TriC, T-complex polypeptide 1 ring complex; ZBP1, Z-DNA-binding protein-1.
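
The Fig. 9 caption mentions pathway p-values obtained by phenotype-label reshuffling with AUC as the statistic, compared against a Bonferroni-corrected threshold. The sketch below illustrates that kind of test under simplifying assumptions: the per-sample prediction scores are treated as fixed, whereas a faithful version would presumably refit the model and repeat cross-validation inside each reshuffle. All function names and defaults are hypothetical.

```python
import numpy as np

def auc(scores, labels):
    """Rank-based AUC: probability that a case outscores a control."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    # Ties between a case and a control count as one half.
    return float(np.mean(diff > 0) + 0.5 * np.mean(diff == 0))

def reshuffle_p_value(scores, labels, n_perm=999, seed=0):
    """One-sided p-value from reshuffling phenotype labels."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    observed = auc(scores, labels)
    hits = 0
    for _ in range(n_perm):
        # Shuffled labels break any genotype-phenotype association.
        if auc(scores, rng.permutation(labels)) >= observed:
            hits += 1
    # The +1 terms keep the estimate strictly positive.
    return (1 + hits) / (1 + n_perm)

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level for n_tests simultaneous tests."""
    return alpha / n_tests
```

A pathway would then be flagged when `reshuffle_p_value(...)` falls below `bonferroni_threshold(alpha, n_pathways)`. The smallest attainable p-value is 1/(n_perm + 1), so `n_perm` has to be large enough for that threshold to be reachable.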

References

    1. Kim YA, Wuchty S, Przytycka TM. Identifying causal genes and dysregulated pathways in complex diseases. PLoS Comput Biol. 2011;7:e1001095. doi: 10.1371/journal.pcbi.1001095.
    2. Haines JL, Hauser MA, Schmidt S, Scott WK, Olson LM, Gallins P, et al. Complement factor H variant increases the risk of age-related macular degeneration. Science. 2005;308:419–21. doi: 10.1126/science.1110359.
    3. Edwards AO, Ritter R, Abel KJ, Manning A, Panhuysen C, Farrer LA. Complement factor H polymorphism and age-related macular degeneration. Science. 2005;308:421–4. doi: 10.1126/science.1110189.
    4. Wang WY, Barratt BJ, Clayton DG, Todd JA. Genome-wide association studies: theoretical and practical concerns. Nat Rev Genet. 2005;6:109–18. doi: 10.1038/nrg1522.
    5. The Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447:661–78. doi: 10.1038/nature05911.
