PLoS Comput Biol. 2014 Feb 27;10(2):e1003420.
doi: 10.1371/journal.pcbi.1003420. eCollection 2014 Feb.

Learning gene networks under SNP perturbations using eQTL datasets

Lingxue Zhang et al. PLoS Comput Biol. 2014.

Erratum in

  • PLoS Comput Biol. 2014 Apr;10(4):e1003608

Abstract

The standard approach for identifying gene networks is based on experimental perturbations of gene regulatory systems, such as gene knock-out experiments, followed by genome-wide profiling of differential gene expression. However, this approach is significantly limited in that it is not possible to perturb more than one or two genes simultaneously to discover complex gene interactions or to distinguish between direct and indirect downstream regulation of the differentially expressed genes. As an alternative, genetical genomics studies have been proposed that treat naturally occurring genetic variants as potential perturbants of the gene regulatory system and recover gene networks via analysis of population gene-expression and genotype data. Despite the many advantages of genetical genomics data analysis, its widespread application has been prevented by a computational challenge: the effects of multifactorial genetic perturbations must be decoded simultaneously from data. In this article, we propose a statistical framework for learning gene networks that overcomes the limitations of experimental perturbation methods and addresses the challenges of genetical genomics analysis. We introduce a new statistical model, called a sparse conditional Gaussian graphical model (sparse CGGM), and describe an efficient learning algorithm that simultaneously decodes the perturbations of the gene regulatory system by a large number of SNPs to identify a gene network along with the expression quantitative trait loci (eQTLs) that perturb this network. While our statistical model captures direct genetic perturbations of the gene network, by performing inference on the probabilistic graphical model we obtain detailed characterizations of how the direct SNP perturbation effects propagate through the gene network to perturb other genes indirectly. We demonstrate our statistical method using HapMap-simulated and yeast eQTL datasets. In particular, the yeast gene network identified computationally by our method under SNP perturbations is well supported by the results from experimental perturbation studies related to DNA replication stress response.
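
As a rough illustration of the model described above, the following sketch assumes the usual conditional Gaussian parameterization, p(y | x) proportional to exp(-0.5 y' theta_yy y - x' theta_xy y), which the abstract itself does not spell out. Under that assumption, theta_yy holds the gene-network edge weights, theta_xy holds the direct SNP perturbations, and the effect of each SNP on each gene after propagation through the network is -theta_xy inv(theta_yy). The matrices and the helper name `propagated_effects` are hypothetical toy choices, not estimates or code from the paper.

```python
import numpy as np

def propagated_effects(theta_yy, theta_xy):
    """Return a (SNPs x genes) matrix B with E[y | x] = B.T @ x.

    Assumes p(y | x) ~ exp(-0.5 y' theta_yy y - x' theta_xy y), so that
    B = -theta_xy @ inv(theta_yy). theta_yy must be symmetric positive definite.
    """
    # Solve theta_yy @ Z = theta_xy.T instead of forming the inverse explicitly.
    return -np.linalg.solve(theta_yy, theta_xy.T).T

# Hypothetical 3-gene chain network: gene0 - gene1 - gene2.
theta_yy = np.array([[ 2.0, -0.8,  0.0],
                     [-0.8,  2.0, -0.8],
                     [ 0.0, -0.8,  2.0]])

# One SNP that directly perturbs gene0 only (a single non-zero entry).
theta_xy = np.array([[1.0, 0.0, 0.0]])

B = propagated_effects(theta_yy, theta_xy)
print(np.round(B, 3))
```

In this toy chain the SNP touches only gene0 directly, yet every gene receives a non-zero propagated effect, and the magnitude shrinks with distance from gene0 along the chain.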

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Figure 1. Illustration of our statistical framework for learning a gene network with genetical genomics analysis.
(A) The graph structure of sparse CGGM for modeling a gene network perturbed by SNPs. The gene network is defined over gene-expression traits formula image's (formula image). The edges between gene-expression traits formula image's and SNPs formula image's (formula image) indicate the direct perturbations of the gene-expression traits by the given SNPs. The nodes for SNPs formula image's are shaded to show that the SNPs are conditioning variables in the conditional probability model. (B) Illustration of how the effects of the direct perturbation of the gene network by SNP formula image propagate through the gene network, as obtained by performing inference on sparse CGGM in Panel (A). While SNP formula image perturbs gene-expression traits formula image and formula image directly, this effect propagates through the network to perturb the expressions of other genes indirectly. The two directly perturbed genes formula image and formula image are shown as diamond-shaped nodes. The size and color-shade of each node indicate the strength of indirect perturbation of the given gene-expression trait by SNP formula image, with a larger and darker node for stronger perturbation. (C) The portion of the overall indirect SNP perturbation effects in Panel (B) that arose from the propagation of the direct perturbation of gene formula image by SNP formula image. (D) The portion of the overall indirect SNP perturbation effects in Panel (B) that arose from the propagation of the direct perturbation of gene formula image by SNP formula image. Within our statistical framework, we can perform inference on sparse CGGM in Panel (A) to obtain the indirect perturbations in Panel (B), and then decompose the indirect perturbations in Panel (B) into Panels (C) and (D) in a principled manner.
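
The decomposition described for Panels (C) and (D) can be sketched as follows, under the assumption (not stated in the caption) that the propagated effect of a SNP is -theta_xy inv(theta_yy) in the usual conditional Gaussian parameterization. That product is a sum over directly perturbed genes, so each term of the sum is one component. Whether this matches the authors' exact procedure is an assumption; the function name and the toy matrices are hypothetical.

```python
import numpy as np

def decompose_by_direct_target(theta_yy, theta_xy, snp):
    """Split one SNP's propagated effects by the gene it perturbs directly.

    Returns {gene index: length-J vector}. The vectors sum to the SNP's
    overall propagated effect, -theta_xy[snp] @ inv(theta_yy).
    """
    sigma = np.linalg.inv(theta_yy)   # gene covariance implied by the network
    direct = theta_xy[snp]            # direct perturbations by this SNP
    return {int(i): -direct[i] * sigma[i] for i in np.flatnonzero(direct)}

# Hypothetical 4-gene chain; one SNP directly perturbs genes 0 and 3.
theta_yy = np.array([[ 2.0, -0.8,  0.0,  0.0],
                     [-0.8,  2.0, -0.8,  0.0],
                     [ 0.0, -0.8,  2.0, -0.8],
                     [ 0.0,  0.0, -0.8,  2.0]])
theta_xy = np.array([[1.0, 0.0, 0.0, -0.5]])

parts = decompose_by_direct_target(theta_yy, theta_xy, snp=0)
total = sum(parts.values())                         # recombine the components
overall = -theta_xy[0] @ np.linalg.inv(theta_yy)    # effect computed in one step
print(np.allclose(total, overall))
```

Each entry of `parts` plays the role of one of the per-gene panels: it is the share of the SNP's effect on every gene that entered the network through one directly perturbed gene.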
Figure 2. Comparison of the behavior of sparse CGGM, MRCE, and GFlasso using a single simulated dataset.
A known sparse CGGM was used to generate the simulated dataset. The left, middle, and right columns show the absolute values of formula image for gene-network edge weights, formula image for the strengths of direct SNP perturbations, and formula image for the strengths of indirect SNP perturbations, respectively. In the middle and right columns, formula image and formula image are shown with gene-expression traits in rows and SNPs in columns. White pixels represent zero elements and darker pixels represent non-zero elements of the parameter matrix. The true model parameters are shown in Panel (A), and the estimated parameters are shown for (B) sparse CGGM, (C) MRCE, and (D) GFlasso. MRCE and GFlasso use the standard regression model for eQTL mapping and thus provide a single summary of SNP effects on gene expression in formula image. GFlasso focuses only on the task of eQTL mapping and thus does not provide an estimate of the gene network.
Figure 3. Precision-recall curves for estimated gene network structures using datasets simulated from sparse CGGMs.
Each panel shows the results from datasets simulated under different parameter settings for formula image (rows) and formula image (columns). Each precision-recall curve was obtained as an average over results from 50 simulated datasets. Simulated datasets with 30 gene-expression traits and 500 SNPs were used.
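
For readers unfamiliar with how such curves are scored, here is a minimal sketch of precision and recall for recovered network edges. It sweeps a threshold on the absolute estimated edge weights, which is an assumption: the caption does not say how the curves were traced (varying a regularization parameter is another common choice). The function name and the toy matrices are hypothetical.

```python
import numpy as np

def edge_precision_recall(true_theta, est_theta, thresholds):
    """Precision/recall of recovered edges at each threshold on |estimate|.

    Only the upper triangle (excluding the diagonal) is scored, since the
    network matrix is symmetric and diagonal entries are not edges.
    """
    iu = np.triu_indices_from(true_theta, k=1)
    truth = true_theta[iu] != 0
    scores = np.abs(est_theta[iu])
    curve = []
    for t in thresholds:
        pred = scores > t
        tp = np.sum(pred & truth)
        # Convention: precision is 1.0 when no edge is predicted.
        precision = tp / pred.sum() if pred.sum() else 1.0
        recall = tp / truth.sum()
        curve.append((float(precision), float(recall)))
    return curve

# Toy example: true edges 0-1 and 1-2; the estimate adds a weak spurious 0-2 edge.
true_theta = np.array([[2.0, -0.8, 0.0],
                       [-0.8, 2.0, -0.8],
                       [0.0, -0.8, 2.0]])
est_theta = np.array([[1.9, -0.9, 0.2],
                      [-0.9, 2.1, -0.4],
                      [0.2, -0.4, 1.8]])

curve = edge_precision_recall(true_theta, est_theta, thresholds=[0.1, 0.3, 0.5])
print(curve)
```

Averaging such curves over many simulated datasets, as the caption describes, would amount to repeating this computation per dataset and averaging precision at matched recall levels or thresholds.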
Figure 4. Precision-recall curves for estimated eQTLs using datasets simulated from sparse CGGMs.
Precision-recall curves for the recovery of eQTLs are shown, using the same simulated datasets and estimated models as in Figure 3. Each panel shows results from datasets simulated under different parameter settings for formula image (rows) and formula image (columns). For sparse CGGMs, each panel shows two precision-recall curves, one for eQTLs with direct perturbation effects formula image and another for indirect perturbation effects formula image, whereas for MRCE and GFlasso, the results are shown only for the association strengths formula image.
Figure 5. Prediction errors in simulation studies.
Prediction errors on independent test data, given the estimated models in Figures 3 and 4, are shown as boxplots for different parameter settings for formula image (rows) and formula image (columns).
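
A prediction error of this kind can be sketched as the mean squared error between held-out expression values and the conditional mean E[y | x]. The sketch below assumes the usual conditional Gaussian parameterization, in which E[y | x] = -inv(theta_yy) theta_xy' x, and it scores the true toy parameters in place of estimated ones, since model fitting is outside its scope. The error metric, the toy matrices, and the helper names are all assumptions rather than details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters: 3 genes in a chain, 2 SNPs.
theta_yy = np.array([[ 2.0, -0.8,  0.0],
                     [-0.8,  2.0, -0.8],
                     [ 0.0, -0.8,  2.0]])
theta_xy = np.array([[1.0, 0.0,  0.0],
                     [0.0, 0.0, -1.0]])

sigma = np.linalg.inv(theta_yy)    # conditional covariance of y given x
effects = -theta_xy @ sigma        # (SNPs x genes); E[y | x] = effects.T @ x

def simulate(n):
    """Draw genotypes coded 0/1/2 and expression values from the toy model."""
    x = rng.integers(0, 3, size=(n, 2)).astype(float)
    noise = rng.multivariate_normal(np.zeros(3), sigma, size=n)
    return x, x @ effects + noise

def prediction_error(x, y, effects):
    """Mean squared error of the conditional-mean prediction."""
    return float(np.mean((y - x @ effects) ** 2))

x_test, y_test = simulate(500)
err_model = prediction_error(x_test, y_test, effects)
err_null = prediction_error(x_test, y_test, np.zeros_like(effects))
print(err_model, err_null)
```

The second number is the error of a baseline that ignores the SNPs entirely, which gives a reference point for judging how much the SNP effects contribute to prediction.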
Figure 6. Results from datasets simulated from the standard linear regression model.
(A) Precision-recall curves for the recovery of gene network structure in formula image (or formula image). (B) Precision-recall curves for the recovery of eQTLs in formula image. (C) Prediction errors. The results were obtained as an average over 50 simulated datasets. Simulated datasets with 30 gene-expression traits and 500 SNPs were used.
Figure 7. Results from large-scale datasets simulated with sparse CGGMs.
(A) Precision-recall curves for the recovery of gene network structure in formula image (or formula image). (B) Precision-recall curves for the recovery of eQTLs in formula image. (C) Prediction errors. The results were obtained as an average over 30 simulated datasets. Simulated datasets with 500 gene-expression traits and 1,000 SNPs were used.
Figure 8. Computation time and scalability.
The computation time for a single run of sparse CGGM, MRCE, and GFlasso is shown for (A) varying the number of gene-expression traits formula image with the number of SNPs fixed at formula image and (B) varying the number of SNPs formula image with the number of gene-expression traits fixed at formula image. The results for MRCE were obtained using the approximate algorithm.
Figure 9. Comparison of SNP perturbation effect sizes on yeast gene network in the estimated sparse CGGM.
Histograms of the effect sizes of direct and indirect SNP perturbations in yeast are shown for (A) all eQTLs and (B) cis eQTLs identified by sparse CGGM.
Figure 10. Yeast gene-subnetwork for DNA replication stress response and its SNP perturbation estimated by sparse CGGM.
(A) The yeast subnetwork for DNA replication stress response and its direct/indirect perturbation by a SNP in the region of 1,095 kb on chromosome 4, as estimated by the sparse CGGM learning algorithm. This SNP directly perturbs TFS1, HSP26, RTN2, and GAD1, and the propagation of this direct perturbation to other parts of the network is obtained by performing inference on the estimated sparse CGGM. Edge thicknesses correspond to absolute values of edge weights in formula image. The diamond-shaped nodes represent gene-expression traits that are directly perturbed by the SNP, whereas the round and colored nodes represent those genes whose expressions are indirectly perturbed by the SNP. The color shade and size of each node indicate the strength of the SNP perturbation of that gene-expression trait. Our statistical framework allows the overall indirect SNP perturbation effects in Panel (A) to be decomposed into the components that arose from the propagation of the direct perturbation effects of each of (B) TFS1, (C) HSP26, (D) RTN2, and (E) GAD1 by the given SNP.
Figure 11. Decomposition of yeast gene-expression covariances for DNA replication stress response subnetwork using sparse CGGM.
(A) The covariance of yeast gene-expression data for the genes shown in Figure 10. Sparse CGGM allows the observed covariance in Panel (A) to be decomposed approximately into (B) the covariance induced by the gene network and (C) the covariance induced by SNP perturbations and their propagation through the network. Edge width corresponds to covariance or the strength of gene-gene interaction. We note that the edges show marginal dependencies in covariances rather than conditional dependencies in inverse covariances.
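
One way to read this decomposition is through the law of total covariance. Assuming the usual conditional Gaussian parameterization (not given in the caption), Cov(y) = inv(theta_yy) + E' Cov(x) E, where E = -theta_xy inv(theta_yy) is the matrix of propagated SNP effects. The first term is the network-induced part and the second is the SNP-induced part. The sketch below checks this identity on simulated toy data; all names and values are hypothetical, and whether this is exactly the authors' decomposition is an assumption.

```python
import numpy as np

def decompose_covariance(theta_yy, theta_xy, cov_x):
    """Return (network-induced, SNP-induced) parts of Cov(y)."""
    sigma = np.linalg.inv(theta_yy)     # covariance induced by the gene network
    effects = -theta_xy @ sigma         # propagated SNP effects (SNPs x genes)
    return sigma, effects.T @ cov_x @ effects

rng = np.random.default_rng(1)
theta_yy = np.array([[ 2.0, -0.8,  0.0],
                     [-0.8,  2.0, -0.8],
                     [ 0.0, -0.8,  2.0]])
theta_xy = np.array([[1.0, 0.0,  0.0],
                     [0.0, 0.0, -1.0]])

# Simulate genotypes (coded 0/1/2) and expression values from the toy model.
n = 20000
x = rng.integers(0, 3, size=(n, 2)).astype(float)
sigma = np.linalg.inv(theta_yy)
y = x @ (-theta_xy @ sigma) + rng.multivariate_normal(np.zeros(3), sigma, size=n)

network_part, snp_part = decompose_covariance(theta_yy, theta_xy,
                                              np.cov(x, rowvar=False))
observed = np.cov(y, rowvar=False)
max_gap = float(np.max(np.abs(observed - (network_part + snp_part))))
print(max_gap)
```

With finite samples the two parts match the observed covariance only up to sampling error, so `max_gap` is small but not exactly zero in this simulation.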

Grants and funding

This material is based upon work supported by an NSF CAREER Award under grant No. MCB-1149885, a Sloan Research Fellowship, and an Okawa Foundation Research Grant. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
