PLoS Comput Biol. 2019 Oct 25;15(10):e1007357.
doi: 10.1371/journal.pcbi.1007357. eCollection 2019 Oct.

SourceSet: A Graphical Model Approach to Identify Primary Genes in Perturbed Biological Pathways

Free PMC article

Elisa Salviato et al. PLoS Comput Biol. 2019.

Topological gene-set analysis has emerged as a powerful means for omic data interpretation. Although numerous methods for identifying dysregulated genes have been proposed, few of them aim to distinguish genes that are the real source of perturbation from those that merely respond to the signal dysregulation. Here, we propose a new method, called SourceSet, that distinguishes between the primary and the secondary dysregulation within a Gaussian graphical model context. The proposed method compares gene expression profiles in the control and in the perturbed condition and detects the differences in both the mean and the covariance parameters with a series of likelihood ratio tests. The resulting evidence is used to infer the primary and the secondary sets, i.e. the genes responsible for the primary dysregulation and the genes affected by the perturbation through network propagation. The proposed method demonstrates high specificity and sensitivity in different simulated scenarios and on several real biological case studies. To fit into the more traditional pathway analysis framework, the SourceSet R package also extends the analysis from a single pathway to multiple pathways and provides several graphical outputs, including a Cytoscape visualization for browsing the results.
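
As a rough illustration of the kind of test described above, the following Python sketch computes a likelihood ratio statistic for equality of both mean and covariance between two Gaussian samples and calibrates it by permutation. It is not the SourceSet algorithm, which applies a series of such tests over the components of a graph. The function names (`mle_cov`, `log_det`, `lrt_statistic`, `permutation_pvalue`, `simulate`), the two-variable simulated data, and the number of permutations are assumptions made only for this example.

```python
import math
import random


def mle_cov(rows):
    """Maximum likelihood covariance estimate (divides by n, not n - 1)."""
    n, p = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in rows) / n
             for j in range(p)] for i in range(p)]


def log_det(matrix):
    """Log-determinant of a positive definite matrix by Gaussian elimination."""
    a = [row[:] for row in matrix]
    total = 0.0
    for k in range(len(a)):
        pivot = a[k][k]
        if pivot <= 0.0:
            raise ValueError("covariance estimate is not positive definite")
        total += math.log(pivot)
        for i in range(k + 1, len(a)):
            factor = a[i][k] / pivot
            for j in range(k, len(a)):
                a[i][j] -= factor * a[k][j]
    return total


def lrt_statistic(x1, x2):
    """-2 log likelihood ratio for H0: same mean and covariance in both groups."""
    n1, n2 = len(x1), len(x2)
    pooled = log_det(mle_cov(x1 + x2))
    return (n1 + n2) * pooled - n1 * log_det(mle_cov(x1)) - n2 * log_det(mle_cov(x2))


def permutation_pvalue(x1, x2, n_perm=200, seed=0):
    """Calibrate the statistic by shuffling group labels (hypothetical settings)."""
    rng = random.Random(seed)
    observed = lrt_statistic(x1, x2)
    pooled = x1 + x2  # new list, so the inputs are not shuffled
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if lrt_statistic(pooled[:len(x1)], pooled[len(x1):]) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


def simulate(n, shift, rng):
    """Toy data: variable a is shifted directly, b depends on a."""
    rows = []
    for _ in range(n):
        a = rng.gauss(0.0, 1.0) + shift
        b = 0.5 * a + rng.gauss(0.0, 1.0)
        rows.append([a, b])
    return rows


rng = random.Random(1)
control = simulate(40, 0.0, rng)
perturbed = simulate(40, 2.0, rng)
stat, pval = permutation_pvalue(control, perturbed)
print(f"statistic = {stat:.2f}, permutation p-value = {pval:.4f}")
```

In this toy setting only the first variable is perturbed directly and the second changes because it depends on the first. A joint test like this one flags that the two conditions differ but cannot by itself say which variable is the source, which is the distinction the method above is designed to make.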

Conflict of interest statement

The authors have declared that no competing interests exist.


Fig 1. Basic workflow of the SourceSet algorithm for the analysis of a single graph.
Fig 2. Decomposable graph used in the simulation study.
Decomposable graph G consisting of |V| = 10 nodes, k = 5 cliques and m = 13 unique components.
Fig 3. Simulation study results under the alternative hypothesis in scenario 2 (top panel), in scenario 3 (middle panel), and in scenario 4 (bottom panel).
On the left, results based on the maximum likelihood estimate of the covariance matrix; on the right, results based on the regularized estimate. Each subpanel corresponds to a different combination of sample size (columns) and intensity of dysregulation (rows). Inside subpanels, for each node v ∈ V, a stacked bar chart shows the percentage of Monte Carlo runs in which v ∈ D^G (red, primary set), v ∈ D^G\D^G (orange, secondary set) and v ∈ V\D^G (green). The Monte Carlo error is bounded above by 2.2%.
Fig 4. SourceSet run-time analysis.
(Top panel) All pairwise relationships between the parameters that define the complexity of a graph (i.e., the number of edges, the number of distinct hypotheses and the cardinality of the largest clique) for 248 KEGG pathways. Six pathways—highlighted with filled circles of different colors—of increasing complexity were chosen for the run-time analysis (see also Table 4). (Bottom panel) Run-time for the six pathways as a function of the sample size. Permutation tests and asymptotic tests are plotted with circles and squares, respectively.
Fig 5. Visual summary of the source set analysis results for the STAT3 dataset.
(Left) KEGG pathways containing STAT3 and all genes appearing in at least one estimated source set are cross-tabulated. The color of the cell (i, j) shows the relation between the i-th gene and the j-th pathway: blue if the gene belongs to D^G (primary source set), light blue if it belongs to D^G\D^G (secondary set), grey if the gene participates in the considered pathway, and white otherwise (the i-th gene does not belong to the j-th pathway). (Right) This plot features KEGG pathways containing STAT3 and having a non-empty estimated source set, as well as all genes appearing in at least one estimated source set. The three levels are to be read from left to right. A link between left element a and right element b must be read as a being contained in b. A module is defined as a subset of a source set belonging to a connected subgraph of the associated pathway.
Fig 6. SourceSet analysis results for the chimera case study.
Boxplots of score (left panel) and relevance (right panel) indices for genes annotated in at least two pathways of the whole KEGG collection (N = 248). The size of each point is proportional to the number of pathways in which the associated gene is annotated. ABL1 and BCR (i.e., chimera genes) are highlighted with blue and light blue dots, respectively.
Fig 7. Source set analysis results for the prostate cancer study.
The plot shows the main cluster of the graphical union of the source sets of the analyzed pathways, obtained through the sourceUnionCytoscape function (Cytoscape version 3.6.1). The size of each node is proportional to the number of times the gene appears in a source set. The color is associated with the relevance index: higher values are indicated by darker blue.
Fig 8. Mapggm analysis results for the chimera case study.
Rank (x-axis) according to the non-sequential NF test statistic (y-axis) for the 67 genes annotated in the Chronic myeloid leukemia KEGG pathway. ABL1 (rank = 27) and BCR (rank = 37) genes are highlighted with blue dots.
Fig 9. Visual summary of the results of differential expression and source set analysis.
The stacked bars represent the proportion of genes flagged only by the differential expression analysis, split into those annotated in at least one KEGG pathway (violet) and those not annotated in any KEGG pathway (pink), only by the source set analysis (blue), or by both, in the comparison between the two considered conditions. Genes with p-values ≤ 0.05 are flagged as differentially expressed (DE), and those contained in the source set estimate of at least one analyzed pathway as primary.



Grant support

This work was supported by the Italian Association for Cancer Research (IG17185, IG21837) and the Norwegian Research Council (grant no. 248804). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.