Review
Genome Biol., 20 (1), 203

Identifying Significantly Impacted Pathways: A Comprehensive Review and Assessment

Tuan-Minh Nguyen et al. Genome Biol.

Erratum in

Abstract

Background: Many high-throughput experiments compare two phenotypes, such as disease vs. healthy, with the goal of understanding the underlying biological phenomena characterizing the given phenotype. Because of the importance of this type of analysis, more than 70 pathway analysis methods have been proposed so far. These fall into two main categories: non-topology-based (non-TB) and topology-based (TB). Although some review papers discuss this topic from different aspects, there is no systematic, large-scale assessment of such methods. Furthermore, the majority of pathway analysis approaches rely on the assumption that p values are uniformly distributed under the null hypothesis, which is often not true.

Results: This article presents the most comprehensive comparative study on pathway analysis methods available to date. We compare the actual performance of 13 widely used pathway analysis methods in over 1085 analyses. These comparisons were performed using 2601 samples from 75 human disease data sets and 121 samples from 11 knockout mouse data sets. In addition, we investigate the extent to which each method is biased under the null hypothesis. Together, these data and results constitute a reliable benchmark against which future pathway analysis methods could and should be tested.

Conclusion: Overall, the results show that no method is perfect. In general, TB methods appear to perform better than non-TB methods. This is somewhat expected, since TB methods take into consideration the structure of the pathway, which is meant to describe the underlying phenomena. We also discover that most, if not all, listed approaches are biased and can produce skewed results under the null.

Keywords: Bias; Metabolic pathways; Network topology; Pathway analysis; Signaling pathways; Statistical significance.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
The process of evaluating pathway analysis methods based on their ability to identify target pathways. Each pathway analysis method is applied to 75 data sets. Methods are evaluated based on their ability to rank the target pathways. In this example, a data set of Alzheimer’s disease is examined, and thus, the target pathway is “Alzheimer’s disease.” Each method produces lists of ranks and p values of the target pathways, which are then used to assess its performance
Fig. 2
The ranks and p values of target pathways derived by 13 methods. We run each method on 75 human benchmark data sets. The resulting ranks and p values of target pathways are plotted in violin plots. The horizontal axis shows the pathway analysis methods in both subfigures. The vertical axis in (a) represents the ranks, while the vertical axis in (b) corresponds to p values of the target pathways. Hereafter, the labels of non-TB and TB methods are written in blue and red, respectively
Fig. 3
The performance of non-TB and TB methods in terms of ranks (a) and p values (b) of target pathways. We collect all the ranks and p values in Fig. 2 and divide them into two groups: non-TB and TB methods. Here, lower is better for both ranks and p values. The WRS test indicates that TB methods achieved significantly lower ranks (WRS p value = 8.771E−3) and p values (WRS p value = 4.51E−4) than non-TB methods
Fig. 4
The AUCs of eight methods using 11 KO data sets (higher is better). CePaORA, CePaGSA, and PathNet are left out in this comparison because they do not support mouse pathways. ROntoTools has the highest median value of AUC, followed by GSEA and SPIA (a). Overall, the AUCs obtained by TB methods are better than those from non-TB ones (Wilcoxon p value = 0.009) (b)
Fig. 5
The process of creating the null distributions of p values for all pathways by a given pathway analysis method. Control samples from data sets are gathered to construct a control sample pool. To create the null distribution of p values of all pathways under the null for each method, more than 2000 iterations were performed. The data sets used in these iterations are generated by randomly selecting samples from the control sample pool
Fig. 6
The number of biased pathways calculated based on Pearson’s moment coefficient. Under the true null hypothesis, an ideal method would produce a uniform distribution of p values from 0 to 1 for every pathway. Here, thresholds of Pearson’s moment coefficient of 0.1 and −0.1 are used to determine if the empirical distribution of p values is biased toward 0 or 1, respectively. a The total number of biased pathways (toward either 0 or 1) produced by each method. Each method, except GSEA, has at least 66 biased pathways. b The number of pathways biased toward 0 (false positives) produced by different methods. FE produces the highest number (137 out of 150 pathways) of false positives, followed by WRS (114 out of 150) and CePaGSA (112 out of 186). c The number of pathways biased toward 1 (false negatives) produced by different methods. PathNet produces the highest number (129 out of 130) of false negative pathways. The methods in red are TB methods. The methods in blue are non-TB methods
Fig. 7
The number of methods biased for each pathway. The y-axis shows the KEGG pathways, while the x-axis indicates the number of methods biased toward 0 and 1, respectively. Each horizontal line represents a pathway. The lengths of the blue and red lines show the number of methods in this study biased toward 0 and 1, respectively. Pathways are sorted by the number of methods biased. There is no pathway that is unbiased for all methods. The top 10 least and top 10 most biased pathways are shown by name
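
The bias criterion described in the Fig. 6 caption can be illustrated with a minimal sketch. It assumes that "Pearson's moment coefficient" means the population skewness (third central moment divided by the 1.5 power of the second), and that a pathway's null p values have already been collected, for example by the resampling procedure of Fig. 5. The function names `skewness` and `classify_bias` are hypothetical and are not taken from the paper's code.

```python
from statistics import fmean


def skewness(values):
    """Population skewness m3 / m2**1.5 (assumed form of Pearson's moment coefficient)."""
    n = len(values)
    mean = fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        return 0.0  # all values identical: no skew can be measured
    return m3 / m2 ** 1.5


def classify_bias(p_values, threshold=0.1):
    """Label one pathway's null p-value distribution using the 0.1 / -0.1 thresholds."""
    s = skewness(p_values)
    if s > threshold:
        # Right-skewed: mass piles up near 0, i.e. a tendency toward false positives.
        return "biased toward 0"
    if s < -threshold:
        # Left-skewed: mass piles up near 1, i.e. a tendency toward false negatives.
        return "biased toward 1"
    return "unbiased"


# Illustrative inputs only (synthetic, not data from the study).
uniform_like = [(i + 0.5) / 1000 for i in range(1000)]
near_zero = [u * u for u in uniform_like]      # concentrated near 0
near_one = [1 - u * u for u in uniform_like]   # concentrated near 1

print(classify_bias(uniform_like))
print(classify_bias(near_zero))
print(classify_bias(near_one))
```

A symmetric, uniform-like set of p values has skewness close to zero and falls between the two thresholds, whereas distributions concentrated near 0 or near 1 cross the positive or negative threshold, respectively.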
