Gene set enrichment for reproducible science: comparison of CERNO and eight other algorithms

Joanna Zyla; Michal Marczyk; Teresa Domaszewska; Stefan H E Kaufmann; Joanna Polanska; January Weiner

doi:10.1093/bioinformatics/btz447

Bioinformatics. 2019 Dec 15;35(24):5146-5154. doi: 10.1093/bioinformatics/btz447.

Authors

Joanna Zyla¹ ², Michal Marczyk¹ ³, Teresa Domaszewska², Stefan H E Kaufmann², Joanna Polanska¹, January Weiner²

Affiliations

¹ Data Mining Group, Faculty of Automatic Control, Electronic and Computer Science, Institute of Automatic Control, Silesian University of Technology, Gliwice, Poland.
² Department of Immunology, Max Planck Institute for Infection Biology, Berlin, Germany.
³ Yale School of Medicine, Yale Cancer Center, New Haven, CT 06510, USA.

Abstract

Motivation: Analysis of gene set (GS) enrichment is an essential part of functional omics studies. Here, we complement the established evaluation metrics of GS enrichment algorithms with a novel approach to assess the practical reproducibility of scientific results obtained from GS enrichment tests when applied to related data from different studies.

Results: We evaluated nine algorithms, eight established and one novel, for reproducibility, sensitivity, prioritization, false positive rate and computational time. The novel algorithm is Coincident Extreme Ranks in Numerical Observations (CERNO), a flexible and fast algorithm based on modified Fisher P-value integration. Using real-world datasets, we demonstrate that CERNO is robust to the choice of ranking metric, as well as to sample size and GS size. CERNO had the highest reproducibility while remaining sensitive, specific and fast. In the overall ranking, Pathway Analysis with Down-weighting of Overlapping Genes, CERNO and over-representation analysis performed best, while CERNO and GeneSetTest scored high in terms of reproducibility.
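
The abstract describes CERNO only as "modified Fisher P-value integration" and gives no formula. The Python sketch below illustrates one common reading of that idea, under stated assumptions: each gene's rank divided by the total number of ranked genes is treated as a pseudo P-value, the values for the genes in a set are combined as in Fisher's method (minus two times the sum of logarithms), and the result is compared with a chi-square distribution with twice as many degrees of freedom as there are genes in the set. The function name `cerno_sketch` and its arguments are hypothetical and are not part of the tmod package; the code is an illustration, not the authors' implementation.

```python
import math


def cerno_sketch(ranked_genes, gene_set):
    """Illustrative Fisher-style rank test (hypothetical helper, not the tmod API).

    ranked_genes: gene identifiers ordered from most to least interesting.
    gene_set:     identifiers belonging to the gene set under test.
    Returns (statistic, p_value).
    """
    n_total = len(ranked_genes)
    # 1-based rank of every gene; rank 1 is the top of the list.
    rank_of = {gene: i + 1 for i, gene in enumerate(ranked_genes)}
    ranks = [rank_of[g] for g in gene_set if g in rank_of]
    k = len(ranks)
    if k == 0:
        raise ValueError("no gene from the set is present in the ranked list")

    # Assumption: rank / N plays the role of a P-value in Fisher's method.
    statistic = -2.0 * sum(math.log(r / n_total) for r in ranks)

    # Chi-square survival function with 2k degrees of freedom.
    # For even degrees of freedom it has the closed form
    # exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!
    half = statistic / 2.0
    term = 1.0
    total = 1.0
    for j in range(1, k):
        term *= half / j
        total += term
    p_value = math.exp(-half) * total
    return statistic, p_value


if __name__ == "__main__":
    genes = [f"g{i}" for i in range(1, 101)]
    print(cerno_sketch(genes, ["g1", "g2", "g3"]))     # set at the top of the list
    print(cerno_sketch(genes, ["g98", "g99", "g100"]))  # set at the bottom
```

Because the statistic depends only on ranks, any ordering of the genes can serve as input in this sketch, for example ordering by P-value or by effect size. A set concentrated near the top of the list yields a small P-value, and a set near the bottom yields a P-value close to 1.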

Availability and implementation: The tmod package implementing the CERNO algorithm is available from CRAN (cran.r-project.org/web/packages/tmod/index.html), and an online implementation can be found at http://tmod.online/. The datasets analyzed in this study are publicly available in the KEGGdzPathwaysGEO and KEGGandMetacoreDzPathwaysGEO R packages and in the GEO repository.

Supplementary information: Supplementary data are available at Bioinformatics online.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms*
Reproducibility of Results
Software*