Rank-based methods as a non-parametric alternative of the T-statistic for the analysis of biological microarray data

Rainer Breitling; Pawel Herzyk

doi:10.1142/s0219720005001442

J Bioinform Comput Biol. 2005 Oct;3(5):1171-89. doi: 10.1142/s0219720005001442.

Authors

Rainer Breitling¹, Pawel Herzyk

Affiliation

¹ Bioinformatics Research Centre and Molecular Plant Sciences Group, Institute of Biomedical and Life Sciences, University of Glasgow, Glasgow, G12 8QQ, United Kingdom. r.breitling@bio.gla.ac.uk

PMID: 16278953

Abstract

We have recently introduced a rank-based test statistic, RankProducts (RP), as a new non-parametric method for detecting differentially expressed genes in microarray experiments. It has been shown to generate surprisingly good results with biological datasets. The basis for this performance and the limits of the method are, however, poorly understood. Here we explore the performance of such rank-based approaches under a variety of conditions using simulated microarray data, and compare them with classical Wilcoxon rank sums and t-statistics, which form the basis of most alternative techniques for detecting differential gene expression. We show that for realistic simulated microarray datasets, RP is more powerful and accurate for sorting genes by differential expression than t-statistics or Wilcoxon rank sums, in particular for replicate numbers below 10, which are most commonly used in biological experiments. Its relative performance is particularly strong when the data are contaminated by non-normal random noise or when the samples are very inhomogeneous, e.g. because they come from different time points or contain a mixture of affected and unaffected cells. However, RP assumes equal measurement variance for all genes and tends to give overly optimistic p-values when this assumption is violated. It is therefore essential that proper variance-stabilizing normalization is performed on the data before calculating the RP values. Where this is impossible, another rank-based variant of RP (average ranks) provides a useful alternative with very similar overall performance. The Perl scripts implementing the simulation and evaluation are available upon request. Implementations of the RP method are available for download from the authors' website (http://www.brc.dcs.gla.ac.uk/glama).
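
The abstract names the RP and average-rank statistics without defining them. As a minimal illustrative sketch, assuming the usual definition from the authors' earlier RankProducts work (each gene is ranked within every replicate by its log fold change, and RP is the geometric mean of those ranks), the two scores could be computed as below. The function names, the input layout, and the tie handling are assumptions made for this example; this is not the authors' Perl implementation, and the permutation step used to obtain p-values is omitted.

```python
import math


def ranks_descending(values):
    """Rank 1 goes to the largest value. Ties are broken by input order,
    which is a simplification chosen for this sketch."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0] * len(values)
    for position, gene_index in enumerate(order, start=1):
        ranks[gene_index] = position
    return ranks


def rank_scores(log_ratios):
    """log_ratios[g][r] is the log fold change of gene g in replicate r
    (hypothetical layout). Returns (rank_product, average_rank) per gene."""
    n_genes = len(log_ratios)
    n_reps = len(log_ratios[0])
    # Rank all genes separately within each replicate.
    per_replicate = [
        ranks_descending([log_ratios[g][r] for g in range(n_genes)])
        for r in range(n_reps)
    ]
    rank_product, average_rank = [], []
    for g in range(n_genes):
        gene_ranks = [per_replicate[r][g] for r in range(n_reps)]
        # Geometric mean of the ranks, computed in log space.
        rank_product.append(
            math.exp(sum(math.log(x) for x in gene_ranks) / n_reps)
        )
        # Arithmetic mean of the same ranks (the "average ranks" variant).
        average_rank.append(sum(gene_ranks) / n_reps)
    return rank_product, average_rank


# Made-up toy data: 4 genes, 3 replicates.
example = [
    [2.0, 1.8, 2.2],
    [0.1, -0.2, 0.0],
    [-1.5, -1.0, -2.0],
    [0.5, 2.5, 0.4],
]
rp, avg = rank_scores(example)
```

Smaller scores indicate genes that are consistently near the top of the list across replicates. Both scores depend on the data only through ranks, so they are unaffected by the absolute scale of the measurements; the ranking itself is done on raw fold changes, which is why the abstract stresses variance stabilization before computing RP.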

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms*
Artifacts*
Data Interpretation, Statistical
Gene Expression Profiling / methods*
Models, Genetic*
Models, Statistical
Molecular Biology / methods
Oligonucleotide Array Sequence Analysis / methods*
Pattern Recognition, Automated / methods*
Statistics, Nonparametric