A comparison of batch effect removal methods for enhancement of prediction performance using MAQC-II microarray gene expression data

J Luo; M Schumacher; A Scherer; D Sanoudou; D Megherbi; T Davison; T Shi; W Tong; L Shi; H Hong; C Zhao; F Elloumi; W Shi; R Thomas; S Lin; G Tillinghast; G Liu; Y Zhou; D Herman; Y Li; Y Deng; H Fang; P Bushel; M Woods; J Zhang

doi:10.1038/tpj.2010.57

Pharmacogenomics J. 2010 Aug;10(4):278-91. doi: 10.1038/tpj.2010.57.

Authors

J Luo¹, M Schumacher, A Scherer, D Sanoudou, D Megherbi, T Davison, T Shi, W Tong, L Shi, H Hong, C Zhao, F Elloumi, W Shi, R Thomas, S Lin, G Tillinghast, G Liu, Y Zhou, D Herman, Y Li, Y Deng, H Fang, P Bushel, M Woods, J Zhang

Affiliation

¹ Systems Analytics Inc., Waltham, MA, USA.

Abstract

Batch effects are systematic non-biological differences between batches (groups) of samples in microarray experiments, arising from causes such as differences in sample preparation and hybridization protocols. Previous work has focused mainly on developing methods for effective batch effect removal. However, the impact of these methods on cross-batch prediction performance, one of the most important goals in microarray-based applications, has not been addressed. This paper uses a broad selection of data sets from the Microarray Quality Control Phase II (MAQC-II) effort, generated on three microarray platforms and affected by different causes of batch effects, to assess the efficacy of batch effect removal. Two data sets from cross-tissue and cross-platform experiments are also included. In the 120 cases studied, using support vector machines (SVM) and k-nearest neighbors (KNN) as classifiers and the Matthews correlation coefficient (MCC) as the performance metric, we find that the Ratio-G, Ratio-A, EJLR, mean-centering and standardization methods perform better than or equivalently to no batch effect removal in 89, 85, 83, 79 and 75% of the cases, respectively, suggesting that applying these methods is generally advisable and that ratio-based methods are preferred.
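As an illustration of two of the simpler methods named in the abstract and of the performance metric, the following is a minimal sketch using the textbook definitions of per-batch mean-centering, per-batch standardization, and MCC. It is not the authors' implementation: the function names (`mean_center`, `standardize`, `mcc`), the samples-by-genes array layout, and the use of the sample standard deviation are assumptions made for this example, and the ratio-based and EJLR methods are not shown because the abstract does not define them.

```python
import numpy as np


def mean_center(X, batches):
    """Subtract each gene's per-batch mean (rows = samples, columns = genes)."""
    X = np.asarray(X, dtype=float)
    batches = np.asarray(batches)
    out = np.empty_like(X)
    for b in np.unique(batches):
        rows = batches == b
        out[rows] = X[rows] - X[rows].mean(axis=0)
    return out


def standardize(X, batches):
    """Per batch, center each gene and scale it to unit sample standard deviation."""
    X = np.asarray(X, dtype=float)
    batches = np.asarray(batches)
    out = np.empty_like(X)
    for b in np.unique(batches):
        rows = batches == b
        mean = X[rows].mean(axis=0)
        sd = X[rows].std(axis=0, ddof=1)  # assumption: sample (n-1) standard deviation
        sd[sd == 0] = 1.0                 # leave constant genes unscaled
        out[rows] = (X[rows] - mean) / sd
    return out


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels coded as 0/1."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    denom = ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
    # Common convention: report 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Toy data: two batches of two samples each, two genes.
X = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0], [30.0, 40.0]]
batches = [0, 0, 1, 1]
centered = mean_center(X, batches)
scaled = standardize(X, batches)
```

In a cross-batch prediction setting, a transform of this kind would be applied to the training and test batches before fitting a classifier such as SVM or KNN, and MCC would then be computed on the test-batch predictions.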

Publication types

Comparative Study

MeSH terms

Algorithms
Breast Neoplasms / drug therapy
Breast Neoplasms / genetics
Databases, Genetic
Female
Gene Expression Profiling / methods
Gene Expression Profiling / standards
Humans
Liver Neoplasms / drug therapy
Liver Neoplasms / genetics
Oligonucleotide Array Sequence Analysis / methods*
Oligonucleotides
Predictive Value of Tests
Quality Control
Reference Standards
Reproducibility of Results
Toxicogenetics / statistics & numerical data

Substances

Oligonucleotides