Imputing gene expression to maximize platform compatibility

Bioinformatics. 2017 Feb 15;33(4):522-528. doi: 10.1093/bioinformatics/btw664.


Microarray measurements of gene expression constitute a large fraction of publicly shared biological data, and are available in the Gene Expression Omnibus (GEO). Many studies use GEO data to shape hypotheses and improve statistical power. Within GEO, the Affymetrix HG-U133A and HG-U133 Plus 2.0 are the two most commonly used microarray platforms for human samples; the HG-U133 Plus 2.0 platform contains 54 220 probes and the HG-U133A array contains a proper subset (21 722 probes). When different platforms are involved, the subset of common genes is most easily compared. This approach results in the exclusion of substantial measured data and can limit downstream analysis. To predict the expression values for the genes unique to the HG-U133 Plus 2.0 platform, we constructed a series of gene expression inference models based on genes common to both platforms. Our model predicts gene expression values that are within the variability observed in controlled replicate studies and are highly correlated with measured data. Using six previously published studies, we also demonstrate the improved performance of the enlarged feature space generated by our model in downstream analysis.
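The idea of inferring platform-specific genes from genes common to both platforms can be illustrated with a deliberately simplified sketch. It is not the model from the paper or the affyImpute implementation: it assumes one predictor per target probe, chosen by absolute Pearson correlation, with an ordinary least-squares line fitted on samples where both probes were measured. The probe names (`probe_A`, `probe_B`, `probe_X`), the function names, and the toy values are all hypothetical.

```python
from statistics import mean


def _centered_sums(x, y):
    """Return (sxx, syy, sxy) of deviations from the means of x and y."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxx, syy, sxy


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    sxx, syy, sxy = _centered_sums(x, y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def fit_line(x, y):
    """Least-squares fit y = intercept + slope * x for a single predictor."""
    sxx, _, sxy = _centered_sums(x, y)
    slope = sxy / sxx if sxx else 0.0
    return mean(y) - slope * mean(x), slope


def train(common, target):
    """Fit one model per target probe.

    common / target: dict mapping probe name -> expression values across the
    same training samples (samples where both probe sets were measured).
    Assumption: a single best-correlated common probe is used as predictor.
    """
    models = {}
    for t_name, t_vals in target.items():
        best = max(common, key=lambda c: abs(pearson(common[c], t_vals)))
        intercept, slope = fit_line(common[best], t_vals)
        models[t_name] = (best, intercept, slope)
    return models


def impute(models, sample):
    """Predict target probes for a sample that only has the common probes."""
    return {t: a + b * sample[c] for t, (c, a, b) in models.items()}


# Toy training data (hypothetical values, four samples).
common = {"probe_A": [1.0, 2.0, 3.0, 4.0], "probe_B": [4.0, 1.0, 3.0, 2.0]}
target = {"probe_X": [3.0, 5.0, 7.0, 9.0]}

models = train(common, target)
predicted = impute(models, {"probe_A": 5.0, "probe_B": 2.0})
```

A realistic model would likely combine many common probes per target and be validated against replicate variability, as the abstract describes; the single-predictor form is used here only to keep the example short.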

Availability and implementation: The gene inference model described in this paper is available as an R package (affyImpute), which can be downloaded at


Supplementary information: Supplementary data are available at Bioinformatics online.

Publication types

  • Research Support, N.I.H., Extramural
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Gene Expression Profiling
  • Gene Expression Regulation*
  • Genomics
  • Humans
  • Models, Genetic*
  • Oligonucleotide Array Sequence Analysis