CellCODE: a robust latent variable approach to differential expression analysis for heterogeneous cell populations

Bioinformatics. 2015 May 15;31(10):1584-91. doi: 10.1093/bioinformatics/btv015. Epub 2015 Jan 11.


Motivation: Identifying alterations in gene expression associated with different clinical states is important for the study of human biology. However, clinical samples used in gene expression studies are often derived from heterogeneous mixtures with variable cell-type composition, complicating statistical analysis. Considerable effort has been devoted to modeling sample heterogeneity, and presently, there are many methods that can estimate cell proportions or pure cell-type expression from mixture data. However, there is no method that comprehensively addresses mixture analysis in the context of differential expression without relying on additional proportion information, which can be inaccurate and is frequently unavailable.

Results: In this study, we consider a clinically relevant situation where neither accurate proportion estimates nor pure cell expression is of direct interest, but where we are rather interested in detecting and interpreting relevant differential expression in mixture samples. We develop a method, Cell-type COmputational Differential Estimation (CellCODE), that addresses the specific statistical question directly, without requiring a physical model for mixture components. Our approach is based on latent variable analysis and is computationally transparent; it requires no additional experimental data, yet outperforms existing methods that use independent proportion measurements. CellCODE has few parameters that are robust and easy to interpret. The method can be used to track changes in proportion, improve power to detect differential expression and assign the differentially expressed genes to the correct cell type.
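The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general latent variable idea, not the authors' implementation. It assumes a set of putative marker genes for one cell type is available, takes the first right singular vector of the gene-centered marker matrix as a surrogate for that cell type's relative proportion, and includes the surrogate as a covariate when testing a gene for a group effect. The function names (`surrogate_proportion`, `group_t_stat`), the simulated data and all parameter values are hypothetical and chosen for illustration.

```python
import numpy as np


def surrogate_proportion(marker_expr):
    """Estimate a per-sample surrogate for relative cell-type proportion.

    marker_expr: (n_genes, n_samples) log-expression of putative marker genes.
    Returns the first right singular vector of the gene-centered matrix.
    """
    centered = marker_expr - marker_expr.mean(axis=1, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    spv = vt[0]
    # The sign of a singular vector is arbitrary; orient it so that it
    # increases with the average centered marker expression.
    if np.dot(spv, centered.mean(axis=0)) < 0:
        spv = -spv
    return spv


def group_t_stat(y, group, covariates=None):
    """Ordinary least squares t-statistic for the group coefficient.

    y: (n_samples,) expression of one gene.
    group: (n_samples,) 0/1 group indicator.
    covariates: optional (n_samples,) or (k, n_samples) adjustment variables.
    """
    n = len(y)
    cols = [np.ones(n), np.asarray(group, dtype=float)]
    if covariates is not None:
        cols.extend(np.atleast_2d(covariates))
    X = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta[1] / np.sqrt(cov[1, 1])


# Simulated example (all values are assumptions made for illustration).
rng = np.random.default_rng(0)
n_samples = 60
group = np.repeat([0, 1], n_samples // 2)

# Unobserved log-scale proportion of one cell type, unrelated to group.
true_prop = rng.normal(0.0, 1.0, n_samples)

# Ten marker genes whose expression tracks the hidden proportion.
loadings = rng.uniform(0.5, 1.5, size=10)
markers = np.outer(loadings, true_prop) + rng.normal(0.0, 0.3, (10, n_samples))

# A gene with a genuine group effect that is masked by proportion variation.
gene = 1.0 * group + 2.0 * true_prop + rng.normal(0.0, 0.3, n_samples)

spv = surrogate_proportion(markers)
corr = np.corrcoef(spv, true_prop)[0, 1]
t_raw = group_t_stat(gene, group)
t_adj = group_t_stat(gene, group, covariates=spv)

print(f"correlation of surrogate with hidden proportion: {corr:.3f}")
print(f"group t-statistic without adjustment: {t_raw:.2f}")
print(f"group t-statistic with surrogate covariate: {t_adj:.2f}")
```

In this toy setting the surrogate absorbs most of the variance caused by composition differences, which is the sense in which such an adjustment can improve power to detect differential expression without any independent proportion measurements. Real data would involve several cell types, marker selection and multiple-testing control, none of which is shown here.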

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • B-Lymphocytes / drug effects
  • B-Lymphocytes / metabolism
  • B-Lymphocytes / virology
  • Cell Lineage / genetics*
  • Computational Biology / methods*
  • Data Interpretation, Statistical*
  • Gene Expression Regulation*
  • High-Throughput Nucleotide Sequencing / methods*
  • Humans
  • Influenza Vaccines / administration & dosage
  • Models, Statistical
  • Neutrophils / cytology
  • Neutrophils / immunology
  • Neutrophils / metabolism
  • Sequence Analysis, RNA / methods
  • Software*
  • T-Lymphocytes / drug effects
  • T-Lymphocytes / metabolism
  • T-Lymphocytes / virology
  • Time Factors


Substances

  • Influenza Vaccines