Bioinformatics. 2015 May 15;31(10):1584-91.
doi: 10.1093/bioinformatics/btv015. Epub 2015 Jan 11.

CellCODE: a robust latent variable approach to differential expression analysis for heterogeneous cell populations

Free PMC article

Maria Chikina et al. Bioinformatics. 2015.

Abstract

Motivation: Identifying alterations in gene expression associated with different clinical states is important for the study of human biology. However, clinical samples used in gene expression studies are often derived from heterogeneous mixtures with variable cell-type composition, complicating statistical analysis. Considerable effort has been devoted to modeling sample heterogeneity, and presently, there are many methods that can estimate cell proportions or pure cell-type expression from mixture data. However, there is no method that comprehensively addresses mixture analysis in the context of differential expression without relying on additional proportion information, which can be inaccurate and is frequently unavailable.

Results: In this study, we consider a clinically relevant situation where neither accurate proportion estimates nor pure cell expression is of direct interest, but where we are rather interested in detecting and interpreting relevant differential expression in mixture samples. We develop a method, Cell-type COmputational Differential Estimation (CellCODE), that addresses the specific statistical question directly, without requiring a physical model for mixture components. Our approach is based on latent variable analysis and is computationally transparent; it requires no additional experimental data, yet outperforms existing methods that use independent proportion measurements. CellCODE has few parameters that are robust and easy to interpret. The method can be used to track changes in proportion, improve power to detect differential expression and assign the differentially expressed genes to the correct cell type.

Figures

Fig. 1.
Evaluating consistency of surrogate proportion estimates in the Shen-Orr dataset. The heatmap represents correlation coefficients between all pairs of marker genes, with red (darker) representing high correlation and green (lighter) representing anti-correlation. Marker genes initially selected for a specific cell type are indicated by colors as shown in the key. The CellCODE surrogate proportion variables (SPVs, indicated by black) are also included. The heatmap is clustered with 1 − ρ as the distance metric. Despite some apparent inconsistencies in marker assignments, distinct clusters of high correlation emerge for each cell type, and each SPV reliably associates with the correct cluster. (Color version of this figure is available at Bioinformatics online.)
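The 1 − ρ distance used to cluster the heatmap can be sketched as follows. This is a minimal illustration, assuming ρ is the Pearson correlation between expression profiles; the gene names and values are invented for the example, and `pearson` and `correlation_distance` are hypothetical helper names, not part of CellCODE.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_distance(profiles):
    """Return the nested dict of 1 - rho distances between named profiles."""
    names = list(profiles)
    return {
        a: {b: 1.0 - pearson(profiles[a], profiles[b]) for b in names}
        for a in names
    }

# Invented marker-gene profiles across five samples (illustration only).
profiles = {
    "T_marker_1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "T_marker_2": [1.1, 2.3, 2.9, 4.2, 4.8],
    "B_marker_1": [5.0, 4.0, 3.0, 2.0, 1.0],
}
dist = correlation_distance(profiles)
```

Under this metric, strongly correlated markers sit near distance 0 and anti-correlated markers near distance 2, which is why markers of the same cell type group together in a clustered heatmap.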
Fig. 2.
CellCODE SPVs track Coulter counter measurements
Fig. 3.
Recovery of cell proportions from simulated expression data. The SPVs recovered using three different approaches are plotted against the true proportions used for simulation (x axis). We simulated two clinical groups, plotted in red (grey) and black, with global proportion differences (in neutrophils and T cells) and true transcriptional differences coming from the T-cell population. We specifically enforced that 30% of the T-cell markers are differentially expressed (DE). Generally, all estimates track known proportions, even for very rare cell types. The relationship is non-linear due to log transformation of expression values. However, aside from providing high correlation, the ideal estimation procedure should be unbiased, resulting in red and black points falling on a single curve. Computing eigengenes with the raw expression values (first row) or expression normalized across clinical groups (second row) leads to biased proportion estimates. CellCODE (third row) is able to provide accurate estimates that track global proportion changes while being agnostic to transcriptional alterations within individual cell types. (Color version of this figure is available at Bioinformatics online.)
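The non-linearity mentioned in the caption can be illustrated with a small sketch. It assumes a simple linear mixing model (mixture expression is the proportion-weighted sum of pure cell-type expression) followed by a log2 transform; the two-gene, two-cell-type values are invented, and `mix` is a hypothetical helper rather than the authors' simulation code.

```python
import math

def mix(pure, proportions):
    """Proportion-weighted sum of pure cell-type expression, gene by gene."""
    n_genes = len(next(iter(pure.values())))
    return [
        sum(proportions[cell] * pure[cell][g] for cell in pure)
        for g in range(n_genes)
    ]

# Invented pure profiles: gene 0 is a T-cell marker, gene 1 a neutrophil marker.
pure = {"T_cell": [100.0, 5.0], "neutrophil": [5.0, 100.0]}

t_fractions = [0.1, 0.2, 0.4, 0.8]
log_marker = []
for f in t_fractions:
    mixed = mix(pure, {"T_cell": f, "neutrophil": 1.0 - f})
    log_marker.append(math.log2(mixed[0]))  # log-scale T-cell marker

# Slope of the log-scale marker per unit of true T-cell fraction.
slopes = [
    (log_marker[i + 1] - log_marker[i]) / (t_fractions[i + 1] - t_fractions[i])
    for i in range(len(t_fractions) - 1)
]
```

In this toy setting the log-scale marker rises monotonically with the true T-cell fraction, but the slope shrinks as the fraction grows, so a proportion surrogate built from log expression tracks the true proportion along a curve rather than a straight line.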
Fig. 4.
Increasing differential expression detection power in mixture datasets. Mixture datasets were simulated by combining pure cell expression in different proportions. For half of the 24 samples simulated, one pure cell expression vector was altered to have 10% DE genes. Cell-type origin of differential expression was varied to create a range of simulated datasets. Each resulting dataset was ranked for differential expression using different methods, and the number of genes identified with a false discovery rate of 0.1 is shown as a boxplot distribution over 20 repeats of the simulation. The CellCODE method, which uses only the data structure, outperforms methods that use known cell proportions. (Color version of this figure is available at Bioinformatics online.)
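Counting genes at a false discovery rate of 0.1 could look like the sketch below. The caption does not say which FDR procedure was used; the Benjamini-Hochberg step-up rule is assumed here purely for illustration, and the p-values are invented.

```python
def benjamini_hochberg(p_values, fdr=0.1):
    """Return sorted indices of hypotheses rejected by the BH step-up rule."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    # Find the largest rank k with p_(k) <= fdr * k / m.
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= fdr * rank / m:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below that cutoff.
    return sorted(order[:cutoff_rank])

# Invented per-gene p-values (illustration only).
p_values = [0.001, 0.008, 0.039, 0.041, 0.6, 0.9]
selected = benjamini_hochberg(p_values, fdr=0.1)
n_detected = len(selected)
```

Repeating this count over simulated datasets would give one value per repeat, which is the kind of quantity a boxplot over 20 repeats summarizes.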
Fig. 5.
Evaluating cell-type assignment methods using simulated data. Cell-type origin of differential expression is varied to create a range of simulated datasets. For each dataset, the set of DE genes is selected using the CellCODE approach (FDR 0.1) and is fixed for the subsequent analysis. These genes are assigned to the most likely cell type of origin using the different assignment methods. The fraction of correct assignments is plotted as a distribution boxplot for 20 independent repeats of the simulation. Methods that can accept rescaled covariates were evaluated using both the actual simulated cell proportions (‘Measured’, light colors) and the CellCODE SPVs (dark colors). Overall, we find that the F-test with CellCODE SPVs performs best. (Color version of this figure is available at Bioinformatics online.)
Fig. 6.
Vaccine administration induces global changes in cell-type proportions. SPVs were extracted using the CellCODE method and evaluated for vaccine-related changes by comparing a model that captures individual variation only against one that also includes post-vaccination day. The points and lines represent the median and interquartile range (IQR) of SPVs normalized to have mean 0 for each individual (which reduces variance without altering the trend). D0, D3 and D7 indicate day after vaccination. (Color version of this figure is available at Bioinformatics online.)
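The per-individual normalization and the median/IQR summary described in the caption can be sketched as follows. The subject labels and SPV values are invented, the helper names are hypothetical, and the quartile convention (Python's default `statistics.quantiles` method) is an assumption, since the caption does not specify one.

```python
from statistics import mean, median, quantiles

def center_per_individual(spv):
    """spv maps individual -> {day: value}; subtract each individual's mean."""
    centered = {}
    for person, by_day in spv.items():
        m = mean(by_day.values())
        centered[person] = {day: value - m for day, value in by_day.items()}
    return centered

def summarize_by_day(centered):
    """Median and interquartile range of centered values for each day."""
    days = sorted({day for by_day in centered.values() for day in by_day})
    summary = {}
    for day in days:
        values = [by_day[day] for by_day in centered.values() if day in by_day]
        q1, _, q3 = quantiles(values, n=4)
        summary[day] = {"median": median(values), "iqr": q3 - q1}
    return summary

# Invented SPV values for four subjects on days 0, 3 and 7.
spv = {
    "subj1": {"D0": 1.0, "D3": 2.0, "D7": 3.0},
    "subj2": {"D0": 4.0, "D3": 5.0, "D7": 6.0},
    "subj3": {"D0": 0.0, "D3": 2.0, "D7": 1.0},
    "subj4": {"D0": 10.0, "D3": 10.5, "D7": 11.0},
}
centered = center_per_individual(spv)
summary = summarize_by_day(centered)
```

Centering removes each subject's baseline level, so the large between-subject offsets in the toy data (for example `subj4`) no longer inflate the spread at each day while the within-subject trend across days is preserved.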
Fig. 7.
CAMK4 affects antibody response through a T-cell-dependent mechanism. (A) Expression of CAMK4 on Day 3 negatively correlates with an increase in influenza-specific antibody titers. (B) CAMK4 expression correlates strongly with T-cell SPV but only slightly with B-cell SPV. Correlation with other cell-type proportions is negative (data not shown), suggesting that CAMK4 is T-cell specific. Colors denote the fold increase in antibody titers and are the same as in panel (A). (Color version of this figure is available at Bioinformatics online.)
