Comparative Study
Nat Commun, 10 (1), 2975

Accurate Estimation of Cell-Type Composition From Gene Expression Data

Daphne Tsoucas et al. Nat Commun.

Abstract

The rapid development of single-cell transcriptomic technologies has helped uncover the cellular heterogeneity within cell populations. However, bulk RNA-seq continues to be the main workhorse for quantifying gene expression levels due to technical simplicity and low cost. To most effectively extract information from bulk data given the new knowledge gained from single-cell methods, we have developed a novel algorithm to estimate the cell-type composition of bulk data from a single-cell RNA-seq-derived cell-type signature. Comparison with existing methods using various real RNA-seq data sets indicates that our new approach is more accurate and comprehensive than previous methods, especially for the estimation of rare cell types. More importantly, our method can detect cell-type composition changes in response to external perturbations, thereby providing a valuable, cost-effective method for dissecting the cell-type-specific effects of drug treatments or condition changes. As such, our method is applicable to a wide range of biological and clinical investigations.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
A simple simulation shows the advantages of a weighted least squares method. (a) Relative percent error in estimation using unweighted and weighted least squares approaches, for each of three cell types across various proportions of cell type 1, the rare cell type. Because rare-cell-type-specific marker genes have greater influence on the weighted sum of squares error, the weighted least squares method estimates rare cell types more accurately than the unweighted method. (b) Relative percent error in estimation using unweighted and weighted least squares approaches, for each of three cell types across various ratios of mean gene expression level between marker genes of cell type 1 and marker genes of cell types 2 and 3. Because lowly expressed marker genes have greater influence on the weighted sum of squares error, the weighted least squares method estimates all cell types more accurately than the unweighted method.
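The weighting idea in this caption can be illustrated with a small sketch. It is not the authors' implementation: the signature matrix `S`, the proportions `x_true`, and all function names are hypothetical, and the weights are assumed to be inversely proportional to the squared predicted expression of each gene. The dampening step, the non-negativity constraint, and any sum-to-one constraint on the proportions are omitted.

```python
def solve_linear(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def predict(S, x):
    """Expected bulk expression of each gene: row-wise dot product of S with x."""
    return [sum(s * p for s, p in zip(row, x)) for row in S]


def weighted_least_squares(S, t, w):
    """Minimize sum_i w[i] * (t[i] - (S x)[i])**2 through the normal equations."""
    genes, k = range(len(S)), len(S[0])
    A = [[sum(w[i] * S[i][a] * S[i][b] for i in genes) for b in range(k)]
         for a in range(k)]
    rhs = [sum(w[i] * S[i][a] * t[i] for i in genes) for a in range(k)]
    return solve_linear(A, rhs)


def reweighted_fit(S, t, n_iter=10):
    """Start unweighted, then reweight each gene by 1 / (predicted expression)^2.

    Assumption: this weighting scheme is chosen for illustration; the floor of
    1e-8 only guards against division by zero and is not a dampening constant.
    """
    x = weighted_least_squares(S, t, [1.0] * len(t))
    for _ in range(n_iter):
        w = [1.0 / max(p, 1e-8) ** 2 for p in predict(S, x)]
        x = weighted_least_squares(S, t, w)
    return x


def error_shares(pred, rel_err, weighted):
    """Share of the total squared error per gene when every gene is off by
    the same relative error rel_err."""
    terms = [(rel_err * p) ** 2 / (p ** 2 if weighted else 1.0) for p in pred]
    total = sum(terms)
    return [term / total for term in terms]


# Hypothetical signature: rows are genes, columns are cell types 1-3.
# Genes 1-2 are lowly expressed markers of the rare cell type 1.
S = [
    [20.0, 1.0, 1.0],
    [15.0, 1.0, 1.0],
    [1.0, 500.0, 5.0],
    [1.0, 400.0, 5.0],
    [1.0, 5.0, 800.0],
    [1.0, 5.0, 600.0],
]
x_true = [0.02, 0.38, 0.60]
t = predict(S, x_true)  # noiseless bulk profile

print(reweighted_fit(S, t))
print(error_shares(t, 0.1, weighted=False))
print(error_shares(t, 0.1, weighted=True))
```

With the same relative error in every gene, the unweighted shares are dominated by the highly expressed genes, while the weighted shares are equal across genes, so the markers of the rare cell type are no longer drowned out.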
Fig. 2
Results from the deconvolution of 27 simulated bulk data sets. (a) The mean relative percent error in estimation for each cell type across 27 simulated data sets from donor, melanoma, and ovarian cancer patient immune and tumor cells, plotted against the average true proportion of the cell type, for each method: dampened weighted least squares (DWLS), quadratic programming (QP), and ν-support vector regression (ν-SVR). The fitted lines represent the trend in estimation accuracy as a function of cell-type proportion. (b) A subset of the deconvolution cell-type proportion estimates, plotted against the true cell-type proportions. Here, only the rarest cell types, dendritic and endothelial cells, are shown. Correlation values between true and estimated proportions are used to quantify estimation accuracy. The 45° line in each plot represents the optimal estimate. The top row shows all estimates, while the bottom row shows a zoomed-in version focused on only the rarest cell types.
Fig. 3
Deconvolution of eight normal mouse bulk data sets characterized by the MCA. (a) Results from the deconvolution of each bulk data set using a signature constructed from the mouse cell atlas (MCA), using three deconvolution methods: dampened weighted least squares (DWLS), quadratic programming (QP), and ν-support vector regression (ν-SVR). Estimates are plotted against an approximate true cell-type proportion as defined by the MCA data. Correlation values between true and estimated proportions are used to quantify estimation accuracy for each method. The 45° line in each plot represents the optimal estimate. The top row shows all estimates, while the bottom row shows a zoomed-in version focused on only the rarest cell types. (b) Another view of the kidney deconvolution estimates under each deconvolution method via a heatmap, where each box corresponds to a cell-type proportion estimate, and a darker color corresponds to a higher estimated proportion. Colors are shown on a log scale. (c) A summary of deconvolution results across all eight bulk samples, quantified by (1) correlation between true and estimated cell-type proportions for each tissue (left panel), (2) sensitivity of each deconvolution method (middle panel), and (3) specificity of each deconvolution method (right panel). The center line of the boxplot corresponds to the median value, while bounds of the boxplot correspond to the 25th and 75th percentiles. The upper whisker bound corresponds to the smaller of the maximum value and the 75th percentile plus 1.5 interquartile ranges; the lower whisker bound corresponds to the larger of the smallest value and the 25th percentile minus 1.5 interquartile ranges.
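The sensitivity and specificity summaries in panel c can be sketched as follows, under assumptions not stated in this caption: a cell type counts as present when its true proportion is above zero, and as detected when its estimated proportion exceeds a hypothetical `threshold`. The function name and example values are illustrative only.

```python
def sensitivity_specificity(true_props, est_props, threshold=0.0):
    """Return (sensitivity, specificity) for one deconvolved sample.

    Assumption: a cell type is 'present' if its true proportion is > 0 and
    'detected' if its estimated proportion is > threshold.
    """
    tp = fn = tn = fp = 0
    for truth, est in zip(true_props, est_props):
        present = truth > 0
        detected = est > threshold
        if present and detected:
            tp += 1
        elif present:
            fn += 1
        elif detected:
            fp += 1
        else:
            tn += 1
    # Undefined ratios (no present or no absent cell types) are returned as NaN.
    sens = tp / (tp + fn) if tp + fn else float("nan")
    spec = tn / (tn + fp) if tn + fp else float("nan")
    return sens, spec


# Illustrative values only: five cell types, two of them truly absent.
true_props = [0.5, 0.3, 0.2, 0.0, 0.0]
est_props = [0.45, 0.35, 0.0, 0.2, 0.0]
sens, spec = sensitivity_specificity(true_props, est_props, threshold=0.01)
print(sens, spec)
```

Here two of the three present cell types are detected and one of the two absent cell types is wrongly reported, giving a sensitivity of 2/3 and a specificity of 1/2.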
Fig. 4
Deconvolution estimates of bulk mouse ISC data across various conditions. The control condition corresponds to Lgr5-eGFP+ intestine cells 1.5 days post treatment with Ad-Fc, the loss of function (LOF) condition corresponds to Lgr5-eGFP+ intestine cells 1.5 days post treatment with Ad-LGR5-ECD, and the gain of function (GOF) condition corresponds to Lgr5-eGFP+ intestine cells 1.5 days post treatment with Ad-RSPO1. Each point corresponds to the deconvolution estimate of a cell type for a single bulk data set, for the dampened weighted least squares (DWLS), quadratic programming (QP), and ν-support vector regression (ν-SVR) deconvolution methods. Cell types include cycling and non-cycling intestinal stem cells (ISCs), transit amplifying (TA) cells, and various differentiated cell types. The center line of the boxplot corresponds to the median value, while bounds of the boxplot correspond to the 25th and 75th percentiles. The upper whisker bound corresponds to the smaller of the maximum value and the 75th percentile plus 1.5 interquartile ranges; the lower whisker bound corresponds to the larger of the smallest value and the 25th percentile minus 1.5 interquartile ranges.
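The boxplot conventions described in the captions for Figs. 3 and 4 can be written out as a short sketch. It follows the whisker definition literally as worded in the captions. The percentile rule (linear interpolation between order statistics) is an assumption, since the captions do not say which one was used, and the function names are hypothetical.

```python
def percentile(sorted_vals, p):
    """Percentile by linear interpolation between order statistics (assumed rule)."""
    pos = (len(sorted_vals) - 1) * p
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] + frac * (sorted_vals[hi] - sorted_vals[lo])


def boxplot_summary(values):
    """Median, quartiles, and whisker bounds as described in the captions."""
    vals = sorted(values)
    q1 = percentile(vals, 0.25)
    median = percentile(vals, 0.50)
    q3 = percentile(vals, 0.75)
    iqr = q3 - q1
    return {
        "lower_whisker": max(vals[0], q1 - 1.5 * iqr),   # larger of min and Q1 - 1.5 IQR
        "q1": q1,
        "median": median,
        "q3": q3,
        "upper_whisker": min(vals[-1], q3 + 1.5 * iqr),  # smaller of max and Q3 + 1.5 IQR
    }


# Illustrative values only; 100 lies beyond the upper whisker bound.
print(boxplot_summary([1, 2, 3, 4, 100]))
```

For the illustrative input, the quartiles are 2 and 4, so the upper whisker bound is capped at 4 + 1.5 × 2 = 7 rather than extending to 100, while the lower bound stays at the minimum value of 1.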
