BMC Bioinformatics. 2020 May 15;21(1):191.
doi: 10.1186/s12859-020-3538-2.

CIPR: A Web-Based R/shiny App and R Package to Annotate Cell Clusters in Single Cell RNA Sequencing Experiments

Free PMC article

H Atakan Ekiz et al. BMC Bioinformatics. 2020.

Abstract

Background: Single cell RNA sequencing (scRNAseq) has provided invaluable insights into cellular heterogeneity and functional states in health and disease. During the analysis of scRNAseq data, annotating the biological identity of cell clusters is an important step before downstream analyses and it remains technically challenging. The current solutions for annotating single cell clusters generally lack a graphical user interface, can be computationally intensive or have a limited scope. On the other hand, manually annotating single cell clusters by examining the expression of marker genes can be subjective and labor-intensive. To improve the quality and efficiency of annotating cell clusters in scRNAseq data, we present a web-based R/Shiny app and R package, Cluster Identity PRedictor (CIPR), which provides a graphical user interface to quickly score gene expression profiles of unknown cell clusters against mouse or human references, or a custom dataset provided by the user. CIPR can be easily integrated into the current pipelines to facilitate scRNAseq data analysis.

Results: CIPR employs multiple approaches for calculating the identity score at the cluster level and can accept inputs generated by popular scRNAseq analysis software. CIPR provides 2 mouse and 5 human reference datasets, and its pipeline allows inter-species comparisons and the upload of a custom reference dataset for specialized studies. The options to filter out lowly variable genes and to exclude irrelevant reference cell subsets from the analysis can improve the discriminatory power of CIPR, suggesting that it can be tailored to different experimental contexts. Benchmarking CIPR against existing functionally similar software revealed that our algorithm is less computationally demanding, performs significantly faster, and provides accurate predictions for multiple cell clusters in a scRNAseq experiment involving tumor-infiltrating immune cells.

Conclusions: CIPR facilitates scRNAseq data analysis by annotating unknown cell clusters in an objective and efficient manner. Platform independence owing to the Shiny framework and the requirement for only minimal programming experience allow this software to be used by researchers from different backgrounds. CIPR can accurately predict the identity of a variety of cell clusters and can be used in various experimental contexts across a broad spectrum of research areas.

Keywords: Cluster analysis; Gene expression profiling; Identity prediction; Immune cells; Similarity; Single cell RNA-sequencing.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
CIPR provides an R/Shiny-powered graphical user interface to facilitate cluster annotation in scRNAseq experiments. (a) T-distributed stochastic neighbor embedding (t-SNE) plot for the example scRNAseq data derived from murine melanoma tumor-infiltrating lymphocytes shows 15 distinct immune cell clusters within the tumor microenvironment (the dataset contains 13,985 features and 11,054 cells) [28]. To demonstrate the capabilities of CIPR, we focus on clusters 05 and 15, which distinctly expressed (b) natural killer (NK) cell and (c) plasmacytoid dendritic cell (pDC) markers, respectively. (d) We used the CIPR pipeline to score the gene expression profiles of cluster 15 (pDC) against the 296 mouse immune cell samples found in the ImmGen reference. The CIPR algorithm calculates a distinct identity score for each reference cell type and generates a graphical summary of the results. In these plots, the 4 highest data points (red rectangle) correspond to pDC samples within the ImmGen reference. The shaded regions in the graphs delineate 1 and 2 standard deviations around the mean identity score calculated from the entire reference data frame. Data points are color-coded based on the reference cell type, allowing an easy assessment of the results. (e) The CIPR results for cluster 05 (NK cells) are shown. Marked data points depict the NK cells in the ImmGen dataset that had the highest identity scores. Users can visualize graphs for each cluster separately and have the option of further manipulating the plots if the R package implementation of CIPR is used. (f) CIPR can also generate graphical outputs to summarize the 5 top-scoring reference samples for each experimental cluster. The scatter plot shows the pDC and NK cell subsets that had the highest scores for clusters 05 and 15. In the Shiny implementation of CIPR, users can draw rectangles around these points to prompt a table output that provides further information about the reference cell types on the graph.
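The summary statistics mentioned in this caption (the mean identity score, the 1 and 2 standard deviation bands, and the top-scoring reference samples) can be illustrated with a small sketch. It is written in Python for illustration only and is not the CIPR R code; the function name `summarize_identity_scores`, the use of the population standard deviation, and the toy scores are all assumptions.

```python
from statistics import mean, pstdev


def summarize_identity_scores(scores, top_n=5):
    """Summarize the identity scores computed for one cluster.

    scores: dict mapping a reference sample name to its identity score.
    """
    values = list(scores.values())
    mu = mean(values)
    sd = pstdev(values)  # population SD; an assumption made for this sketch
    # z-score: distance of each reference sample from the mean score, in SDs
    z_scores = {
        name: (score - mu) / sd if sd > 0 else 0.0
        for name, score in scores.items()
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    return {
        "mean": mu,
        "sd": sd,
        "band_1sd": (mu - sd, mu + sd),          # inner shaded region
        "band_2sd": (mu - 2 * sd, mu + 2 * sd),  # outer shaded region
        "top": [(name, scores[name], z_scores[name]) for name in ranked[:top_n]],
    }


# Hypothetical toy scores, not values from the paper.
toy_scores = {"pDC_1": 10.0, "B_1": 0.0, "T_1": 2.0, "NK_1": 0.0}
summary = summarize_identity_scores(toy_scores, top_n=2)
print(summary["top"])
```

A reference sample whose score falls well outside the 2 standard deviation band, as the pDC samples do for cluster 15 in panel d, stands out as a candidate identity for the cluster.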
Fig. 2
Different analytical methods implemented in CIPR perform comparably to annotate single cell clusters. Three of the analytical methods in CIPR (logFC dot product, logFC Spearman’s and logFC Pearson’s correlation) utilize only the genes differentially expressed in clusters. The recommended approach in CIPR is the logFC dot product method, since it takes both the direction and the amount of differential expression into account when calculating identity scores per cluster. The other approaches in CIPR are designed to analyze the expression profiles of all the genes in the experimental data regardless of their differential expression status. This figure compares the predictions of the logFC dot product method to those of the other analytical approaches in CIPR. Data points in the scatter plots indicate the identity scores of individual ImmGen reference cell subsets calculated for clusters 05 and 15 by different methods. As expected, there is a strong correlation between the results of the logFC dot product method and the (a) logFC Spearman’s and (b) logFC Pearson’s correlation methods for both clusters. (c, d) The same strong correlation was observed when the z-scores were compared for these methods, although logFC dot product differentiated the highest-scoring reference subsets slightly better, as evidenced by a higher z-score. The results of the (e) all-genes Spearman’s and (f) all-genes Pearson’s methods show an overall positive correlation with those from the logFC dot product method, although the logFC dot product approach was able to better differentiate the top-scoring reference subsets, as evidenced by the higher z-scores shown in panels g and h. Similar observations were made for other clusters in the experimental dataset but are not shown due to space constraints.
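To make the difference between the scoring approaches concrete, the following Python sketch scores one cluster against two reference profiles with a dot product of logFC values and with Pearson and Spearman correlation over the shared genes. It is an illustration under stated assumptions, not the CIPR implementation: the function names, the dictionary-based input format, and the logFC values are hypothetical, and the reference logFC values are assumed to be precomputed.

```python
from math import sqrt


def logfc_dot_product(cluster_logfc, reference_logfc):
    """Sum of products of logFC values over genes present in both profiles."""
    shared = cluster_logfc.keys() & reference_logfc.keys()
    return sum(cluster_logfc[g] * reference_logfc[g] for g in shared)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / sqrt(vx * vy)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def logfc_correlation(cluster_logfc, reference_logfc, method=pearson):
    """Correlate logFC values over the genes shared by both profiles."""
    genes = sorted(cluster_logfc.keys() & reference_logfc.keys())
    return method([cluster_logfc[g] for g in genes],
                  [reference_logfc[g] for g in genes])


# Hypothetical logFC values for an NK-like cluster and two reference profiles.
cluster = {"Klrb1c": 2.0, "Ncr1": 1.5, "Cd19": -1.0}
ref_nk = {"Klrb1c": 3.0, "Ncr1": 2.0, "Cd19": -2.0, "Cd3e": -1.0}
ref_b = {"Klrb1c": -2.0, "Ncr1": -1.0, "Cd19": 4.0}

print(logfc_dot_product(cluster, ref_nk), logfc_dot_product(cluster, ref_b))
print(logfc_correlation(cluster, ref_nk, spearman))
```

In this toy case the dot product rewards agreement in both sign and magnitude, whereas the correlation coefficients are bounded between -1 and 1 and do not grow with the size of the fold changes.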
Fig. 3
CIPR performs faster than other cluster analysis approaches and produces comparable results. SingleR and scmap are recently described R packages for automated cluster analysis which can perform analyses at the cluster level, similarly to the CIPR approach. These algorithms were shown to perform well in various experimental contexts and can serve as a high benchmark for automated cluster analysis solutions. By performing all the analyses at the cluster level, here we report a comparison of the CIPR R package (v.0.1.0), SingleR (v1.0.5) and scmap (v1.8.0) in terms of predictions and performance. For these comparisons, a Surface Pro4 computer equipped with 64-bit Win7, 16 GB memory, 2.2GHz i7-6650U CPU, R (v.3.6.2), and RStudio (v.1.2.5033) was used with no other background processes. (a) Five analytical methods implemented in CIPR were compared to SingleR and scmap across 5 individual clusters. Data points indicate the identity scores calculated for each ImmGen reference cell subset by different methods. The color gradient specifies the identity score calculated by the scmap method (gray indicates that no significant mappings were found). As expected, CIPR’s all-genes Spearman’s/Pearson’s methods are highly concordant with the SingleR pipeline. The results from the CIPR logFC methods show an overall positive correlation with SingleR, where the highest-scoring reference cell types in CIPR were similar to those calculated by SingleR and scmap. In some cases, scmap failed to find a significant association, which may be due to its suboptimal power when bulk reference data are used as input. (b) CIPR performs significantly faster than SingleR, and comparably to scmap, in 5 separate tests. We benchmarked the runtime of the SingleR function both with and without the fine-tuning feature. Scmap (short) measures the runtime of the scmapcluster computational engine, whereas scmap (long) measures the runtime starting with the initial object creation. (c) CIPR utilizes less computer memory over time compared to (d) SingleR (no fine tuning) and (e) scmap.
Fig. 4
CIPR allows users to limit the analysis to highly variable reference genes to improve cluster annotations. As genes with variable expression profiles contain more information to discriminate cell types, we implemented a variance filtering parameter in CIPR. The user-defined variance threshold parameter instructs the algorithm to utilize the genes with variances above a certain quantile across the reference dataset, thus limiting the analysis to highly variable genes. Plots compare the CIPR results with or without variance thresholding when the all-genes Spearman’s method is used. Identity scores and z-scores were calculated for clusters 05 (NK cells) and 15 (pDCs) using the ImmGen reference, and results for individual reference sample types are plotted as color-coded data points. Applying variance thresholding and increasing its stringency from the top 10% to the top 1% reduced the identity scores of low- and intermediate-scoring reference cell subsets, while the highest-scoring reference cell subsets remained unaffected, as evidenced by data points overlapping with the y = x line for (a) cluster 05 and (b) cluster 15. Similar trends were observed for other clusters in the analysis (not shown). The differential impact on the identity scores of high- and low-scoring reference cell subsets led to an increased z-score for the highest-scoring reference subsets for both (c) cluster 05 and (d) cluster 15. These findings suggest that variance thresholding can improve the discrimination of some reference cell subsets. Although the best thresholding value remains to be determined in individual studies, the CIPR pipeline allows a level of flexibility to be adapted to different experimental contexts.
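The variance filter described in this caption can be sketched as follows. This is a Python illustration under stated assumptions rather than the CIPR code: the names `top_variable_genes` and `restrict_to_genes` are hypothetical, the threshold is expressed as the fraction of genes to keep (0.10 for the top 10%) instead of a quantile parameter, population variance is used, and the expression values are toy numbers.

```python
from statistics import pvariance


def top_variable_genes(reference, keep_top_fraction):
    """Return the genes whose variance across reference samples is highest.

    reference: dict mapping a gene name to its expression values across
               the reference samples.
    keep_top_fraction: fraction of genes to keep, e.g. 0.10 for the top 10%.
    """
    if not 0 < keep_top_fraction <= 1:
        raise ValueError("keep_top_fraction must be in (0, 1]")
    variances = {gene: pvariance(values) for gene, values in reference.items()}
    ordered = sorted(variances, key=variances.get, reverse=True)
    # Keep at least one gene so downstream scoring never receives an empty set.
    n_keep = max(1, int(len(ordered) * keep_top_fraction))
    return set(ordered[:n_keep])


def restrict_to_genes(profile, genes):
    """Drop every gene from an expression profile that is not in `genes`."""
    return {g: v for g, v in profile.items() if g in genes}


# Hypothetical reference matrix: 4 genes measured in 4 reference samples.
toy_reference = {
    "GeneA": [1, 1, 1, 1],    # constant, carries no discriminating information
    "GeneB": [0, 10, 0, 10],  # highly variable
    "GeneC": [2, 3, 2, 3],
    "GeneD": [5, 5, 6, 5],
}
kept = top_variable_genes(toy_reference, keep_top_fraction=0.5)
toy_cluster = {"GeneA": 1.0, "GeneB": 9.0, "GeneC": 2.5, "GeneD": 5.0}
print(sorted(kept), restrict_to_genes(toy_cluster, kept))
```

Scoring would then proceed on the restricted profiles only, so genes that barely vary across the reference no longer contribute to the correlation.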
Fig. 5
Irrelevant reference subsets can be excluded to tailor the CIPR pipeline to different user needs. The CIPR pipeline allows users to easily exclude the reference subsets that are of no interest for the study at hand. Limiting the analysis to the relevant reference subsets can increase the readability of the graphical outputs and may better differentiate closely related single cell clusters. To demonstrate this capability, we subsetted the scRNAseq dataset described in Fig. 1 to contain only T cells (as defined by the simultaneous expression of Cd3e and Cd4 or Cd8a marker genes). We then performed CIPR analyses with or without limiting the pipeline to T cell references within the ImmGen dataset. (a) Uniform manifold approximation and projection (UMAP) plot with 6 distinct single-cell clusters shows the heterogeneity within the T cell subsets in the tumor microenvironment. (b) Representative feature plots indicate that the clusters are composed of Cd4+ helper and Cd8a+ cytotoxic T cells, some of which exhibited an activated phenotype (Ifng+ cells) while others appeared to have a naïve-memory phenotype (Sell+ cells). Of note, cluster 06 is composed of Foxp3+ regulatory T cells (Tregs). (c) CIPR analysis using the logFC dot product method shows that the highest-scoring reference subsets for cluster 06 are regulatory T cell subsets within the ImmGen reference data. (d) Graphs show that identity scores calculated by CIPR, SingleR and scmap are positively correlated for both cluster 01 (activated Cd8a+ cells) and cluster 06 (Tregs). For these analyses, the entire ImmGen reference data (296 samples spanning 20 different cell types) were used, and the calculations were performed at the cluster level as described above. (e) The positive correlation between different analytical approaches was stronger when the reference dataset was limited to T cell subsets (70 samples in ImmGen data). In general, the highest-scoring reference cell subsets in CIPR also scored the highest in the scmap and SingleR methods.
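Restricting the reference to relevant subsets amounts to filtering the reference samples by their cell type label before scoring. The Python sketch below illustrates that step only; the function name `select_references`, the record layout, and the sample names and labels are hypothetical and do not reflect the CIPR interface or the actual ImmGen annotations.

```python
def select_references(samples, keep_types=None, exclude_types=None):
    """Filter reference samples by cell type label.

    samples: list of dicts with at least a "name" and a "cell_type" key.
    keep_types: if given, only samples with these labels are retained.
    exclude_types: if given, samples with these labels are dropped.
    """
    selected = []
    for sample in samples:
        if keep_types is not None and sample["cell_type"] not in keep_types:
            continue
        if exclude_types is not None and sample["cell_type"] in exclude_types:
            continue
        selected.append(sample)
    return selected


# Hypothetical reference annotation; names and labels are illustrative only.
toy_samples = [
    {"name": "ref_01", "cell_type": "T cell"},
    {"name": "ref_02", "cell_type": "T cell"},
    {"name": "ref_03", "cell_type": "B cell"},
    {"name": "ref_04", "cell_type": "NK cell"},
    {"name": "ref_05", "cell_type": "Macrophage"},
]

t_cell_refs = select_references(toy_samples, keep_types={"T cell"})
no_myeloid = select_references(toy_samples, exclude_types={"Macrophage"})
print([s["name"] for s in t_cell_refs])
```

Because summary statistics such as the mean and standard deviation of the identity scores are computed over whichever reference samples remain, narrowing the reference also changes the z-scores, which is consistent with the stronger agreement reported in panel e.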

References

    1. Wang Y, Navin NE. Advances and applications of single-cell sequencing technologies. Mol Cell. 2015;58(4):598–609.
    2. Hwang B, Lee JH, Bang D. Single-cell RNA sequencing technologies and bioinformatics pipelines. Exp Mol Med. 2018;50(8):96.
    3. Lopez R, Regier J, Cole MB, Jordan MI, Yosef N. Deep generative modeling for single-cell transcriptomics. Nat Methods. 2018;15(12):1053–1058.
    4. Zhang Z, Luo D, Zhong X, Choi JH, Ma Y, Wang S, et al. SCINA: A Semi-Supervised Subtyping Algorithm of Single Cells and Bulk Samples. Genes (Basel). 2019;10(7).
    5. Domanskyi S, Szedlak A, Hawkins NT, Wang J, Paternostro G, Piermarocchi C. Polled digital cell sorter (p-DCS): automatic identification of hematological cell types from single cell RNA-sequencing clusters. BMC Bioinformatics. 2019;20(1):369.