Skip to main page content
Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2020 Feb 12;21(1):36.
doi: 10.1186/s13059-020-1949-z.

Robustness and Applicability of Transcription Factor and Pathway Analysis Tools on Single-Cell RNA-seq Data

Affiliations
Free PMC article

Robustness and Applicability of Transcription Factor and Pathway Analysis Tools on Single-Cell RNA-seq Data

Christian H Holland et al. Genome Biol. .
Free PMC article

Abstract

Background: Many functional analysis tools have been developed to extract functional and mechanistic insight from bulk transcriptome data. With the advent of single-cell RNA sequencing (scRNA-seq), it is in principle possible to do such an analysis for single cells. However, scRNA-seq data has characteristics such as drop-out events and low library sizes. It is thus not clear if functional TF and pathway analysis tools established for bulk sequencing can be applied to scRNA-seq in a meaningful way.

Results: To address this question, we perform benchmark studies on simulated and real scRNA-seq data. We include the bulk-RNA tools PROGENy, GO enrichment, and DoRothEA that estimate pathway and transcription factor (TF) activities, respectively, and compare them against the tools SCENIC/AUCell and metaVIPER, designed for scRNA-seq. For the in silico study, we simulate single cells from TF/pathway perturbation bulk RNA-seq experiments. We complement the simulated data with real scRNA-seq data upon CRISPR-mediated knock-out. Our benchmarks on simulated and real data reveal comparable performance to the original bulk data. Additionally, we show that the TF and pathway activities preserve cell type-specific variability by analyzing a mixture sample sequenced with 13 scRNA-seq protocols. We also provide the benchmark data for further use by the community.

Conclusions: Our analyses suggest that bulk-based functional analysis tools that use manually curated footprint gene sets can be applied to scRNA-seq data, partially outperforming dedicated single-cell tools. Furthermore, we find that the performance of functional analysis tools is more sensitive to the gene sets than to the statistic used.

Keywords: Benchmark; Functional analysis; Pathway analysis; Transcription factor analysis; scRNA-seq.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
Fig. 1
Testing the robustness of DoRothEA (AB), PROGENy, and GO-GSEA against low gene coverage. a DoRothEA (AB) performance (area under ROC curve, AUROC) versus gene coverage. b PROGENy performance (AUROC) for different number of footprint genes per pathway versus gene coverage. c Performance (AUROC) of GO-GSEA versus gene coverage. The dashed line indicates the performance of a random model. The colors in a and c are meant only as a visual support to distinguish between the individual violin plots and jittered points
Fig. 2
Fig. 2
Benchmark results of TF and pathway analysis tools on simulated scRNA-seq data. a Simulation strategy of single cells from an RNA-seq bulk sample. b Example workflow of DoRothEA’s performance evaluation on simulated single cells for a specific parameter combination (number of cells = 10, mean library size = 5000). 1. Step: ROC-curves of DoRothEA’s performance on single cells (25 replicates) and on bulk data including only TFs with confidence level A. 2. Step: DoRothEA performance on single cells and bulk data summarized as AUROC vs TF coverage. TF coverage denotes the number of distinct perturbed TFs in the benchmark dataset that are also covered by the gene set resource (see Additional file 1: Figure S3a) Results are provided for different combinations of DoRothEA’s confidence levels (A, B, C, D, E). Error bars of AUROC values depict the standard deviation and correspond to different simulation replicates. Step 3: Averaged difference across all confidence level combinations between AUROC of single cells and bulk data for all possible parameter combinations. The letters within the tiles indicates which confidence level combination performs the best on single cells. The tile marked in red corresponds to the parameter setting used for previous plots (Steps 1 and 2). c D-AUCell and d metaVIPER performance on simulated single cells summarized as AUROC for a specific parameter combination (number of cells = 10, mean library size = 5000) and corresponding bulk data vs TF coverage. e, f Performance results of e PROGENy and f P-AUCell on simulated single cells for a specific parameter combination (number of cells = 10, mean library size = 5000) and corresponding bulk data in ROC space vs number of footprint genes per pathway. cf Plots revealing the change in performance for all possible parameter combinations (Step 3) are available in Additional file 1: Figure S7. bf The dashed line indicates the performance of a random model
Fig. 3
Fig. 3
Benchmark results of TF analysis tools on real scRNA-seq data. a Performance of DoRothEA, D-AUCell, metaVIPER, and SCENIC on all sub benchmark datasets in ROC space vs TF coverage. b Performance of DoRothEA, D-AUCell, and metaVIPER on all sub benchmark datasets in ROC vs TF coverage split up by combinations of DoRothEA’s confidence levels (A-E). a, b In both panels, the results for each tool are based on the same but for the respective panel different set of (shared) TFs. TF coverage reflects the number of distinct perturbed TFs in the benchmark data set that are also covered by the gene sets
Fig. 4
Fig. 4
Application of TF and pathway analysis tools on a representative scRNA-seq dataset of PBMCs and HEK cells. a Dendrogram showing how cell lines/cell types are clustered together based on different hierarchy levels. The dashed line marks the hierarchy level 2, where CD4 T cells, CD8 T cells, and NK cells are aggregated into a single cluster. Similarly, CD14+ monocytes, FCGR3A+ monocytes, and dendritic cells are also aggregated to a single cluster. The B cells and HEK cells are represented by separate, pure clusters. b, d Comparison of cluster purity (clusters are defined by hierarchy level 2) between the top 2000 highly variable genes and b TF activity and TF expression and d pathway activities. The dashed line in b separates SCENIC as it is not directly comparable to the other TF analysis tools and controls due to a different number of considered TFs. c UMAP plots of TF activities calculated with DoRothEA and corresponding TF expression measured by SMART-Seq2 protocol. e Heatmap of selected TF activities inferred with DoRothEA from gene expression data generated via Quartz-Seq2

Similar articles

See all similar articles

References

    1. Essaghir A, Toffalini F, Knoops L, Kallin A, van Helden J, Demoulin J-B. Transcription factor regulation can be accurately predicted from the presence of target gene signatures in microarray gene expression data. Nucleic Acids Res. 2010;38:e120. doi: 10.1093/nar/gkq149. - DOI - PMC - PubMed
    1. Hung J-H, Yang T-H, Hu Z, Weng Z, DeLisi C. Gene set enrichment analysis: performance evaluation and usage guidelines. Brief Bioinform. 2012;13:281–291. doi: 10.1093/bib/bbr049. - DOI - PMC - PubMed
    1. Khatri P, Sirota M, Butte AJ. Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Comput Biol. 2012;8:e1002375. doi: 10.1371/journal.pcbi.1002375. - DOI - PMC - PubMed
    1. Nguyen T-M, Shafi A, Nguyen T, Draghici S. Identifying significantly impacted pathways: a comprehensive review and assessment. Genome Biol. 2019;20:203. doi: 10.1186/s13059-019-1790-4. - DOI - PMC - PubMed
    1. Liberzon A, Birger C, Thorvaldsdottir H, Ghandi M, Mesirov JP, Tamayo P. The molecular signatures database (MSigDB) hallmark gene set collection. Cell Syst. 2015;1(6):417–425. doi: 10.1016/j.cels.2015.12.004. - DOI - PMC - PubMed

Publication types

LinkOut - more resources

Feedback