PeerJ. 2017 Jan 19:5:e2888.
doi: 10.7717/peerj.2888. eCollection 2017.

Detecting heterogeneity in single-cell RNA-Seq data by non-negative matrix factorization

Xun Zhu et al. PeerJ.

Abstract

Single-cell RNA-Sequencing (scRNA-Seq) is a fast-evolving technology that enables the understanding of biological processes at an unprecedentedly high resolution. However, well-suited bioinformatics tools to analyze the data generated by this new technology are still lacking. Here we investigate the performance of the non-negative matrix factorization (NMF) method on a wide variety of scRNA-Seq datasets, ranging from mouse hematopoietic stem cells to human glioblastoma data. Compared with other unsupervised clustering methods, including K-means and hierarchical clustering, NMF has higher accuracy in separating similar groups in various datasets. We ranked genes by their importance scores (D-scores) in separating these groups, and discovered that NMF uniquely identifies genes expressed at intermediate levels as top-ranked genes. Finally, we show that in conjunction with the modularity detection method FEM, NMF reveals meaningful protein-protein interaction modules. In summary, we propose that NMF is a desirable method to analyze heterogeneous single-cell RNA-Seq data. The NMF-based subpopulation detection package is available at: https://github.com/lanagarmire/NMFEM.
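To illustrate the kind of clustering the abstract refers to, the following is a minimal sketch of NMF with multiplicative updates on a toy genes-by-cells matrix. It is not the NMFEM package's code: the function names (`nmf`, `assign_clusters`), the toy matrix, the iteration count, and the choice to assign each cell to its dominant factor after rescaling are all assumptions made for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def nmf(v, k, n_iter=300, seed=0, eps=1e-9):
    """Approximate a non-negative genes x cells matrix v as w (genes x k) times h (k x cells)."""
    rng = random.Random(seed)
    n_genes, n_cells = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(k)] for _ in range(n_genes)]
    h = [[rng.random() + eps for _ in range(n_cells)] for _ in range(k)]
    for _ in range(n_iter):
        # Multiplicative update for h: h <- h * (w^T v) / (w^T w h)
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_cells)]
             for i in range(k)]
        # Multiplicative update for w: w <- w * (v h^T) / (w h h^T)
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n_genes)]
    return w, h


def assign_clusters(w, h):
    """Assign each cell to the factor contributing most to it.

    Each row of h is rescaled by the column sum of w so that the comparison
    does not depend on the arbitrary scaling between w and h (an assumption
    of this sketch, not a step stated in the source).
    """
    k, n_cells = len(h), len(h[0])
    scale = [sum(row[r] for row in w) for r in range(k)]
    return [max(range(k), key=lambda r: h[r][c] * scale[r]) for c in range(n_cells)]


# Toy data (hypothetical): 4 genes x 6 cells, two groups of three cells each.
toy_expression = [
    [5, 6, 5, 0, 0, 1],
    [4, 5, 6, 1, 0, 0],
    [0, 1, 0, 6, 5, 4],
    [1, 0, 0, 5, 6, 5],
]
w, h = nmf(toy_expression, k=2)
labels = assign_clusters(w, h)
print(labels)
```

The cluster numbering is arbitrary; only the grouping of cells matters. On real data the number of factors `k` would need to be chosen, which this sketch does not address.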

Keywords: Clustering; Feature gene; Heterogeneity; Modularity; Non-negative matrix factorization; RNA-Seq; Single cell; Single cell sequencing; Single-cell; Subpopulation.

Conflict of interest statement

The authors declare there are no competing interests.

Figures

Figure 1. The workflow of NMFEM.
The input can be either FASTQ files or a raw counts table. If FASTQ files are used, they are aligned using TopHat and counted using FeatureCounts (steps shown in brackets). The input or calculated raw counts table is filtered by samples and genes, converted into FPKMs using gene lengths, and normalized by samples. We then run the NMF method on it to detect groups of cells, and find the feature genes separating the detected groups. Finally, we feed the feature genes as seed genes into FEM, and generate PPI gene modules that contain highly differentially expressed genes.
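The count-to-FPKM step in this workflow can be sketched as follows, using the standard FPKM definition (fragments per kilobase of transcript per million mapped fragments). The function name `counts_to_fpkm` and the toy numbers are hypothetical, each cell's total is taken over the genes present in the table, and the filtering thresholds and the per-sample normalization are not specified in the caption, so they are left out.

```python
def counts_to_fpkm(counts, gene_lengths_bp):
    """Convert a genes x cells raw count table to FPKM.

    counts: list of rows, one per gene, each a list of counts per cell.
    gene_lengths_bp: gene lengths in base pairs, in the same gene order.
    FPKM = count * 1e9 / (gene length in bp * total counts in the cell).
    """
    n_cells = len(counts[0])
    # Total counts per cell (column sums), used as the library size.
    totals = [sum(row[c] for row in counts) for c in range(n_cells)]
    fpkm = []
    for row, length in zip(counts, gene_lengths_bp):
        fpkm.append([
            count * 1e9 / (length * total) if total > 0 else 0.0
            for count, total in zip(row, totals)
        ])
    return fpkm


# Toy example (hypothetical values): 3 genes x 2 cells.
raw_counts = [
    [10, 0],
    [30, 20],
    [60, 80],
]
lengths = [1000, 2000, 3000]
fpkm_table = counts_to_fpkm(raw_counts, lengths)
print(fpkm_table[0][0])
```

Cells with zero total counts are given FPKM values of 0 here to avoid division by zero; in practice such cells would normally be removed by the sample filter.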
Figure 2. Rand measures comparison of all methods on five datasets.
(A) Mouse embryonic lung epithelial E14.5 vs. E16.5 (B) HSC vs. MPP1 (C) Glioblastoma MGH29 vs. MGH31 (D) Bone marrow dendritic cells (CDP vs. MDP) (E) Human induced pluripotent stem cell (iPSC) lines with UMI counts. The Rand measure ranges from 0 to 1, where a higher value indicates greater clustering accuracy. The error bars show the standard deviation across 30 runs. Results significantly worse than NMF without tSNE by Welch's t-test are marked by asterisks. For datasets with more than two groups of cells, the closest pair is selected.
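As a reference for how such a score behaves, here is a minimal sketch of the Rand measure, assuming it is the unadjusted Rand index (consistent with the stated 0 to 1 range): the fraction of cell pairs on which the predicted clustering and the known grouping agree. The function name `rand_index` and the example labels are hypothetical.

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Fraction of item pairs that both labelings treat the same way.

    A pair agrees if both labelings put the two items together,
    or both put them in different groups.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    pairs = list(combinations(range(len(labels_true)), 2))
    if not pairs:
        raise ValueError("need at least two items")
    agree = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agree / len(pairs)


known = [0, 0, 0, 1, 1, 1]
# Cluster names are arbitrary, so a relabeled perfect clustering scores 1.0.
print(rand_index(known, [1, 1, 1, 0, 0, 0]))  # 1.0
# One misplaced cell lowers the score.
print(rand_index(known, [0, 0, 1, 1, 1, 1]))
```

Because the measure compares pairs rather than label values, it does not require the cluster numbering to match the known group names.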
Figure 3. MA-plots of significant or important genes identified by different methods.
Shown are scRNA-Seq data from the mouse lung distal epithelial cell E14.5 vs. E16.5 samples. The blue color highlights the genes selected as “the most significant” by the corresponding methods. The x-axis (A-value) is the mean gene expression, and the y-axis (M-value) is the difference in gene expression between the E16.5 and E14.5 stages.
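The A- and M-values for one gene can be sketched as below. The caption does not state the scale, so this sketch assumes log2-transformed expression with a pseudocount of 1, as is common for MA-plots; the function name `ma_values` and the example values are hypothetical.

```python
import math


def ma_values(expr_early, expr_late, pseudocount=1.0):
    """Return (A, M) for one gene from two groups of expression values.

    A is the average of the two group means on the log2 scale, and
    M is the late-group mean minus the early-group mean (E16.5 minus E14.5
    in the caption's setting). The log2 scale and the pseudocount are
    assumptions of this sketch.
    """
    mean_early = sum(math.log2(x + pseudocount) for x in expr_early) / len(expr_early)
    mean_late = sum(math.log2(x + pseudocount) for x in expr_late) / len(expr_late)
    a_value = (mean_early + mean_late) / 2
    m_value = mean_late - mean_early
    return a_value, m_value


# Hypothetical gene: expression 3 in both E14.5 cells, 15 in both E16.5 cells.
a, m = ma_values([3, 3], [15, 15])
print(a, m)  # 3.0 2.0
```

A positive M-value means higher expression in the later stage, and genes at intermediate expression levels sit in the middle of the A-axis.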
Figure 4. Network of top 5 modules using the seed genes generated by NMF.
Shown are module detection results from the FEM package, using the top 500 most important genes detected by NMF in Fig. 3. scRNA-Seq data from the mouse lung distal epithelial cell E14.5 vs. E16.5 samples are compared, where the red and blue colors indicate up- and down-regulation of genes in E16.5 relative to E14.5, respectively. The top five modules are selected by the p-values calculated by the internal Monte Carlo method in the FEM package.
