Benchmarking principal component analysis for large-scale single-cell RNA-sequencing

Koki Tsuyuzaki; Hiroyuki Sato; Kenta Sato; Itoshi Nikaido

doi:10.1186/s13059-019-1900-3

Genome Biol. 2020 Jan 20;21(1):9. doi: 10.1186/s13059-019-1900-3.

Authors

Koki Tsuyuzaki¹ ², Hiroyuki Sato³, Kenta Sato⁴ ⁵, Itoshi Nikaido⁶ ⁷

Affiliations

¹ Laboratory for Bioinformatics Research, RIKEN Center for Biosystems Dynamics Research, Wako, Saitama, 351-0198, Japan. koki.tsuyuzaki@gmail.com.
² Japan Science and Technology Agency, PRESTO, 5-3, Yonbancho, Chiyoda-ku, Tokyo, 102-8666, Japan. koki.tsuyuzaki@gmail.com.
³ Department of Applied Mathematics and Physics, Graduate School of Informatics, Kyoto University, Yoshida-honmachi, Sakyo-ku, Kyoto, 606-8501, Japan.
⁴ Laboratory for Bioinformatics Research, RIKEN Center for Biosystems Dynamics Research, Wako, Saitama, 351-0198, Japan.
⁵ Department of Biotechnology, Graduate School of Agricultural and Life Sciences, The University of Tokyo, Bunkyo-ku, Tokyo, 113-8657, Japan.
⁶ Laboratory for Bioinformatics Research, RIKEN Center for Biosystems Dynamics Research, Wako, Saitama, 351-0198, Japan. itoshi.nikaido@riken.jp.
⁷ Bioinformatics Course, Master's/Doctoral Program in Life Science Innovation (T-LSI), School of Integrative and Global Majors (SIGMA), University of Tsukuba, 1-1-1, Tennodai, Tsukuba, Ibaraki, 305-8577, Japan. itoshi.nikaido@riken.jp.

Abstract

Background: Principal component analysis (PCA) is an essential method for analyzing single-cell RNA-seq (scRNA-seq) datasets, but for large-scale scRNA-seq datasets, PCA takes a long time to compute and consumes large amounts of memory.

Results: In this work, we review the existing fast and memory-efficient PCA algorithms and implementations and evaluate their practical application to large-scale scRNA-seq datasets. Our benchmark shows that some PCA algorithms based on Krylov subspace and randomized singular value decomposition are fast, memory-efficient, and more accurate than the other algorithms.
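As an illustration of the randomized singular value decomposition family mentioned above, the following is a minimal sketch of randomized PCA with power iterations. It is not one of the implementations benchmarked in this work. The function name `randomized_pca` and its parameters (`n_oversamples`, `n_iter`, `seed`) are hypothetical, and the sketch assumes a dense cells-by-genes matrix that fits in memory, so it does not show the sparse or out-of-core handling that large-scale scRNA-seq data would need.

```python
import numpy as np


def randomized_pca(X, k, n_oversamples=10, n_iter=4, seed=0):
    """Approximate the top-k principal components of X (cells x genes).

    Illustrative only: dense, in-memory, hypothetical parameter names.
    """
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)                 # center each gene (column)
    n, d = Xc.shape
    width = min(k + n_oversamples, n, d)    # sketch width with oversampling

    # Random projection captures the dominant column space of Xc.
    Q, _ = np.linalg.qr(Xc @ rng.standard_normal((d, width)))

    # Power iterations sharpen the subspace; QR keeps it well conditioned.
    for _ in range(n_iter):
        Q, _ = np.linalg.qr(Xc.T @ Q)
        Q, _ = np.linalg.qr(Xc @ Q)

    # Exact SVD of the small projected matrix (width x genes).
    B = Q.T @ Xc
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)

    scores = (Q @ Ub)[:, :k] * s[:k]        # cell coordinates on the PCs
    return scores, Vt[:k], s[:k]            # scores, loadings, singular values


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Synthetic low-rank "expression" matrix: 200 cells x 50 genes.
    X = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 50))
    X += 0.01 * rng.standard_normal((200, 50))
    scores, loadings, sing_vals = randomized_pca(X, k=5)
    print(scores.shape, loadings.shape)
```

The cost is dominated by a few products between the data matrix and a thin matrix of `k + n_oversamples` columns, which is why this family of methods tends to scale to large matrices; accuracy depends on the number of power iterations and on how quickly the singular values decay.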

Conclusion: We develop a guideline to select an appropriate PCA implementation based on the differences in the computational environment of users and developers.

Keywords: Cellular heterogeneity; Dimension reduction; Julia; Online/incremental algorithm; Out-of-core; Principal component analysis; Python; R; Randomized algorithm; Single-cell RNA-seq; Sparse data format.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms
Benchmarking
Principal Component Analysis*
RNA-Seq / methods*
Single-Cell Analysis / methods*