Model-based autoencoders for imputing discrete single-cell RNA-seq data

Tian Tian; Martin Renqiang Min; Zhi Wei

doi:10.1016/j.ymeth.2020.09.010

Methods. 2021 Aug:192:112-119. doi: 10.1016/j.ymeth.2020.09.010. Epub 2020 Sep 22.

Authors

Tian Tian¹, Martin Renqiang Min², Zhi Wei³

Affiliations

¹ Department of Computer Science, New Jersey Institute of Technology, Newark, NJ 07102, United States. Electronic address: tt72@njit.edu.
² NEC Laboratories America, Princeton, NJ 08540, United States. Electronic address: renqiang@nec-labs.com.
³ Department of Computer Science, New Jersey Institute of Technology, Newark, NJ 07102, United States. Electronic address: zhiwei@njit.edu.

Abstract

Deep neural networks have been widely applied to missing data imputation. However, most existing studies have focused on imputing continuous data, while discrete data imputation remains under-explored. Discrete data are common in the real world, especially in bioinformatics, genetics, and biochemistry. In particular, much recent genomic data consists of discrete counts generated by single-cell RNA sequencing (scRNA-seq) technology. Most scRNA-seq studies produce a discrete matrix in which 'false' zero counts (missing values) are prevalent. To make downstream analyses more effective, imputation, which recovers the missing values, is often conducted as the first step in pre-processing scRNA-seq data. In this paper, we propose a novel Zero-Inflated Negative Binomial (ZINB) model-based autoencoder for imputing discrete scRNA-seq data. The novelties of our method are twofold. First, in addition to optimizing the ZINB likelihood, we propose to explicitly model the dropout events that cause missing values by using the Gumbel-Softmax distribution. Second, the zero-inflated reconstruction is further optimized with respect to the raw count matrix. Extensive experiments on simulated datasets demonstrate that the zero-inflated reconstruction significantly improves imputation accuracy. Experiments on real data show that the proposed imputation can improve the separation of different cell types and the accuracy of differential expression analysis.

Keywords: Deep learning; Imputation; scRNA-seq.
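
The two ingredients named in the abstract, the ZINB likelihood and a Gumbel-Softmax model of dropout events, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the mean/dispersion parameterization (mu, theta), the dropout probability pi, the temperature tau, and in particular the masking step `(1 - z) * mu` used as a "zero-inflated reconstruction" are assumptions made for illustration only. The abstract does not specify these details.

```python
import math
import random


def nb_log_pmf(x, mu, theta):
    """Log-probability of count x under a negative binomial with mean mu
    and dispersion theta (mean/dispersion parameterization, assumed here)."""
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )


def zinb_log_pmf(x, mu, theta, pi):
    """Log-probability of count x under a ZINB model, where pi (0 < pi < 1)
    is the probability of a structural (dropout) zero."""
    if x == 0:
        # A zero can come from dropout or from the NB component itself.
        return math.log(pi + (1.0 - pi) * math.exp(nb_log_pmf(0, mu, theta)))
    return math.log(1.0 - pi) + nb_log_pmf(x, mu, theta)


def relaxed_dropout_indicator(pi, tau, rng):
    """Gumbel-Softmax relaxation of a Bernoulli(pi) dropout indicator.
    Returns a value in [0, 1]; values near 1 mean "dropped out"."""
    def gumbel():
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep log() finite
        return -math.log(-math.log(u))

    a = (math.log(pi) + gumbel()) / tau          # "dropout" class
    b = (math.log(1.0 - pi) + gumbel()) / tau    # "observed" class
    m = max(a, b)                                # stabilize the softmax
    ea, eb = math.exp(a - m), math.exp(b - m)
    return ea / (ea + eb)


rng = random.Random(0)
mu, theta, pi = 2.0, 1.5, 0.3  # illustrative values, not from the paper

# Relaxed dropout sample and a hypothetical zero-inflated reconstruction.
z = relaxed_dropout_indicator(pi, tau=0.5, rng=rng)
masked_mean = (1.0 - z) * mu

# Sanity check: the ZINB probabilities sum to (approximately) one.
total = sum(math.exp(zinb_log_pmf(x, mu, theta, pi)) for x in range(400))
```

Lower values of tau push the relaxed indicator toward a hard 0/1 sample, while higher values make it smoother. In a trained autoencoder, mu, theta, and pi would be per-gene, per-cell outputs of the decoder rather than fixed constants.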

Publication types

Research Support, N.I.H., Extramural
Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Computational Biology
Computer Simulation
RNA-Seq
Sequence Analysis, RNA*
Single-Cell Analysis*

Grants and funding

UL1 TR003017/TR/NCATS NIH HHS/United States