Machine learning for cell type classification from single nucleus RNA sequencing data

Huy Le; Beverly Peng; Janelle Uy; Daniel Carrillo; Yun Zhang; Brian D Aevermann; Richard H Scheuermann

doi:10.1371/journal.pone.0275070

PLoS One. 2022 Sep 23;17(9):e0275070. doi: 10.1371/journal.pone.0275070. eCollection 2022.

Authors

Huy Le¹, Beverly Peng¹, Janelle Uy¹, Daniel Carrillo¹, Yun Zhang², Brian D Aevermann², Richard H Scheuermann²,³,⁴

Affiliations

¹ Department of Bioengineering, University of California, San Diego, CA, United States of America.
² Department of Informatics, J. Craig Venter Institute, La Jolla, CA, United States of America.
³ Department of Pathology, University of California, San Diego, CA, United States of America.
⁴ La Jolla Institute for Immunology, San Diego, CA, United States of America.

Abstract

With the advent of single cell/nucleus RNA sequencing (sc/snRNA-seq), cell phenotyping has become a data-driven exercise in which statistical evidence supports cell type/state categorization. However, classifying cells into specific, well-defined categories from the empirical data provided by sc/snRNA-seq remains nontrivial: related cell types with close transcriptional similarities are difficult to tell apart, which in turn makes it challenging to match cell types identified in separate experiments. To investigate possible approaches to overcome these obstacles, we explored the use of supervised machine learning methods, namely logistic regression, support vector machines, random forests, neural networks, and light gradient boosting machine (LightGBM), as approaches to classify cell types using snRNA-seq datasets from human brain middle temporal gyrus (MTG) and human kidney. Classification accuracy was evaluated using an F-beta score weighted in favor of precision to account for technical artifacts of gene expression dropout. We examined the impact of hyperparameter optimization and feature selection methods on F-beta score performance. We found that the best performing model for granular cell type classification in both datasets was a multinomial logistic regression classifier and that an effective feature selection step was the most influential factor in optimizing the performance of the machine learning pipelines.
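To illustrate the evaluation metric mentioned above, the following is a minimal sketch of a per-class F-beta score, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), where P is precision and R is recall. Choosing beta < 1 weights the score in favor of precision. The abstract does not state the beta value used, so beta = 0.5 here is an assumption, and the function name `f_beta` and the toy labels "A" and "B" are hypothetical.

```python
def f_beta(y_true, y_pred, positive, beta=0.5):
    """Per-class F-beta score; beta < 1 favors precision over recall.

    beta=0.5 is an illustrative assumption, not a value taken from the abstract.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Toy labels (hypothetical): class "A" is predicted with perfect precision
# but only half of the true "A" cells are recovered.
y_true = ["A", "A", "A", "A", "B", "B"]
y_pred = ["A", "A", "B", "B", "B", "B"]

for beta in (0.5, 1.0, 2.0):
    print(f"beta={beta}: F-beta for class A = {f_beta(y_true, y_pred, 'A', beta):.4f}")
```

In this toy case the classifier is precise but misses true members of the class, so the score is highest for beta = 0.5 and lowest for beta = 2.0. This is the sense in which a precision-weighted score is more tolerant of missed calls than of false assignments.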

Publication types

Research Support, N.I.H., Extramural

MeSH terms

Humans
Logistic Models
Machine Learning*
RNA*
RNA, Small Nuclear
Sequence Analysis, RNA / methods
Support Vector Machine

Substances

RNA, Small Nuclear
RNA

Grants and funding

RF1 MH123220/MH/NIMH NIH HHS/United States