Identifying inaccuracies in gene expression estimates from unstranded RNA-seq data

Mikhail Pomaznoy; Ashu Sethi; Jason Greenbaum; Bjoern Peters

doi:10.1038/s41598-019-52584-w

Sci Rep. 2019 Nov 8;9(1):16342. doi: 10.1038/s41598-019-52584-w.

Authors

Mikhail Pomaznoy¹, Ashu Sethi², Jason Greenbaum², Bjoern Peters²,³

Affiliations

¹ Division of Vaccine Discovery, La Jolla Institute for Immunology, La Jolla, CA, United States. mikhail@lji.org.
² Division of Vaccine Discovery, La Jolla Institute for Immunology, La Jolla, CA, United States.
³ Department of Medicine, University of California San Diego, La Jolla, CA, United States.

Abstract

RNA-seq methods are widely used for transcriptomic profiling of biological samples. However, this technology has known caveats that can skew gene expression estimates. Specifically, if the library preparation protocol does not retain RNA strand information, some genes can be erroneously quantified. Although strand-specific protocols have been established, a significant portion of RNA-seq data is generated in a non-strand-specific manner. We used a comprehensive stranded RNA-seq dataset of 15 blood cell types to identify genes whose expression would be erroneously estimated if strand information were not available. We found that about 10% of all genes and 2.5% of protein-coding genes have a two-fold or higher difference in estimated expression when the strand information of the reads is ignored. We used parameters of the read alignments of these genes to construct a machine learning model that can identify which genes in an unstranded dataset might have incorrect expression estimates and which do not. We also show that differential expression analysis of genes with biased expression estimates in unstranded read data can be recovered by limiting the reads considered to those that span exonic boundaries. The resulting approach is implemented as a package available at https://github.com/mikpom/uslcount.
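
The two ideas in the abstract, flagging genes whose estimate changes two-fold or more when strand is ignored and restricting counting to reads that span exon boundaries, can be illustrated with a minimal Python sketch. This is not the uslcount API or the authors' implementation: the function names, the pseudocount of 1, and the toy counts are all assumptions made for illustration. The only external fact relied on is that the SAM format uses the CIGAR operation `N` for a skipped reference region, which in RNA-seq alignments typically marks an intron.

```python
import re


def fold_difference(stranded, unstranded, pseudocount=1.0):
    """Ratio of the larger to the smaller count (hypothetical helper).

    A pseudocount (an assumption, not taken from the paper) avoids
    division by zero for genes with no reads in one of the two modes.
    """
    a = stranded + pseudocount
    b = unstranded + pseudocount
    return max(a, b) / min(a, b)


def flag_biased_genes(stranded_counts, unstranded_counts, threshold=2.0):
    """Return genes whose counts differ by `threshold`-fold or more.

    Both arguments are dicts mapping gene ID -> read count. Genes missing
    from either dict are skipped rather than guessed at.
    """
    flagged = []
    for gene in sorted(stranded_counts):
        if gene not in unstranded_counts:
            continue
        ratio = fold_difference(stranded_counts[gene], unstranded_counts[gene])
        if ratio >= threshold:
            flagged.append(gene)
    return flagged


def is_spliced(cigar):
    """True if a SAM CIGAR string contains an N (skipped region) operation."""
    operations = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    return any(op == "N" for _, op in operations)


# Toy counts, invented purely for illustration.
stranded = {"GENE_A": 100, "GENE_B": 10, "GENE_C": 50}
unstranded = {"GENE_A": 105, "GENE_B": 45, "GENE_C": 60}
biased = flag_biased_genes(stranded, unstranded)
```

In this toy input only `GENE_B` crosses the two-fold threshold, the kind of gene that might be inflated by reads from an overlapping gene on the opposite strand. For such genes, a filter like `is_spliced` would keep only alignments that cross a splice junction, in the spirit of the exon-boundary restriction the abstract describes.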

Publication types

Research Support, N.I.H., Extramural

MeSH terms

Acrylates
Base Sequence*
Blood Cells / metabolism
Computational Biology / methods*
Databases, Genetic
Gene Expression Profiling*
High-Throughput Nucleotide Sequencing
Machine Learning
Phenyl Ethers
Programming Languages
Research Design

Substances

Acrylates
MA 12
Phenyl Ethers
