Identifying inaccuracies in gene expression estimates from unstranded RNA-seq data

Sci Rep. 2019 Nov 8;9(1):16342. doi: 10.1038/s41598-019-52584-w.

Abstract

RNA-seq methods are widely used for transcriptomic profiling of biological samples. However, the technology has known caveats that can skew gene expression estimates. Specifically, if the library preparation protocol does not retain RNA strand information, some genes can be erroneously quantitated. Although strand-specific protocols have been established, a significant portion of RNA-seq data is generated in a non-strand-specific manner. We used a comprehensive stranded RNA-seq dataset of 15 blood cell types to identify genes whose expression would be erroneously estimated if strand information were not available. We found that about 10% of all genes and 2.5% of protein-coding genes show a two-fold or higher difference in estimated expression when the strand information of the reads is ignored. We used parameters of the read alignments of these genes to construct a machine learning model that can identify which genes in an unstranded dataset might have incorrect expression estimates and which do not. We also show that differential expression analysis of genes with biased expression estimates in unstranded read data can be recovered by limiting the reads considered to those that span exonic boundaries. The resulting approach is implemented as a package available at https://github.com/mikpom/uslcount.
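The two ideas in the abstract that lend themselves to a concrete illustration are the two-fold discrepancy criterion and the restriction to reads that span exonic boundaries. The following is a minimal sketch of both, not the uslcount implementation: the function names, the pseudocount, and the example gene counts are all hypothetical, and detecting a spliced read by looking for an `N` (skipped region) operation in its SAM CIGAR string is an assumed stand-in for however the package identifies boundary-spanning reads.

```python
import math
import re


def fold_difference(stranded, unstranded, pseudocount=1.0):
    """Absolute log2 ratio between unstranded and stranded counts.

    The pseudocount (an assumption of this sketch) avoids division by zero.
    """
    return abs(math.log2((unstranded + pseudocount) / (stranded + pseudocount)))


def flag_biased_genes(stranded_counts, unstranded_counts, min_fold=2.0):
    """Return genes whose estimate changes at least min_fold when strand is ignored."""
    threshold = math.log2(min_fold)
    return sorted(
        gene
        for gene, count in stranded_counts.items()
        if fold_difference(count, unstranded_counts.get(gene, 0)) >= threshold
    )


def spans_exon_boundary(cigar):
    """True if a SAM CIGAR string contains an N operation (spliced alignment)."""
    return re.search(r"\d+N", cigar) is not None


# Hypothetical read counts for three genes, counted with and without strand.
stranded = {"GENE_A": 100, "GENE_B": 50, "GENE_C": 10}
unstranded = {"GENE_A": 110, "GENE_B": 150, "GENE_C": 45}

biased = flag_biased_genes(stranded, unstranded)
print(biased)
print(spans_exon_boundary("50M1000N50M"), spans_exon_boundary("100M"))
```

In this toy input, GENE_B and GENE_C gain enough reads from the opposite strand to cross the two-fold threshold, while GENE_A does not. A read aligned as `50M1000N50M` crosses an intron and would be kept under the boundary-spanning restriction, whereas a contiguous `100M` alignment would not.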

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Acrylates
  • Base Sequence*
  • Blood Cells / metabolism
  • Computational Biology / methods*
  • Databases, Genetic
  • Gene Expression Profiling*
  • High-Throughput Nucleotide Sequencing
  • Machine Learning
  • Phenyl Ethers
  • Programming Languages
  • Research Design

Substances

  • Acrylates
  • MA 12
  • Phenyl Ethers