Variability in estimated gene expression among commonly used RNA-seq pipelines

Sci Rep. 2020 Feb 17;10(1):2734. doi: 10.1038/s41598-020-59516-z.

Abstract

RNA-sequencing data is widely used to identify disease biomarkers and therapeutic targets using numerical methods such as clustering, classification, regression, and differential expression analysis. Such approaches rely on the assumption that mRNA abundance estimates from RNA-seq are reliable estimates of true expression levels. Here, using data from five RNA-seq processing pipelines applied to 6,690 human tumor and normal tissues, we show that nearly 88% of protein-coding genes have similar gene expression profiles across all pipelines. However, for >12% of protein-coding genes, current best-in-class RNA-seq processing pipelines differ in their abundance estimates by more than four-fold when applied to exactly the same samples and the same set of RNA-seq reads. Expression fold changes are similarly affected. Many of the impacted genes are widely studied disease-associated genes. We show that impacted genes exhibit diverse patterns of discordance among pipelines, suggesting that many inter-pipeline differences contribute to overall uncertainty in mRNA abundance estimates. A concerted, community-wide effort will be needed to develop gold standards for estimating the mRNA abundance of the discordant genes reported here. In the meantime, our list of discordantly evaluated genes provides an important resource for robust marker discovery and target selection.
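
To make the four-fold criterion concrete, the following is a minimal sketch of one way to flag genes whose abundance estimates disagree across pipelines on the same samples. It is not the authors' procedure: the function names (`max_fold_difference`, `discordant_genes`), the pseudocount, and the rule that a gene is flagged when at least half of the samples exceed the threshold are all assumptions made for illustration, and the toy values are invented.

```python
import math


def max_fold_difference(estimates, pseudocount=1.0):
    """Largest fold difference among pipeline estimates for one gene in one sample.

    The pseudocount (an assumption) avoids log(0) for unexpressed genes.
    """
    logs = [math.log2(x + pseudocount) for x in estimates]
    return 2 ** (max(logs) - min(logs))


def discordant_genes(abundance, fold_threshold=4.0,
                     min_sample_fraction=0.5, pseudocount=1.0):
    """Flag genes whose pipelines disagree by more than `fold_threshold`.

    `abundance` maps gene -> pipeline -> list of per-sample estimates,
    with samples in the same order for every pipeline.
    The `min_sample_fraction` rule is an illustrative assumption.
    """
    flagged = []
    for gene, by_pipeline in abundance.items():
        # Regroup so each tuple holds all pipelines' estimates for one sample.
        per_sample = list(zip(*by_pipeline.values()))
        n_discordant = sum(
            max_fold_difference(s, pseudocount) > fold_threshold
            for s in per_sample
        )
        if per_sample and n_discordant / len(per_sample) >= min_sample_fraction:
            flagged.append(gene)
    return flagged


# Toy example with two hypothetical pipelines and two samples.
toy = {
    "GENE_A": {"pipeline_1": [10.0, 20.0], "pipeline_2": [11.0, 19.0]},
    "GENE_B": {"pipeline_1": [100.0, 200.0], "pipeline_2": [10.0, 15.0]},
}
print(discordant_genes(toy))
```

In this toy input, only `GENE_B` is flagged, because its two pipelines disagree by well over four-fold in both samples. Working on a log scale keeps the comparison symmetric, so it does not matter which pipeline reports the higher value.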

Publication types

  • Research Support, N.I.H., Extramural
  • Research Support, Non-U.S. Gov't

MeSH terms

  • CCAAT-Enhancer-Binding Proteins / genetics*
  • CCAAT-Enhancer-Binding Proteins / metabolism
  • Exome Sequencing
  • Gene Expression Profiling
  • Gene Expression Regulation, Neoplastic*
  • Genetic Variation
  • High-Throughput Nucleotide Sequencing
  • Humans
  • Neoplasm Proteins / genetics*
  • Neoplasm Proteins / metabolism
  • Neoplasms / genetics*
  • Neoplasms / metabolism
  • Neoplasms / pathology
  • Nuclear Proteins / genetics
  • Nuclear Proteins / metabolism
  • Nucleophosmin
  • Platelet-Derived Growth Factor / genetics*
  • Platelet-Derived Growth Factor / metabolism
  • Principal Component Analysis
  • Receptor, ErbB-2 / genetics*
  • Receptor, ErbB-2 / metabolism
  • Sequence Analysis, RNA
  • Splicing Factor U2AF / genetics
  • Splicing Factor U2AF / metabolism

Substances

  • CCAAT-Enhancer-Binding Proteins
  • CEBPA protein, human
  • Neoplasm Proteins
  • Nuclear Proteins
  • Platelet-Derived Growth Factor
  • Splicing Factor U2AF
  • U2AF1 protein, human
  • platelet-derived growth factor A
  • Nucleophosmin
  • ERBB2 protein, human
  • Receptor, ErbB-2