BMC Bioinformatics. 2017 Oct 3;18(1):437.
doi: 10.1186/s12859-017-1847-x.

Tissue-aware RNA-Seq Processing and Normalization for Heterogeneous and Sparse Data

Free PMC article

Joseph N Paulson et al. BMC Bioinformatics.


Background: Although ultrahigh-throughput RNA-Sequencing has become the dominant technology for genome-wide transcriptional profiling, the vast majority of RNA-Seq studies typically profile only tens of samples, and most analytical pipelines are optimized for these smaller studies. However, projects are generating ever-larger data sets comprising RNA-Seq data from hundreds or thousands of samples, often collected at multiple centers and from diverse tissues. These complex data sets present significant analytical challenges due to batch and tissue effects, but provide the opportunity to revisit the assumptions and methods that we use to preprocess, normalize, and filter RNA-Seq data - critical first steps for any subsequent analysis.

Results: We find that analysis of large RNA-Seq data sets requires both careful quality control and accounting for the sparsity that arises from the heterogeneity intrinsic to multi-group studies. We developed Yet Another RNA Normalization software pipeline (YARN), which includes quality control and preprocessing, gene filtering, and normalization steps designed to facilitate downstream analysis of large, heterogeneous RNA-Seq data sets, and we demonstrate its use with data from the Genotype-Tissue Expression (GTEx) project.

Conclusions: An R package instantiating YARN is available at .

Keywords: Filtering; GTEx; Normalization; Preprocessing; Quality control; RNA-Seq.

Conflict of interest statement

Ethics approval and consent to participate

This work was conducted under dbGaP approved protocol #9112 (accession phs000424.v6.p1).

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Fig. 1
Preprocessing workflow for large, heterogeneous RNA-Seq data sets, as applied to the GTEx data. The boxes on the right show the number of samples, genes, and tissue types at each step. First, samples were filtered using PCoA with Y-chromosome genes to test for correct annotation of the sex of each sample. PCoA was used to group or separate samples derived from related tissue regions. Genes were filtered to select a normalization gene set to preserve robust, tissue-dependent expression. Finally, the data were normalized using a global count distribution method to support cross-tissue comparison while minimizing within-group variability
Fig. 2
PCoA allows subregions to be grouped for greater power. Scatterplots of the first and second principal coordinates from principal coordinate analysis on major tissue regions. (a) Aorta, coronary artery, and tibial artery form distinct clusters. (b) Skin samples from two regions group together but are distinct from fibroblast cell lines, a result that holds (c) when the fibroblasts are removed
Fig. 3
Six highly expressed tissue-specific genes that are removed upon tissue-agnostic filtering. Boxplots of continuity-corrected log2 counts for six tissue-specific genes (a-f). These genes are retained when considering tissue-specificity and not when filtering in an unsupervised manner. Colors represent different tissues. Examples include (a) MUC7, (b) REG3A, (c) AHSG, (d) GKN1, (e) SMCP, and (f) NPPB
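
The contrast drawn in the Fig. 3 caption can be illustrated with a small sketch. It is not the YARN implementation. The toy counts, the gene and tissue names, the expression threshold `MIN_COUNT`, the 50% rule used for the tissue-agnostic filter, and the "smallest tissue group" rule used for the tissue-aware filter are all assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical toy data, not GTEx values: tissues[i] names the tissue of
# sample i, and counts[gene] holds one read count per sample.
tissues = ["lung"] * 6 + ["heart"] * 6 + ["salivary"] * 2

counts = {
    "housekeeping": [50, 60, 55, 70, 65, 58, 52, 61, 59, 66, 63, 57, 54, 62],
    "salivary_only": [0] * 12 + [900, 1200],
    "noise": [0] * 13 + [1],
}

MIN_COUNT = 10  # assumed threshold for calling a gene expressed in a sample


def n_expressed(values, min_count=MIN_COUNT):
    """Count the samples in which the gene reaches min_count reads."""
    return sum(1 for v in values if v >= min_count)


def filter_agnostic(counts, min_fraction=0.5):
    """Keep genes expressed in at least min_fraction of all samples."""
    return [
        gene
        for gene, values in counts.items()
        if n_expressed(values) >= min_fraction * len(values)
    ]


def filter_tissue_aware(counts, tissues):
    """Keep genes expressed in at least as many samples as the smallest tissue group."""
    smallest_group = min(Counter(tissues).values())
    return [
        gene
        for gene, values in counts.items()
        if n_expressed(values) >= smallest_group
    ]


agnostic_kept = filter_agnostic(counts)
aware_kept = filter_tissue_aware(counts, tissues)
print("tissue-agnostic:", agnostic_kept)
print("tissue-aware:   ", aware_kept)
```

In this toy example the gene expressed only in the two salivary samples is dropped by the tissue-agnostic rule and retained by the tissue-aware rule, while the near-zero gene is dropped by both.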
Fig. 4
Using a tissue-defined reference lowers root mean squared error (RMSE). Boxplots of the RMSE comparing the log-transformed quantiles of each sample to a reference defined using (left) all tissues and samples and (right) only samples of the same tissue
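
The comparison described in the Fig. 4 caption can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' code. The two simulated tissues, the exponential count model, the number of quantiles, and the choice of the per-quantile mean as the reference are all assumptions.

```python
import math
import random


def log_quantiles(sample, n_quantiles=20):
    """Return n_quantiles evenly spaced quantiles of log2(count + 1)."""
    logged = sorted(math.log2(c + 1) for c in sample)
    last = len(logged) - 1
    return [logged[round(i * last / (n_quantiles - 1))] for i in range(n_quantiles)]


def reference(quantile_sets):
    """Average each quantile across a set of samples (assumed reference)."""
    return [sum(col) / len(col) for col in zip(*quantile_sets)]


def rmse(a, b):
    """Root mean squared error between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


# Synthetic counts: two hypothetical tissues with very different count scales.
rng = random.Random(0)
samples = {
    "tissue_a": [[int(rng.expovariate(1 / 50)) for _ in range(200)] for _ in range(5)],
    "tissue_b": [[int(rng.expovariate(1 / 2000)) for _ in range(200)] for _ in range(5)],
}

quantiles = {t: [log_quantiles(s) for s in group] for t, group in samples.items()}
all_quantiles = [q for group in quantiles.values() for q in group]

# Reference built from every sample, regardless of tissue.
global_ref = reference(all_quantiles)
rmse_global = [rmse(q, global_ref) for q in all_quantiles]

# Reference built separately from the samples of each tissue.
rmse_tissue = [
    rmse(q, reference(group)) for group in quantiles.values() for q in group
]

mean_global = sum(rmse_global) / len(rmse_global)
mean_tissue = sum(rmse_tissue) / len(rmse_tissue)
print(f"mean RMSE vs global reference: {mean_global:.3f}")
print(f"mean RMSE vs tissue reference: {mean_tissue:.3f}")
```

Because the simulated tissues differ strongly in scale, each sample lies closer to its own tissue's reference than to the pooled one, which is the pattern the caption reports.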


Cited by 8 articles

  • A reference map of the human binary protein interactome.
    Luck K, Kim DK, Lambourne L, Spirohn K, Begg BE, Bian W, Brignall R, Cafarelli T, Campos-Laborie FJ, Charloteaux B, Choi D, Coté AG, Daley M, Deimling S, Desbuleux A, Dricot A, Gebbia M, Hardy MF, Kishore N, Knapp JJ, Kovács IA, Lemmens I, Mee MW, Mellor JC, Pollis C, Pons C, Richardson AD, Schlabach S, Teeking B, Yadav A, Babor M, Balcha D, Basha O, Bowman-Colin C, Chin SF, Choi SG, Colabella C, Coppin G, D'Amata C, De Ridder D, De Rouck S, Duran-Frigola M, Ennajdaoui H, Goebels F, Goehring L, Gopal A, Haddad G, Hatchi E, Helmy M, Jacob Y, Kassa Y, Landini S, Li R, van Lieshout N, MacWilliams A, Markey D, Paulson JN, Rangarajan S, Rasla J, Rayhan A, Rolland T, San-Miguel A, Shen Y, Sheykhkarimli D, Sheynkman GM, Simonovsky E, Taşan M, Tejeda A, Tropepe V, Twizere JC, Wang Y, Weatheritt RJ, Weile J, Xia Y, Yang X, Yeger-Lotem E, Zhong Q, Aloy P, Bader GD, De Las Rivas J, Gaudet S, Hao T, Rak J, Tavernier J, Hill DE, Vidal M, Roth FP, Calderwood MA. Nature. 2020 Apr;580(7803):402-408. doi: 10.1038/s41586-020-2188-x. Epub 2020 Apr 8. PMID: 32296183
  • MetaOmGraph: a workbench for interactive exploratory data analysis of large expression datasets.
    Singh U, Hur M, Dorman K, Wurtele ES. Nucleic Acids Res. 2020 Feb 28;48(4):e23. doi: 10.1093/nar/gkz1209. PMID: 31956905 Free PMC article.
  • Nongenic cancer-risk SNPs affect oncogenes, tumour-suppressor genes, and immune function.
    Fagny M, Platig J, Kuijjer ML, Lin X, Quackenbush J. Br J Cancer. 2020 Feb;122(4):569-577. doi: 10.1038/s41416-019-0614-3. Epub 2019 Dec 6. PMID: 31806877 Free PMC article.
  • Personalised analytics for rare disease diagnostics.
    Anderson D, Baynam G, Blackwell JM, Lassmann T. Nat Commun. 2019 Nov 21;10(1):5274. doi: 10.1038/s41467-019-13345-5. PMID: 31754101 Free PMC article.
  • Understanding Tissue-Specific Gene Regulation.
    Sonawane AR, Platig J, Fagny M, Chen CY, Paulson JN, Lopes-Ramos CM, DeMeo DL, Quackenbush J, Glass K, Kuijjer ML. Cell Rep. 2017 Oct 24;21(4):1077-1088. doi: 10.1016/j.celrep.2017.10.001. PMID: 29069589 Free PMC article.


