Nucleic Acids Res. 2015 Sep 18;43(16):7664-74.
doi: 10.1093/nar/gkv736. Epub 2015 Jul 21.

How data analysis affects power, reproducibility and biological insight of RNA-seq studies in complex datasets

Lucia Peixoto et al. Nucleic Acids Res. 2015.

Abstract

The sequencing of the full transcriptome (RNA-seq) has become the preferred choice for the measurement of genome-wide gene expression. Despite its widespread use, challenges remain in RNA-seq data analysis. One often-overlooked aspect is normalization. Although a variety of factors or 'batch effects' can contribute unwanted variation to the data, commonly used RNA-seq normalization methods only correct for sequencing depth. The study of gene expression is particularly problematic when it is influenced simultaneously by a variety of biological factors in addition to the one of interest. Using examples from experimental neuroscience, we show that batch effects can dominate the signal of interest and that the choice of normalization method affects the power and reproducibility of the results. While commonly used global normalization methods are not able to adequately normalize the data, more recently developed RNA-seq normalization methods can. We focus on one particular method, RUVSeq, and show that it is able to increase the power and biological insight of the results. Finally, we provide a tutorial outlining the implementation of RUVSeq normalization that is applicable to a broad range of studies as well as meta-analysis of publicly available data.
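To make concrete what "only correct for sequencing depth" means, here is a minimal Python sketch of upper-quartile (UQ) scaling, one of the global normalization methods compared in the figures. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the quartile is taken over each sample's nonzero counts, and the factors are rescaled to average 1. Published tools may differ on each of these choices.

```python
from statistics import quantiles


def upper_quartile_factors(samples):
    """Return one scaling factor per sample.

    `samples` is a list of samples, each a list of gene counts in the same
    gene order. Assumption: the 75th percentile is computed over nonzero
    counts only, and each sample has at least two nonzero counts.
    """
    upper_quartiles = []
    for sample in samples:
        nonzero = [c for c in sample if c > 0]
        # quantiles(..., n=4) returns the three quartile cut points
        upper_quartiles.append(quantiles(nonzero, n=4, method="inclusive")[2])
    mean_uq = sum(upper_quartiles) / len(upper_quartiles)
    # Rescale so that the factors average 1 across samples
    return [uq / mean_uq for uq in upper_quartiles]


def upper_quartile_normalize(samples):
    """Divide every count in a sample by that sample's scaling factor."""
    factors = upper_quartile_factors(samples)
    return [[c / f for c in sample] for sample, f in zip(samples, factors)]
```

A single factor per sample can remove a depth difference (for example, one library sequenced twice as deeply as another), but it cannot remove a batch effect that shifts some genes and not others, which is the limitation the abstract describes.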

Figures

Figure 1.
Unwanted variation dominates the signal in RNA-seq studies in experimental neuroscience. PCA plots of gene counts normalized using either upper-quartile (UQ) or FPKM from publicly available datasets from the mouse hippocampus. (A) GSE0261, mRNA-seq of wild-type (in red) versus knock-out mice (in blue). A severe batch effect is observed in the WT samples (40). (B) GSE0262, small RNA-seq of wild-type (in red) versus knock-out mice (in blue). A severe batch effect is observed in the WT and KO samples (40). (C) GSE58797, mRNA-seq of mice injected with shRNA to knock down expression of a gene (green), scrambled shRNA (red, controls) and injected with shRNA to knock down expression of a gene and submitted to contextual fear conditioning (FC, blue). A batch effect can be observed in the controls, and there is no separation between FC and naïve injected animals (41). (D) GSE61915, mRNA-seq of young (3 weeks, blue) versus old (24 weeks, red) animals. Proper grouping of treatment samples is observed (42). (E) GSE53380, mRNA-seq of wild-type (control, in red), KO animals (in blue), WT animals following novel-object recognition (NOR, purple) and KO animals following NOR (green). One control sample is an outlier; no separation is observed among all other samples (43). (F) GSE65159, mRNA-seq of animals 2 weeks (2wk, red) and 6 weeks (6wk, blue) following the induction of p25 expression (mouse model of Alzheimer's disease, AD) and their respective controls (green and purple). As expected, no difference is observed over time without induction of neurodegeneration; proper separation of samples by treatment is observed in the AD mouse model (44). (G) GSE58343, mRNA-seq of home cage (HC, blue) and fear-conditioned animals (FC, red). Includes paired-end (PE) and single-end (SE) technical replicates, RNA obtained from neuronal dendrites (dend) versus soma, and RNA following ribosome immunoprecipitation (IP) versus supernatant of the same sample (SN). There is no separation between HC and FC samples, or IP and SN samples (45). (H) GSE44229, mRNA-seq of home-cage (HC, red) versus animals obtained following object location memory (OLM, blue). There is no separation between HC and OLM samples (24).
Figure 2.
RUV normalization corrects for unwanted variation in FC data. In red control samples matched for time of day (CC), in blue samples obtained 30 min after memory acquisition (FC), in green samples obtained 30 min after memory retrieval (RT). (A) Relative log expression (RLE) plot of all samples following traditional upper-quartile normalization (UQ). (B) RLE plots following normalization with RUV using negative controls and samples (RUVs). (C) Scatterplot of first two principal components (log-scaled, centered counts) following UQ normalization. The first two PCs explained 66% and 6% of the variance, respectively. (D) Scatterplot of first two principal components following RUVs normalization. The first two PCs explained 19.9% and 13.1% of the variance, respectively. Samples do not cluster according to treatment following UQ normalization but do so after applying RUVs. UQ normalization and RLE and PCA plots were performed using the R/Bioconductor package EDASeq (v. 2.0.0). RUVs normalization was performed using the R/Bioconductor package RUVSeq (v. 1.0.0).
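The relative log expression (RLE) values plotted in panels A and B can be sketched as follows: for each gene, take the log count in each sample and subtract that gene's median log count across samples. This is a minimal illustration and not the EDASeq code; the function names and the pseudocount of 1 (used to avoid taking the log of zero) are assumptions.

```python
import math
from statistics import median


def relative_log_expression(gene_by_sample, pseudocount=1.0):
    """Return a genes x samples table of RLE values.

    `gene_by_sample` is a list of rows, one per gene, each holding that
    gene's counts across samples. The pseudocount is an assumption.
    """
    rle = []
    for row in gene_by_sample:
        logs = [math.log(c + pseudocount) for c in row]
        gene_median = median(logs)
        # Deviation of each sample from the gene's median log count
        rle.append([value - gene_median for value in logs])
    return rle


def sample_medians(rle):
    """Median RLE per sample: the centre line of each box in an RLE plot."""
    n_samples = len(rle[0])
    return [median(row[j] for row in rle) for j in range(n_samples)]
```

If most genes are unaffected by the treatment, well-normalized samples should have RLE distributions centred near zero with similar spread, so a sample whose median sits far from zero points to residual unwanted variation.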
Figure 3.
RUV normalization corrects for unwanted variation in GEO datasets. PCA plots of RUVs normalized gene counts (using all genes as negative controls) from publicly available datasets from the mouse hippocampus. (A) GSE0261, mRNA-seq of wild-type (in red) versus knock-out mice (in blue). Batch effect no longer evident (40). (B) GSE0262, small RNA-seq of wild-type (in red) versus knock-out mice (in blue). Batch effect no longer evident (40). (C) GSE58797, mRNA-seq of mice injected with shRNA to knock down expression of a gene (green), scrambled shRNA (red, controls) and injected with shRNA to knock down expression of a gene and submitted to contextual fear conditioning (FC, blue). Batch effect no longer evident (41). (D) GSE61915, mRNA-seq of young (3 weeks, blue) versus old (24 weeks, red) animals. Proper grouping of treatment samples is maintained (42). (E) GSE53380, mRNA-seq of wild-type (control, in red), KO animals (in blue), WT animals following novel-object recognition (NOR, purple) and KO animals following NOR (green). Proper grouping of experimental conditions is improved (43). (F) GSE65159, mRNA-seq of animals 2 weeks (2wk, red) and 6 weeks (6wk, blue) following the induction of p25 expression (mouse model of Alzheimer's disease, AD) and their respective controls (green and purple). As expected, no difference is observed over time without induction of neurodegeneration; proper separation of samples by treatment is improved (44). (G) GSE58343, mRNA-seq of home cage (HC, blue) and fear-conditioned animals (FC, red). Includes paired-end (PE) and single-end (SE) technical replicates, RNA obtained from neuronal dendrites (dend) versus soma, and RNA following ribosome immunoprecipitation (IP) versus supernatant of the same sample (SN). Separation between HC and FC samples, as well as IP and SN samples, is improved (45). (H) GSE44229, mRNA-seq of home-cage (HC, red) versus animals obtained following object location memory (OLM, blue). Batch effect no longer present (24).
Figure 4.
Normalization impacts differential expression after contextual fear conditioning. (A) Distribution of unadjusted edgeR P-values for tests of differential expression between FC and CC samples following UQ normalization. (B) Distribution of unadjusted edgeR P-values for tests of differential expression between FC and CC samples following RUVs normalization. The distribution of P-values following UQ normalization is far from the expected uniform. RUV returns uniformity to the P-value distribution and increases discovery of differentially expressed genes (genes that have a low P-value). (C) Volcano plot of differential expression (−log10 P-value versus log fold change) of UQ normalized samples. (D) Volcano plot of differential expression of RUVs normalized samples. Genes with an FDR < 0.01 are highlighted in blue. Positive controls are circled in red, negative controls are circled in green (Table S2). RUV increases the detection of known differentially expressed genes from 60% to 94%. Differential expression analysis was performed using the R/Bioconductor package edgeR (v. 3.8.5).
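The caption separates unadjusted P-values (panels A and B) from the FDR threshold used to highlight genes (panels C and D). The sketch below shows one way unadjusted P-values can be turned into FDR-adjusted values. It assumes the Benjamini-Hochberg procedure, which the caption does not state, and the function name is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the input order.

    Assumption: BH is the FDR procedure; the source caption only says "FDR".
    """
    n = len(pvalues)
    # Gene indices sorted from the smallest to the largest P-value
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * n / rank)
        adjusted[index] = running_min
    return adjusted


def significant_genes(gene_ids, pvalues, fdr_threshold=0.01):
    """Gene ids whose adjusted P-value falls below the threshold."""
    adjusted = benjamini_hochberg(pvalues)
    return [g for g, q in zip(gene_ids, adjusted) if q < fdr_threshold]
```

Adjustments of this kind implicitly rely on null P-values being roughly uniform, which is one reason a strongly non-uniform P-value histogram, as described for UQ normalization, is a warning sign.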
Figure 5.
RUV increases concordance of RNA-seq and microarray differential expression following fear conditioning. Y-axis: number of genes in agreement between microarray and RNA-seq data at any given rank. X-axis: differential expression rank (low to high P-value). In red: differentially expressed genes obtained using edgeR for UQ normalized RNA-seq data relative to those detected by microarrays using limma. In blue: differentially expressed genes obtained using edgeR for RUVs normalized RNA-seq data relative to those detected by microarrays using limma. The agreement between technologies on the top 100 differentially expressed genes doubles with RUVs normalization.
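The concordance curve described in this caption can be sketched as follows: for each rank k, count how many genes appear in the top k of both ranked lists. This is an illustrative reconstruction from the axis descriptions, not the authors' code; the function name and the example gene lists are hypothetical.

```python
def concordance_at_rank(ranked_a, ranked_b, max_rank=None):
    """Number of genes shared by the top-k of two rankings, for k = 1..max_rank.

    Both inputs are lists of gene identifiers ordered from the lowest to the
    highest P-value. Assumes no gene is repeated within a list.
    """
    limit = min(len(ranked_a), len(ranked_b))
    if max_rank is not None:
        limit = min(limit, max_rank)
    curve = []
    for k in range(1, limit + 1):
        # Intersection of the two top-k sets (simple, not the fastest form)
        shared = set(ranked_a[:k]) & set(ranked_b[:k])
        curve.append(len(shared))
    return curve


# Hypothetical rankings, for illustration only
rnaseq_rank = ["Fos", "Arc", "Egr1", "Npas4"]
microarray_rank = ["Arc", "Fos", "Nr4a1", "Egr1"]
print(concordance_at_rank(rnaseq_rank, microarray_rank))  # [0, 2, 2, 3]
```

A curve that rises closer to the diagonal (k shared genes at rank k) indicates better agreement between the two technologies.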
Figure 6.
RUV allows removal of laboratory specific effects for combined analysis of gene expression changes following FC and OLM. In red control samples matched for time of day (CC), in blue samples obtained 30 min after memory acquisition (FC), in green samples obtained after object location memory (OLM). (A) Relative log expression (RLE) plot of all samples following upper-quartile normalization (UQ). (B) RLE plots following normalization with RUV using negative controls and samples (RUVs). (C) Scatterplot of first two principal components (log-scaled, centered counts) following UQ normalization. The first two PCs explained 73.4% and 9.6% of the variance, respectively. (D) Scatterplot of first two principal components following RUVs normalization. The first two PCs explained 15.5% and 9.4% of the variance, respectively. Samples cluster according to laboratory following UQ normalization but cluster according to treatment after applying RUVs.
Figure 7.
Quantitative and qualitative effects of the choice of normalization method in combined analysis of gene expression changes following FC and OLM. (A) Number of genes and enriched KEGG pathways for OLM and FC relative to combined controls following UQ normalization. UQ normalization leads to inferring housekeeping genes as differentially expressed. (B) Number of genes and enriched KEGG pathways for OLM and FC relative to combined controls following RUVs normalization. The apparent regulation of housekeeping genes has been removed.
Figure 8.
Step-by-step outline of the application of RUV to normalization of RNA-seq data.

References

    1. Leek J.T., Scharpf R.B., Bravo H.C., Simcha D., Langmead B., Johnson W.E., Geman D., Baggerly K., Irizarry R.A. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat. Rev. Genet. 2010;11:733–739.
    2. Bullard J.H., Purdom E., Hansen K.D., Dudoit S. Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics. 2010;11:94.
    3. Dillies M.A., Rau A., Aubert J., Hennequet-Antier C., Jeanmougin M., Servant N., Keime C., Marot G., Castel D., Estelle J., et al. A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Brief. Bioinformatics. 2013;14:671–683.
    4. Robinson M.D., Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 2010;11:R25.
    5. Mortazavi A., Williams B.A., McCue K., Schaeffer L., Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat. Methods. 2008;5:621–628.
