Nat Biotechnol. 2014 Sep;32(9):888-95.
doi: 10.1038/nbt.3000. Epub 2014 Aug 24.

Detecting and Correcting Systematic Variation in Large-Scale RNA Sequencing Data

Sheng Li et al. Nat Biotechnol. 2014.
Free PMC article

Abstract

High-throughput RNA sequencing (RNA-seq) enables comprehensive scans of entire transcriptomes, but best practices for analyzing RNA-seq data have not been fully defined, particularly for data collected with multiple sequencing platforms or at multiple sites. Here we used standardized RNA samples with built-in controls to examine sources of error in large-scale RNA-seq studies and their impact on the detection of differentially expressed genes (DEGs). Analysis of variations in guanine-cytosine content, gene coverage, sequencing error rate and insert size allowed identification of decreased reproducibility across sites. Moreover, commonly used methods for normalization (cqn, EDASeq, RUV2, sva, PEER) varied in their ability to remove these systematic biases, depending on sample complexity and initial data quality. Normalization methods that combine data from genes across sites are strongly recommended to identify and remove site-specific effects and can substantially improve RNA-seq studies.
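
The normalization methods compared above (cqn, EDASeq, RUV2, sva, PEER) are R/Bioconductor packages applied alongside limma-voom. As a language-agnostic illustration only, the sketch below shows a simple between-library scaling step (upper-quartile scaling) in Python on a hypothetical genes-by-libraries count matrix; it is not the paper's pipeline, but stands in for the general idea of removing library-level differences before DEG calling.

    # Illustrative sketch only -- the study's normalization tools are R packages.
    # Upper-quartile scaling of a hypothetical genes-x-libraries count matrix removes
    # crude library-level scale differences before differential expression testing.
    import numpy as np

    def upper_quartile_normalize(counts):
        """Scale each library (column) so its upper quartile of nonzero counts
        matches the geometric mean of upper quartiles across libraries."""
        counts = np.asarray(counts, dtype=float)
        uq = np.array([np.percentile(col[col > 0], 75) for col in counts.T])
        target = np.exp(np.mean(np.log(uq)))  # geometric mean of the upper quartiles
        return counts / uq * target           # per-column scaling by broadcasting

    # Hypothetical usage: rows = genes, columns = libraries (samples x replicates x sites).
    counts = np.random.negative_binomial(5, 0.3, size=(1000, 24))
    normalized = upper_quartile_normalize(counts)

Methods such as RUV2, sva and PEER go further than per-library scaling: they estimate latent factors of unwanted variation (for example, site effects) from the data and remove them, which is the kind of cross-site correction the abstract recommends.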

Figures

Figure 1
Figure 1. Inter-site normalization and false positive DEGs
(a) Schematic plot of RNA-seq data from all 4 samples (A, B, C, D) and 6 sites (ILM1-6), followed by normalization and calling of all pairwise differentially expressed genes (DEGs). (b) Inter-site false positive DEGs, obtained by comparing the 4 replicate libraries made for a particular sample at one Illumina site to the replicates of the same sample from the other five sites, shown for all samples (A vs. A, B vs. B, C vs. C, D vs. D). We compare six normalization methods: original (standard limma-voom processing only), and with additional processing by EDASeq, cqn, RUV2, SVA, PEER (bar color). Thresholds used for DEG calls: FDR: 0.05, FC: 2.0. One site (ILM3) showed the most false positives before correction, although other sites also showed thousands of false positive DEGs.
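
The thresholds quoted here (FDR 0.05, fold change 2.0) can be expressed as a minimal Python sketch, assuming per-gene p-values and log2 fold changes are already available from an upstream model (the paper uses limma-voom in R); in a same-sample comparison across two sites, every gene passing both cutoffs is counted as an inter-site false positive.

    # Minimal sketch of the DEG thresholding in this caption: Benjamini-Hochberg
    # FDR control at 0.05 plus an absolute fold-change cutoff of 2.0. The p-values
    # and log2 fold changes are assumed to come from an upstream model.
    import numpy as np
    from statsmodels.stats.multitest import multipletests

    def call_degs(pvalues, log2fc, fdr=0.05, fc=2.0):
        reject, qvalues, _, _ = multipletests(pvalues, alpha=fdr, method="fdr_bh")
        is_deg = reject & (np.abs(log2fc) >= np.log2(fc))
        return is_deg, qvalues

    # For a same-sample comparison across two sites (e.g. A at ILM1 vs. A at ILM3),
    # is_deg.sum() gives the inter-site false-positive count plotted in panel (b).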
Figure 2
Figure 2. Evaluation of inter-site DEG reproducibility
For each of the six sites, all possible pairwise differential expression analyses were performed for all samples A to D, giving a total of six comparisons. We then assessed agreement across sites using different measures. (a) The Spearman rank correlation of the q-values from any two of the six sites is plotted, with color and shape indicating the samples compared. (b) Percentage of DEGs agreeing between two sites out of the union of DEGs detected at the two sites. (a-b) Along the x-axis we plot all 10 possible pairwise combinations of the 6 sites (ILM1 vs ILM2, etc.). (c) External validation by TaqMan, using the Matthews correlation coefficient (MCC) as the metric. Along the x-axis we plot all 6 possible pairwise combinations of the 4 samples. Blue indicates the fraction of DEGs shared, while the other colors represent the DEGs seen at only one of the sites. Different color and shape combinations represent the 6 sites.
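
The two agreement measures in panels (a) and (b) can be stated as a short Python sketch, assuming each site's analysis yields a q-value and a boolean DEG call per gene (an illustration, not the paper's code):

    # Sketch of the inter-site agreement measures: Spearman rank correlation of
    # q-values (panel a) and percentage of shared DEGs out of the union (panel b).
    import numpy as np
    from scipy.stats import spearmanr

    def qvalue_concordance(qvals_site1, qvals_site2):
        rho, _ = spearmanr(qvals_site1, qvals_site2)  # rank correlation of q-values
        return rho

    def deg_overlap_percent(degs_site1, degs_site2):
        d1 = np.asarray(degs_site1, dtype=bool)
        d2 = np.asarray(degs_site2, dtype=bool)
        union = np.sum(d1 | d2)                        # DEGs called at either site
        shared = np.sum(d1 & d2)                       # DEGs called at both sites
        return 100.0 * shared / union if union else np.nan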
Figure 3
Figure 3. Inter-site DEG detection and validation
(a) Schematic plot of the comparison between intra-site DEGs and inter-site DEGs. We show site ILM1 and the comparison of sample A vs. B as an example; the analysis was applied analogously to all 6 sites and all possible pairwise sample comparisons. (b) Spearman rank correlation of the adjusted p-values (q-values) for inter-site DEGs and intra-site DEGs. (c) Inter-site DEG validation by TaqMan, assessed by MCC for all six pairwise sample comparisons (A-B, A-C, A-D, B-C, B-D, C-D). (b,c) We compare six normalization methods: original (standard limma-voom processing only), and with additional processing by EDASeq, cqn, RUV2, SVA, PEER. Thresholds for DEG calls: FDR: 0.05, FC: 2.0.
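
The Matthews correlation coefficient (MCC) used for the TaqMan validation in panel (c) combines true and false positives and negatives into a single score between -1 and 1. A minimal Python sketch, assuming boolean RNA-seq DEG calls and TaqMan-derived calls over the same genes treated as ground truth:

    # MCC between RNA-seq DEG calls and TaqMan-derived DEG calls (both assumed to
    # be boolean arrays over the same set of genes); 1 means perfect agreement.
    import numpy as np

    def matthews_cc(predicted, truth):
        p = np.asarray(predicted, dtype=bool)
        t = np.asarray(truth, dtype=bool)
        tp = np.sum(p & t)
        tn = np.sum(~p & ~t)
        fp = np.sum(p & ~t)
        fn = np.sum(~p & t)
        denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
        return (tp * tn - fp * fn) / denom if denom else 0.0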
Figure 4
Figure 4. MCC evaluation of intra-site DEG detections using TaqMan data
Each violin plot summarizes data points from the 6 sites. We compare six normalization methods: original (standard limma-voom processing only), and with additional processing by EDASeq, cqn, RUV2, SVA, PEER. Thresholds for DEG calls: FDR: 0.05, FC: 2.0.
Figure 5
Figure 5. Examination of RNA-seq data quality identifies major sources of variation
(a) GC content distribution (sample A). The x-axis is GC content (%) and the y-axis is the percentage of reads with the corresponding GC content. Point shapes distinguish replicates (1: unfilled circle; 5: unfilled triangle). (b) The maximum percentage of reads falling into any single GC content bin (0% to 100%); a higher value indicates that a sample's reads are concentrated at a particular GC content. (c) Average base error rate across all sequenced bases (y-axis) for each site (x-axis). (d) Coefficient of variation of the percentage of genebody coverage (y-axis), a measure of the evenness of coverage across all gene bodies, for each site (x-axis). (e) The percentage of reads covering each nucleotide position, with all genes scaled to 100 bins from the 5′ UTR to the 3′ UTR, for sample A, replicates 1-5. Replicate 1 displayed site-dependent variation in genebody coverage for ILM3 (3′ bias), whereas replicate 5 showed similar genebody coverage regardless of where it was sequenced, suggesting that genebody coverage is influenced by library preparation. (f) Nucleotide frequency versus position for aligned reads. The percentage of each base (A, G, C, T) was plotted as a function of position along the read for two replicates (1, 5) at all sites. Replicate 1 displayed site-dependent base composition frequencies, whereas replicate 5 showed similar base composition frequencies regardless of where it was sequenced, suggesting that base composition frequency is largely a result of library preparation. Only the 20th to the 100th bases are shown here; the full read range can be seen in Supplementary Fig. 4. Vertical facets correspond to samples A-D. Site information for ILM1-6 is color-coded. Replicates 1-4 were prepared and sequenced independently at each site, whereas replicate 5 was prepared at a single site and then sequenced at a subset of all sites. Point shapes distinguish replicates.
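
Two of the per-library QC summaries in this figure are simple to compute once reads are available; a Python sketch, assuming read sequences and 100-bin gene-body coverage profiles have already been extracted by upstream alignment and QC tools:

    # Sketch of two QC summaries from this figure: the GC-content distribution of
    # reads (panels a-b) and the coefficient of variation of gene-body coverage
    # (panel d), where a higher CV indicates less even coverage along genes.
    import numpy as np

    def gc_content_distribution(reads, bins=101):
        """Percent of reads falling into each GC-content bin from 0% to 100%."""
        gc = np.array([100.0 * (r.count("G") + r.count("C")) / len(r) for r in reads])
        hist, _ = np.histogram(gc, bins=bins, range=(0, 100))
        return 100.0 * hist / len(reads)

    def genebody_coverage_cv(coverage_bins):
        """CV of coverage across 100 gene-body bins (5' UTR to 3' UTR)."""
        cov = np.asarray(coverage_bins, dtype=float)
        return float(np.std(cov) / np.mean(cov))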


