BMC Genomics. 2016 Jan 5:17:28.
doi: 10.1186/s12864-015-2353-z.

Comparison of normalization and differential expression analyses using RNA-Seq data from 726 individual Drosophila melanogaster

Yanzhu Lin et al. BMC Genomics.

Abstract

Background: A generally accepted approach to the analysis of RNA-Seq read count data does not yet exist. We sequenced the mRNA of 726 individuals from the Drosophila Genetic Reference Panel in order to quantify differences in gene expression among single flies. One of our experimental goals was to identify the optimal analysis approach for the detection of differential gene expression among the factors we varied in the experiment: genotype, environment, sex, and their interactions. Here we evaluate three different filtering strategies, eight normalization methods, and two statistical approaches using our data set. We assessed differential gene expression among factors and performed a statistical power analysis using the eight biological replicates per genotype, environment, and sex in our data set.

Results: We found that the most critical considerations for the analysis of RNA-Seq read count data were the normalization method, the underlying data distribution assumption, and the number of biological replicates, an observation consistent with previous RNA-Seq and microarray analysis comparisons. Some common normalization methods, such as Total Count, Quantile, and RPKM normalization, did not align the data across samples. Furthermore, analyses using the Median, Quantile, and Trimmed Mean of M-values normalization methods were sensitive to the removal of low-expressed genes from the data set. Although it is robust in many types of analysis, the normal data distribution assumption produced results vastly different from those obtained under the negative binomial distribution. In addition, at least three biological replicates per condition were required in order to have sufficient statistical power to detect expression differences for the three-way interaction of genotype, environment, and sex.
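For readers unfamiliar with two of the normalization methods named above, the following is a minimal sketch of how Total Count scaling and RPKM are commonly defined. It is not the authors' code: the function names, the dictionary layout, and the choice to rescale to the mean library size are illustrative assumptions, and the paper's exact implementation may differ.

```python
def total_count_normalize(counts):
    """Scale each sample so its total equals the mean library size.

    counts: dict mapping sample name -> list of raw read counts per gene
    (hypothetical layout chosen for this sketch).
    """
    totals = {sample: sum(values) for sample, values in counts.items()}
    mean_total = sum(totals.values()) / len(totals)
    return {
        sample: [c * mean_total / totals[sample] for c in values]
        for sample, values in counts.items()
    }


def rpkm(count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return count * 1e9 / (gene_length_bp * total_mapped_reads)


raw = {"fly_a": [10, 30], "fly_b": [20, 60]}
scaled = total_count_normalize(raw)
```

Both methods rescale by a single library-wide total, so a few very highly expressed genes can dominate the scaling factor. That is one commonly cited reason such methods may fail to align samples.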

Conclusions: The best analysis approach to our data was to normalize the read counts using the DESeq method and apply a generalized linear model assuming a negative binomial distribution using either edgeR or DESeq software. Genes having very low read counts were removed after normalizing the data and fitting it to the negative binomial distribution. We describe the results of this evaluation and include recommended analysis strategies for RNA-Seq read count data.
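The DESeq normalization recommended above is generally described as a median-of-ratios method: each sample's size factor is the median, over genes, of the ratio of that sample's count to the gene's geometric mean across samples. The sketch below illustrates that idea in plain Python. It is an assumption-laden illustration rather than the DESeq package itself or the authors' pipeline, and the function names and list-of-lists layout are hypothetical.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of genes, each a list of raw counts per sample
    (hypothetical layout chosen for this sketch).
    """
    n_samples = len(counts[0])
    # Use only genes with no zero counts so the log geometric mean is finite.
    usable = [gene for gene in counts if all(c > 0 for c in gene)]
    if not usable:
        raise ValueError("no gene has non-zero counts in every sample")
    log_geo_means = [sum(math.log(c) for c in gene) / n_samples for gene in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [
            math.log(gene[j]) - lgm for gene, lgm in zip(usable, log_geo_means)
        ]
        factors.append(math.exp(median(log_ratios)))
    return factors


def normalize(counts):
    """Divide every raw count by its sample's size factor."""
    factors = size_factors(counts)
    return [[c / s for c, s in zip(gene, factors)] for gene in counts]


# Toy data: sample 2 was sequenced twice as deeply as sample 1.
raw = [[10, 20], [100, 200], [5, 10], [0, 7]]
factors = size_factors(raw)
normalized = normalize(raw)
```

In this toy example the second size factor is twice the first, so the normalized counts of the first three genes agree across the two samples. Because the factor is a median over genes, a handful of extreme genes has little influence on it.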

Figures

Fig. 1
Flow chart showing analysis approach
Fig. 2
Effect of removing low-expressed genes before or after read count normalization on differential gene expression. The cell plot shows the percentage of agreement between Workflow 1 and Workflow 2. a Generalized linear model using DESeq. b Generalized linear model using edgeR. c ANOVA using SAS. Abbreviations for normalization methods are the same as defined in Fig. 1
Fig. 3
Examples of differences observed in normalization methods. a Boxplots of individual RAL-320 males of Environment 2. b Boxplots of the coefficient of variation for RAL-900 females of Environment 3. c Boxplots of the coefficient of variation for RAL-900 males of Environment 3. A complete set of box plots can be found in Additional files 2 and 3. Abbreviations for normalization methods are the same as defined in Fig. 1. It has come to our attention that the line number designation for the Drosophila Genetic Reference Panel has been officially changed in Flybase. Specifically, the lines used to have a “RAL-” prefix; they now have “DGRP-” as the prefix (for example, “RAL-320” is now “DGRP-320”). We have used the “RAL-” prefix several times in our manuscript, in Fig. 3, and in Additional files 2 and 3. Future usage would be with the “DGRP-” prefix
Fig. 4
Comparison between the number and identity of differential genes estimated by DESeq and edgeR. The cell plot shows the percentage overlap between the two programs for each normalization method. a Workflow 1. b Workflow 2. c No filtering. Abbreviations for normalization methods are the same as defined in Fig. 1
Fig. 5
Comparison of the number and identity of differentially expressed genes obtained using the generalized linear model (GLM) with those obtained using the ANOVA model. The graph shows the percentage of differentially expressed genes for the Genotype × Environment × Sex term that agree between the GLM and ANOVA methods. Dark blue bars, overlap of DESeq GLM and ANOVA for Workflow 1; Light blue bars, overlap of DESeq GLM and ANOVA for Workflow 2; Medium blue bars, overlap of DESeq GLM and ANOVA for Workflow 3; Dark red bars, overlap of edgeR GLM and ANOVA for Workflow 1; Light pink bars, overlap of edgeR GLM and ANOVA for Workflow 2; Medium pink bars, overlap of edgeR GLM and ANOVA for Workflow 3. Abbreviations for normalization methods are the same as defined in Fig. 1
Fig. 6
Statistical power analysis. a Detectable fold-change versus statistical power for n = 2, 3, 4, 5, 6, 7, and 8 flies per genotype/environment/sex. b Estimated variance
Fig. 7
Empirical analysis approach using a reduced data set. The percentage of genes overlapping with the full data set is plotted against different false discovery rate thresholds for n = 2, 3, and 5 flies per genotype/environment/sex


