Biostatistics. 2014 Jul;15(3):413-26.
doi: 10.1093/biostatistics/kxt053. Epub 2014 Jan 6.

Differential Expression Analysis of RNA-seq Data at Single-Base Resolution

Free PMC article

Alyssa C Frazee et al. Biostatistics.

Erratum in

  • Biostatistics. 2014 Jul;15(3):584-5


RNA-sequencing (RNA-seq) is a flexible technology for measuring genome-wide expression that is rapidly replacing microarrays as costs become comparable. Current differential expression analysis methods for RNA-seq data fall into two broad classes: (1) methods that quantify expression within the boundaries of genes previously published in databases and (2) methods that attempt to reconstruct full-length RNA transcripts. The first class cannot discover differential expression outside of previously known genes. While the second approach does possess discovery capabilities, statistical analysis of differential expression is complicated by the ambiguity and variability incurred while assembling transcripts and estimating their abundances. Here, we propose a novel method that first identifies differentially expressed regions (DERs) of interest by assessing differential expression at each base of the genome. The method then segments the genome into regions composed of bases showing similar differential expression signal, and then assigns a measure of statistical significance to each region. Optionally, DERs can be annotated using a reference database of genomic features. We compare our approach with leading competitors from both current classes of differential expression methods and highlight the strengths and weaknesses of each. A software implementation of our method is available on GitHub (

Keywords: Bioinformatics; Differential expression; False discovery rate; Genomics; RNA sequencing.
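
The identify-then-segment idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the Welch-style statistic, the stabilizing constant `s0`, the hard cutoffs, and every function name below are assumptions chosen for illustration, and the step that assigns statistical significance to each region is omitted. It shows only how per-base state calls collapse into contiguous candidate regions.

```python
from itertools import groupby
from math import log2, sqrt
from statistics import mean, variance


def base_statistic(cov_a, cov_b, offset=0.5, s0=0.1):
    """Welch-style statistic on log2 coverage at one base.

    offset and s0 are hypothetical choices: offset avoids log2(0),
    s0 keeps the denominator away from zero when variances vanish.
    """
    a = [log2(x + offset) for x in cov_a]
    b = [log2(x + offset) for x in cov_b]
    se = sqrt(variance(a) / len(a) + variance(b) / len(b)) + s0
    return (mean(a) - mean(b)) / se


def call_state(cov_a, cov_b, min_cov=1.0, cutoff=3.0):
    """Assign one of four illustrative states to a single base."""
    if mean(list(cov_a) + list(cov_b)) < min_cov:
        return "not expressed"
    stat = base_statistic(cov_a, cov_b)
    if stat > cutoff:
        return "over in A"
    if stat < -cutoff:
        return "over in B"
    return "equally expressed"


def find_regions(cov_a, cov_b):
    """Collapse per-base states into (start, end, state) runs; end is exclusive.

    cov_a[i] and cov_b[i] hold the coverage at base i for each sample
    in group A and group B, respectively.
    """
    states = [call_state(a, b) for a, b in zip(cov_a, cov_b)]
    regions, pos = [], 0
    for state, run in groupby(states):
        length = len(list(run))
        regions.append((pos, pos + length, state))
        pos += length
    return regions


# Toy coverage for six bases, three samples per group (made-up numbers).
toy_a = [[0, 0, 0], [0, 0, 0], [20, 22, 19], [40, 44, 38], [41, 39, 45], [0, 0, 0]]
toy_b = [[0, 0, 0], [0, 0, 0], [21, 20, 20], [0, 1, 0], [0, 0, 1], [0, 0, 0]]
toy_regions = find_regions(toy_a, toy_b)
```

In this toy input, bases 3 and 4 are covered only in group A, so they merge into a single candidate region without reference to any annotation.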


Fig. 1.
(a) Structures of annotated transcripts in a 6 kb region of the human genome (corresponding gene ID: ENSG00000099917). A transcript structure this complex causes problems in annotate-then-identify pipelines, as there is no clear way to determine which transcript or exon generated each read, especially if there is a high degree of overlap between unique features, as shown in (b): here, we zoom in on the exon on the right-hand side of (a) and see four overlapping yet distinct regions. Biologically, this could indicate a single exon with a varying transcription end site, but analytically, it introduces four potential counting regions and requires a critical counting decision to be made. Using a method like DER Finder eliminates the need for these decisions: if just one transcript or one form of an exon is differentially expressed, the genomic regions that uniquely identify that transcript or exon form will be called differentially expressed, and further analysis can be done on the small region to determine the exact phenomenon causing the observed pattern.
Fig. 2.
Cases where DER Finder correctly calls differential expression and annotate-then-identify methods do not. (a) Example of an exon (from gene EIF1AY, Ensembl exon ID ENSE00001435537) whose location appears to be mis-annotated, leading EdgeR and DESeq to underestimate the exon's abundance and therefore incorrectly call this exon not differentially expressed. (b) Example of a DER (formula image) falling outside of an annotated exon, which can be found by DER Finder but not by EdgeR or DESeq. Although there are no annotated exons in this region, we believe this finding is more than noise because it is supported by the following annotated ESTs: DR001278, BF693629, BF672674, BM683941, BM931807, and CD356860 (GenBank accession numbers). Top panels: single-base resolution coverage (on log2 scale). Middle panels: t-statistics from linear model fit by DER Finder. Bottom panels: exon locations and state calls from DER Finder (light gray = not expressed, black = equally expressed, red or dark gray = overexpressed in men).
Fig. 3.
MA plots for Y-chromosome regions, transcripts, or exons, for each method and for both male vs. female (red) and male vs. male (blue) comparisons. On each plot, the x-axis represents the average log (base 2) abundance for each unit (region for DER Finder, transcript for Cufflinks, exon for EdgeR and DESeq), and the y-axis represents the log (base 2) fold change between males and females (red points) or the two groups of males (blue points). We expect to see the red, positively sloped diagonal on all plots: this represents genomic regions expressed in males but not in females. In DER Finder, EdgeR, and DESeq, this diagonal corresponds with differential expression detected; however, no differential expression was detected in Cufflinks even though the red diagonal exists as expected. The displayed M and A values for EdgeR and DESeq are normalized. Specifically, the EdgeR plot is logCPM vs. logFC, where logCPM is log2 counts-per-million and logFC is the log2 fold change (male to female); both are normalized for library size and dispersion and are reported in the output of the exactTest function. The DESeq plot is (log2(baseMeanA + 0.5) + log2(baseMeanB + 0.5))/2 vs. log2(baseMeanA + 0.5) − log2(baseMeanB + 0.5), where baseMeanA and baseMeanB represent library-size-normalized counts for males and females, respectively, and are reported in the output table from the function nbinomTest. Since baseMeanA and baseMeanB were sometimes 0, we added 0.5 as an offset to avoid calculating log2(0).
Fig. 4.
Percentage of significantly differentially expressed regions/transcripts/exons originating from male-to-female comparisons, using various percentiles of the p-value distribution as a significance cutoff. We find that most highly significant results are true positives, i.e. results with low p-values and high test statistics stem from comparing males with females, for DER Finder, EdgeR, and DESeq, while Cufflinks exhibits problems in this area.
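
The offset-based MA coordinates described in the Fig. 3 caption for DESeq can be written out as a short Python sketch. The function name `ma_values` is hypothetical; the inputs stand for the library-size-normalized means the caption calls baseMeanA and baseMeanB, and the 0.5 offset is the one stated there.

```python
from math import log2


def ma_values(base_mean_a, base_mean_b, offset=0.5):
    """Return (A, M): average log2 abundance and log2 fold change.

    The offset keeps the logarithm defined when a normalized count is 0.
    """
    log_a = log2(base_mean_a + offset)
    log_b = log2(base_mean_b + offset)
    a_value = (log_a + log_b) / 2  # x-axis: average log2 abundance
    m_value = log_a - log_b        # y-axis: log2 fold change (A over B)
    return a_value, m_value
```

Under this formula, a unit with baseMeanB equal to 0 has log2(0.5) = −1 in the second term, so M = 2A + 2. That is one way to read the positively sloped diagonal of male-only regions mentioned in the caption.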
