Biostatistics. 2012 Sep;13(4):734-47.
doi: 10.1093/biostatistics/kxs001. Epub 2012 Feb 21.

Modeling RNA degradation for RNA-Seq with applications

Lin Wan et al. Biostatistics. 2012 Sep.

Abstract

RNA-Seq is widely used in biological and biomedical studies. Methods for estimating transcript abundance from RNA-Seq data have been studied intensively, and many of them assume that short reads are uniformly distributed along transcripts. In practice, short reads are nonuniformly distributed along transcripts, which can greatly reduce the accuracy of methods that rely on the uniform assumption. Several methods have been developed to adjust for the biases induced by this nonuniformity by using the empirical distribution of short reads within transcripts. As an alternative, we found that RNA degradation plays a major role in producing the nonuniform short-read distribution, and we therefore developed a new approach that quantifies this distribution by precisely modeling RNA degradation. Our RNA degradation model fits RNA-Seq data quite well. Based on this model, we further developed a new statistical method that estimates the transcript expression level, as well as the RNA degradation rate, for individual genes and their isoforms. We show that our method can improve the accuracy of transcript isoform expression estimation. The estimated RNA degradation rates of individual transcripts are consistent across samples and across experiments and/or platforms. In addition, the RNA degradation rate from our model is independent of RNA length, consistent with previous studies of RNA decay rates.
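The caption of Fig. 2(a) describes per-exon read density (mapped reads divided by exon length) decreasing exponentially with the exon's distance to the 3' end, summarized by a linear regression. A minimal sketch of that kind of fit follows, assuming the regression is run on the logarithm of read density. The function name, the synthetic exon data, and the use of the negated slope as a decay rate are illustrative assumptions; the abstract does not give the paper's exact parameterization of α_g or its exon filtering rule.

```python
import numpy as np

def fit_log_linear_decay(distance_to_3p, read_counts, exon_lengths):
    """Regress log(read density) on distance from the 3' end.

    Returns (decay_rate, intercept, r_squared). decay_rate is the negated
    slope, so a positive value means density falls with distance.
    This is a hypothetical helper, not the authors' estimator.
    """
    d = np.asarray(distance_to_3p, dtype=float)
    density = np.asarray(read_counts, dtype=float) / np.asarray(exon_lengths, dtype=float)
    keep = density > 0  # the log is undefined for exons with no reads
    d, y = d[keep], np.log(density[keep])
    slope, intercept = np.polyfit(d, y, 1)  # ordinary least squares line
    residuals = y - (slope * d + intercept)
    r_squared = 1.0 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)
    return -slope, intercept, r_squared

# Synthetic exons (made-up values, not data from the paper).
rng = np.random.default_rng(0)
distance = np.array([0.1, 0.5, 1.0, 1.6, 2.2, 3.0])  # distance to the 3' end, arbitrary units
lengths = np.array([200, 150, 300, 120, 180, 250])   # exon lengths in bases
true_rate = 0.8                                       # assumed decay rate for the simulation
expected_density = 5.0 * np.exp(-true_rate * distance)
counts = rng.poisson(expected_density * lengths)      # noisy read counts per exon

rate, intercept, r2 = fit_log_linear_decay(distance, counts, lengths)
print(f"estimated decay rate = {rate:.3f}, R^2 = {r2:.3f}")
```

A positive estimated rate with a high R² corresponds to the pattern the caption reports for ACTB; genes with a poor fit would be candidates for exclusion, in the spirit of the R² ≥ 0.7 threshold mentioned for Fig. 2(c).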


Figures

Fig. 1.
Notation of the RNA degradation model for genes with multiple isoforms.
Fig. 2.
The RNA degradation model for RNA-Seq. (a) RNA degradation for the gene ACTB. The number of mapped short reads for each exon, divided by exon length, decreases exponentially as the distance of the exon to the 3' end of the transcript increases; the circles represent the exons, and the solid line is the linear regression result. The crosses show the locations of the exons we filtered out. (b) Histogram of the R² values of the linear regressions on the 1820 genes. The dashed line shows the median R². (c) Histogram of the estimated α_g values of the 945 genes with positive α_g and R² ≥ 0.7. The dashed line shows the median of the estimated α_g values. (d) The relationship between the value of α_g and transcript length. The gray circles are the 945 genes from (c); the curve is estimated by local regression (loess) on the 945 genes, performed with the R function “loess” with the default settings. All plots and results are based on Data set I.
Fig. 3.
The consistency of the RNA degradation rate. (a) The estimated values of α_g of the 625 common genes based on the liver and kidney samples of Data set I. (b) The estimated values of α_g of the 603 common genes based on the liver samples from Data sets I and II. The solid lines are the linear regression lines between the 2 samples in each plot. The ρ stands for the Pearson correlation coefficient.
Fig. 4.
Simulation results of MIRR when only one isoform is expressed within each gene (m = 1). Each plot shows the MIRRs on the genes with n isoforms (n = 2, 3, 4, 5); the number of genes is shown in parentheses in the title of each plot. In each plot, we compare our RNA degradation-based method (RD, shown in solid lines) with the uniform assumption-based method (UN, shown in dotted lines) by testing the simulated data with different combinations of parameters: α_g = 1, 3, 5, 7, 9 (shown with dots from “1” to “9”), ϕ = 1, 2, …, 10, and c = 1, 5, 10 (rows).
Fig. 5.
Simulation results of DS when only one isoform is expressed within each gene (m = 1). Each plot shows the averaged DSs on the genes with n isoforms (n = 2, 3, 4, 5); the number of genes is shown in parentheses in the title of each plot. In each plot, we compare our RNA degradation-based method (RD, shown in solid lines) with the uniform assumption-based method (UN, shown in dotted lines) by testing the simulated data with different combinations of parameters: α_g = 1, 3, 5, 7, 9 (shown with dots from “1” to “9”), ϕ = 1, 2, …, 10, and c = 1, 5, 10 (rows).
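The cross-sample comparison described for Fig. 3 (genes common to two samples, a regression line between their estimated degradation rates, and a Pearson correlation coefficient ρ) can be sketched as follows. The function name, gene identifiers, and rate values are hypothetical placeholders, not results from the paper.

```python
import numpy as np

def compare_rate_estimates(rates_a, rates_b):
    """Compare per-gene rate estimates from two samples.

    rates_a and rates_b map gene identifiers to estimated rates.
    Only genes present in both samples are compared.
    Returns (common_genes, pearson_rho, slope, intercept).
    """
    common = sorted(set(rates_a) & set(rates_b))
    a = np.array([rates_a[g] for g in common], dtype=float)
    b = np.array([rates_b[g] for g in common], dtype=float)
    rho = np.corrcoef(a, b)[0, 1]            # Pearson correlation coefficient
    slope, intercept = np.polyfit(a, b, 1)   # regression of sample B on sample A
    return common, rho, slope, intercept

# Made-up estimates for two samples; "geneF" and "geneG" are not shared.
liver = {"geneA": 1.2, "geneB": 3.4, "geneC": 0.8, "geneD": 5.1, "geneE": 2.7, "geneF": 4.0}
kidney = {"geneA": 1.0, "geneB": 3.9, "geneC": 1.1, "geneD": 4.6, "geneE": 2.9, "geneG": 2.2}

common, rho, slope, intercept = compare_rate_estimates(liver, kidney)
print(f"{len(common)} common genes, rho = {rho:.3f}, slope = {slope:.3f}")
```

Restricting the comparison to shared genes mirrors the "common genes" wording of the caption; a ρ close to 1 with a slope near 1 would indicate consistent estimates across the two samples.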
