Bioinformatics, 31 (17), 2778-84

Polyester: Simulating RNA-seq Datasets With Differential Transcript Expression

Alyssa C Frazee et al. Bioinformatics.

Abstract

Motivation: Statistical methods development for differential expression analysis of RNA sequencing (RNA-seq) requires software tools to assess accuracy and error rate control. Since true differential expression status is often unknown in experimental datasets, artificially constructed datasets must be utilized, either by generating costly spike-in experiments or by simulating RNA-seq data.

Results: Polyester is an R package designed to simulate RNA-seq data, beginning with an experimental design and ending with collections of RNA-seq reads. Its main advantage is the ability to simulate reads indicating isoform-level differential expression across biological replicates for a variety of experimental designs. Data generated by Polyester are a reasonable approximation to real RNA-seq data, and standard differential expression workflows can recover the differential expression that the user sets in the simulation.

Availability and implementation: Polyester is freely available from Bioconductor (http://bioconductor.org/).

Contact: jtleek@gmail.com

Supplementary information: Supplementary data are available at Bioinformatics online.
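
As a rough illustration of what "beginning with an experimental design" can mean, the following Python sketch turns a two-group design with per-transcript fold changes into per-replicate read counts. It is not Polyester's code or API: Polyester is an R package, and the function names, the baseline counts, the fold changes and the log-normal replicate noise used here are all assumptions made for illustration.

```python
import math
import random


def expected_counts(baseline, fold_changes, group_of_sample):
    """Expected reads per transcript (rows) and sample (columns).

    baseline[t]: hypothetical mean read count for transcript t
    fold_changes[t][g]: multiplier applied to transcript t in group g
    group_of_sample[s]: group index of sample s (the experimental design)
    """
    return [
        [baseline[t] * fold_changes[t][g] for g in group_of_sample]
        for t in range(len(baseline))
    ]


def add_replicate_noise(means, sd_log=0.2, seed=0):
    """Perturb each expected count with log-normal noise.

    The noise model is an assumption chosen for brevity; it only stands in
    for variability between biological replicates.
    """
    rng = random.Random(seed)
    return [
        [max(0, round(m * math.exp(rng.gauss(0.0, sd_log)))) for m in row]
        for row in means
    ]


# Hypothetical design: 3 transcripts, 2 groups, 3 replicates per group.
baseline = [100, 400, 50]
fold_changes = [
    [1, 1],  # transcript 0: not differentially expressed
    [1, 2],  # transcript 1: doubled in group 1
    [3, 1],  # transcript 2: tripled in group 0
]
group_of_sample = [0, 0, 0, 1, 1, 1]

means = expected_counts(baseline, fold_changes, group_of_sample)
counts = add_replicate_noise(means)
```

Because the fold changes are set by the user, the true differential expression status of every transcript is known, which is what makes accuracy and error rate assessments possible on simulated data.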

Figures

Fig. 1.
Fragment length distributions available in Polyester. The skewed curve shows the fragment length distribution for selected sequencing reads from the GEUVADIS RNA-seq dataset; the other curve shows a normal distribution with mean 250 and SD 25. These two fragment length models are built into the simulator; users can also supply their own.
Fig. 2.
Example error model available in Polyester. Empirical error model derived from TruSeq SBS Kit v5-GA chemistry, using Illumina Genome Analyzer IIx, for mate 1 of a paired-end read. Separate panels are shown for each possible true reference nucleotide. Each panel illustrates the probability (y axis) of mis-sequencing that reference nucleotide in a given read position (x axis) as any of the three other nucleotides, or as an ‘N’ (indicating an ‘unknown’ nucleotide in the read). As expected, error probabilities increase toward the end of the read. Other error models, including the model for mate 2 of the read on this protocol, are illustrated in Supplementary Figures S3–S7. If these error models are not suitable, custom error models can be estimated from any set of aligned sequencing reads.
Fig. 3.
Coverage comparison to GEUVADIS dataset. We counted the number of reads estimated to have originated from each of these annotated transcripts from gene CD83 (bottom half of figure) in the GEUVADIS RNA-seq dataset, then simulated that same number of reads from each transcript using Polyester and processed those simulated reads. This figure shows the coverage track (y axis, indicating number of reads with alignments overlapping the specified genomic position) for sample NA06985, reads simulated without positional bias, and reads simulated using the rnaf bias model. While the simulated coverage tracks look a bit cleaner than the track from the GEUVADIS dataset, many of the major within-exon coverage patterns are captured in the simulation, especially with the uniform model. For example, both simulations capture the peak at the beginning of the rightmost exon. Note: the dotted line indicates that part of a long intron at that location was not illustrated in this plot.
Fig. 4.
ROC curves for transcript-level differential expression calls from Polyester datasets. Sensitivity and specificity from the simulation experiments are shown for varying significance (P- or q-value) cutoffs. As expected, differential expression was more difficult to detect under conditions where expression levels were highly variable between replicates.
Fig. 5.
Coefficient distributions from differential expression models. Distributions from the high-variance scenario are shown in (a) and from the low-variance scenario in (b). As expected, these distributions of estimated log fold changes between the two simulation groups tend to be centered around the values specified at the beginning of the simulation, and there is more variability in the coefficient estimates for the high-variance scenario.
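
The normal fragment length model described for Fig. 1 (mean 250, SD 25), combined with fragment positions drawn without positional bias as in Fig. 3, can be sketched in a few lines of Python. This is a minimal sketch and not Polyester's implementation: the function names, the random transcript sequence, the 100-base read length and the clipping of fragment lengths to the transcript length are all assumptions introduced here.

```python
import random

# Translation table for complementing DNA bases; 'N' maps to itself.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def sample_fragment(transcript, rng, mean=250.0, sd=25.0):
    """Draw one fragment: normally distributed length, uniform start.

    Lengths are clipped to [1, len(transcript)], which is an assumption
    made so the sketch never indexes outside the transcript.
    """
    length = int(round(rng.gauss(mean, sd)))
    length = max(1, min(length, len(transcript)))
    start = rng.randint(0, len(transcript) - length)  # inclusive bounds
    return transcript[start:start + length]


def paired_end_reads(fragment, read_len=100):
    """Read mate 1 from the fragment start and mate 2 from the other strand."""
    mate1 = fragment[:read_len]
    mate2 = reverse_complement(fragment)[:read_len]
    return mate1, mate2


rng = random.Random(42)
# Hypothetical 2000-base transcript made of random bases.
transcript = "".join(rng.choice("ACGT") for _ in range(2000))

fragments = [sample_fragment(transcript, rng) for _ in range(1000)]
reads = [paired_end_reads(f) for f in fragments]
mean_len = sum(len(f) for f in fragments) / len(fragments)
```

Swapping `rng.gauss` for draws from an empirical length distribution, or replacing the uniform start with a position-dependent weighting, would correspond to the other fragment length and bias options mentioned in the captions.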
