Nat Protoc, 8 (8), 1494-512

De Novo Transcript Sequence Reconstruction From RNA-seq Using the Trinity Platform for Reference Generation and Analysis

Brian J Haas et al. Nat Protoc.

Abstract

De novo assembly of RNA-seq data enables researchers to study transcriptomes without the need for a genome sequence; this approach can be usefully applied, for instance, in research on 'non-model organisms' of ecological and evolutionary importance, cancer samples or the microbiome. In this protocol we describe the use of the Trinity platform for de novo transcriptome assembly from RNA-seq data in non-model organisms. We also present Trinity-supported companion utilities for downstream applications, including RSEM for transcript abundance estimation, R/Bioconductor packages for identifying differentially expressed transcripts across samples and approaches to identify protein-coding genes. In the procedure, we provide a workflow for genome-independent transcriptome analysis leveraging the Trinity platform. The software, documentation and demonstrations are freely available from http://trinityrnaseq.sourceforge.net. The run time of this protocol is highly dependent on the size and complexity of data to be analyzed. The example data set analyzed in the procedure detailed herein can be processed in less than 5 h.

Figures

Figure 1. Overview of Trinity assembly and analysis pipeline
Shown are the key sequential steps in Trinity (left) and the associated compute resources (right). Trinity takes as input short reads (top left) and first uses the Inchworm module to construct contigs. This requires a single high-memory server (~1 GB of RAM per 1 million paired reads, although this varies with read complexity; top right). Chrysalis (middle left) clusters related Inchworm contigs, often generating tens to hundreds of thousands of Inchworm contig clusters, each of which is processed into a de Bruijn graph component independently and in parallel on a computing grid (bottom right). Butterfly (bottom left) then extracts all probable sequences from each graph component, which can be parallelized as well.
Figure 2. De novo transcriptome assembly and analysis workflow
Reads from multiple samples (e.g., different tissues, top) are combined into a single data set. Reads may be optionally normalized to reduce read counts while retaining read diversity and sample complexity. The combined read set is assembled by Trinity to generate a ‘reference’ de novo transcriptome assembly (right). Protein coding regions can be extracted from the reference assembly using TransDecoder and further characterized according to likely functions based on sequence homology or domain content. Separately, sample-specific expression analysis is performed by aligning the original sample reads to the reference transcriptome assembly on a per sample basis, followed by abundance estimation using RSEM. Differentially expressed transcripts are identified by applying Bioconductor software, such as edgeR, to a matrix containing the RSEM abundance estimates (number of RNA-Seq fragments mapped to each transcript from each sample). Differentially expressed transcripts can then be further grouped according to their expression patterns.
Figure 3. Abundance estimation via Expectation Maximization by RSEM
Shown is an illustrative example of abundance estimation for two transcripts with shared (blue) and unique (red, yellow) sequences. To estimate transcript abundances, RNA-Seq reads (short bars) are first aligned to the transcript sequences (long bars, bottom). Unique regions of isoforms will capture uniquely-mapping RNA-Seq reads (red and yellow short bars), and shared sequences between isoforms will capture multiply-mapping reads (blue short bars). An Expectation Maximization algorithm, implemented in the RSEM software, estimates the most likely relative abundances of the transcripts and then fractionally assigns reads to the isoforms based on these abundances. The assignments of reads to isoforms resulting from iterations of expectation maximization are illustrated as filled short bars (right), and those assignments eliminated are shown as hollow. Note that assignments of multiply-mapped reads are in fact performed fractionally according to a maximum likelihood estimate. Thus, in this example, a higher fraction of each read is assigned to the more highly expressed top isoform than to the bottom isoform.
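The fractional assignment described in this caption can be sketched for the two-isoform case. This is a deliberately simplified illustration, not RSEM's implementation: it assumes both isoforms have equal effective length, ignores alignment quality and sequencing error, and the function name and read counts are hypothetical.

```python
def em_two_isoforms(n_unique_a, n_unique_b, n_shared, iterations=50):
    """Toy EM for two isoforms, given unique and shared (multiply-mapping) read counts.

    Assumption: equal effective lengths, so relative abundance is
    proportional to the expected number of reads assigned to each isoform.
    """
    total = n_unique_a + n_unique_b + n_shared
    theta_a, theta_b = 0.5, 0.5  # start from equal relative abundances
    shared_a = shared_b = n_shared / 2
    for _ in range(iterations):
        # E-step: split the shared reads in proportion to current abundances.
        shared_a = n_shared * theta_a / (theta_a + theta_b)
        shared_b = n_shared - shared_a
        # M-step: re-estimate abundances from the expected read counts.
        theta_a = (n_unique_a + shared_a) / total
        theta_b = (n_unique_b + shared_b) / total
    return theta_a, theta_b, shared_a, shared_b


# Hypothetical counts: 30 reads unique to isoform A, 10 unique to B, 20 shared.
theta_a, theta_b, shared_a, shared_b = em_two_isoforms(30, 10, 20)
print(round(theta_a, 3), round(theta_b, 3), round(shared_a, 2), round(shared_b, 2))
```

With these counts the more highly expressed isoform A receives the larger fraction of the shared reads, matching the behavior the caption describes.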
Figure 4. Effects of in silico fragment normalization of RNA-Seq data on Trinity full-length transcript reconstruction
Shown is the number of full-length transcripts reconstructed (Y axis) from a dataset of paired-end strand-specific RNA-Seq in S. pombe (a, 10M paired-end reads) or mouse (b, 100M paired-end reads), using either the full dataset (Total; 100%) or different samplings (X axis) obtained either by Trinity’s in silico normalization procedure at 5X up to 100X targeted maximum k-mer (k=25) coverage (blue bars) or by random down-sampling to the same number of reads (red bars).
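The idea of normalizing to a targeted maximum k-mer coverage can be illustrated with a small sketch. It assumes that each read is retained with probability min(1, target / C), where C is the median abundance of the read's k-mers across the data set; this is an illustrative approximation, not Trinity's own code, which operates on fragments and much larger data. All names and the toy reads are hypothetical.

```python
import random
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return all overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize_reads(reads, k=25, max_cov=30, seed=0):
    """Down-sample reads so that highly covered reads are kept less often."""
    rng = random.Random(seed)
    counts = Counter(km for read in reads for km in kmers(read, k))
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read is shorter than k
        coverage = median(counts[km] for km in read_kmers)
        # Keep the read with probability min(1, max_cov / coverage).
        if rng.random() < min(1.0, max_cov / coverage):
            kept.append(read)
    return kept


# Toy input: one sequence seen 100 times and one seen only once.
reads = ["ACGTACGTAC"] * 100 + ["TTTTGGGGCC"]
kept = normalize_reads(reads, k=5, max_cov=10)
print(len(kept), "TTTTGGGGCC" in kept)
```

Unlike random down-sampling, this scheme always retains the rarely observed read while discarding most copies of the abundant one, which is the contrast the figure draws.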
Figure 5. Transcriptome and genome representations of alternatively spliced transcripts
Shown is an example of the graphical representation generated by Trinity’s Butterfly software (a) along with the corresponding reconstructed transcripts (b) and their exonic structure based on the alignment to the mouse genome (c). Each node in the graph (a) is associated with a sequence, and directed edges connect consecutive sequences from 5′ to 3′ in the same transcript. Bulges (bifurcations) indicate sequence differences between alternative reconstructed transcripts, including alternatively spliced cassette exons; only a single bulge is shown in this transcript graph. Edges are annotated by the number of RNA-Seq fragments supporting the transcript from the 5′ sequence to the 3′ one. In this example, there are two supported paths: one from the blue to the green node (supported by 32 fragments) yielding ‘isoform A’ (b, top), and the other from the blue to the red to the green node, supported by at most 5 fragments, yielding ‘isoform B’ (b, bottom). The red node is a result of an alternatively skipped exon, as apparent in the gene structure (c, red bar, shown in ‘isoform B’). Navigable transcript graphs are optionally generated by Butterfly, provided in ‘dot’ format and can be visualized using graphviz (http://graphviz.org). These details are provided on the Trinity website (http://trinityrnaseq.sourceforge.net/advanced_trinity_guide.html).
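The relationship between paths in the transcript graph and reconstructed isoforms can be sketched as a simple path enumeration. This is only an illustration of reading transcripts off an acyclic graph; Butterfly's actual procedure also uses read and read-pair information to decide which paths are supported. The node names follow the caption, while the per-edge fragment counts on the red path are assumed values.

```python
def enumerate_paths(edges, source, sink):
    """List every source-to-sink path with the weakest edge support along it.

    `edges` maps (from_node, to_node) to a fragment count.
    Assumption: the graph is acyclic.
    """
    graph = {}
    for (u, v), support in edges.items():
        graph.setdefault(u, []).append((v, support))

    results = []

    def walk(node, path, min_support):
        if node == sink:
            results.append((path, min_support))
            return
        for nxt, support in graph.get(node, []):
            walk(nxt, path + [nxt], min(min_support, support))

    walk(source, [source], float("inf"))
    return results


# Edge supports: 32 is from the caption; the two values of 5 are assumed.
edges = {("blue", "green"): 32, ("blue", "red"): 5, ("red", "green"): 5}
for path, support in enumerate_paths(edges, "blue", "green"):
    print(" -> ".join(path), support)
```

The two printed paths correspond to ‘isoform A’ (blue to green) and ‘isoform B’ (blue to red to green).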
Figure 6. Strand-specific library types
The left (/1) and right (/2) sequencing reads are depicted according to their orientations relative to the sense strand of a transcript sequence. The strand-specific library type (F, R, FR, or RF) depends on the library construction protocol and is user-specified to Trinity via the ‘--SS_lib_type’ parameter.
Figure 7. Full-length transcript reconstruction by Trinity in different organisms, sequencing depths, and parameters
Shown is the number of fully reconstructed transcripts (bars, left Y axis) for Trinity assemblies of RNA-Seq data derived from fission yeast (Schizosaccharomyces pombe), Drosophila melanogaster, and mouse with different combinations of parameters: DS – double stranded mode, SS – strand-specific mode, +J – using the ‘--jaccard_clip’ parameter to split falsely fused transcripts. Both SS and DS results are provided for S. pombe and mouse, but only DS results are provided for Drosophila since its RNA-Seq data were not strand-specific. Blue: full-length transcripts; red: full-length merged, i.e., transcripts erroneously fused with another (typically neighboring) transcript. The black curve (right Y axis) indicates the run times in each case with a contemporary high-memory (256 to 512 GB RAM) server using a maximum of 4 threads (‘--CPU 4’, see Tutorial).
Figure 8. Evaluating paired-read support via the Jaccard similarity coefficient
Read pair support is computed by first counting the number of RNA-Seq fragments (bounds of paired reads) that span each of two outer points of a specified window length (default: 100 bases), and then computing the Jaccard similarity coefficient (intersection/union) comparing the fragments that overlap either point. An example is shown for a neighboring pair of S. pombe transcripts (SPAC23C4.14 and SPAC23C4.15, bottom) that have substantial overlapping read coverage (gray track), resulting in a contiguous (fused) transcript assembled by Inchworm. However, the Jaccard similarity coefficient (blue track) calculated from the paired reads (gray dumbbells) clearly identifies the position of reduced pair support. Examples of strong (upper left) and weak (upper right) pair support are depicted at top. When using the ‘--jaccard_clip’ parameter, the Inchworm contig is split into two separate full-length transcripts, which are then further processed by Chrysalis and Butterfly as part of the Trinity pipeline.
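The pair-support calculation described in this caption can be sketched as follows. This is a minimal illustration assuming each fragment is given as an inclusive (start, end) coordinate pair on the contig; the function name and coordinates are hypothetical and this is not Trinity's own code.

```python
def jaccard_pair_support(fragments, left, right):
    """Jaccard coefficient of fragments spanning `left` vs. those spanning `right`.

    `fragments` is a list of (start, end) bounds of paired reads;
    `left` and `right` are the two outer points of the window.
    """
    span_left = {i for i, (s, e) in enumerate(fragments) if s <= left <= e}
    span_right = {i for i, (s, e) in enumerate(fragments) if s <= right <= e}
    union = span_left | span_right
    if not union:
        return 0.0  # no fragment covers either point
    return len(span_left & span_right) / len(union)


# Strong support: most fragments span both ends of a 100-base window.
strong = [(0, 300), (50, 400), (120, 500)]
# Weak support: fragments stop before, or start after, the window interior.
weak = [(0, 150), (0, 140), (180, 400), (190, 420)]

print(jaccard_pair_support(strong, 100, 200))
print(jaccard_pair_support(weak, 100, 200))
```

A coefficient near 1 indicates that the same fragments bridge both points, while a value near 0 marks a position where two neighboring transcripts may have been fused.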
Figure 9. Pairwise comparisons of transcript abundance
Shown are two visualizations for comparing transcript expression profiles between the logarithmic growth and plateau growth samples from S. pombe. (a) MA-plot for differential expression analysis generated by edgeR, plotting for each gene its log2(fold change) between the two samples (M, Y axis) vs. its log2(average expression) in the two samples (A, X axis). (b) Volcano plot showing the false discovery rate (-log10(FDR), Y axis) as a function of log2(fold change) between the samples (logFC, X axis). Transcripts identified as significantly differentially expressed at an FDR of at most 0.1% are colored red.
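The two coordinates of a point in such a pairwise plot can be computed as in the sketch below. It uses the classic definition (difference and mean of the log2 values) with an assumed pseudocount to avoid taking the log of zero; edgeR computes its own normalized quantities, so this illustrates the idea rather than reproducing edgeR's output.

```python
import math


def log_ratio_and_mean(expr_1, expr_2, pseudocount=1.0):
    """Return (log2 fold change, mean log2 expression) for one transcript.

    Assumption: a pseudocount is added to both values before the log.
    """
    x = math.log2(expr_1 + pseudocount)
    y = math.log2(expr_2 + pseudocount)
    log_fc = y - x               # vertical coordinate: log2(fold change)
    mean_log = 0.5 * (x + y)     # horizontal coordinate: average log2 expression
    return log_fc, mean_log


# Hypothetical expression values for one transcript in two samples.
print(log_ratio_and_mean(7, 31))  # (2.0, 4.0)
```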
Figure 10. Comparisons of transcriptional profiles across samples
(a) Hierarchical clustering of transcripts and samples. The heatmap shows the relative expression level of each transcript (rows) in each sample (columns). Rows and columns are hierarchically clustered. Expression values (FPKM) are log2 transformed and then median-centered by transcript. (b) Heatmap showing the hierarchically clustered Spearman correlation matrix resulting from comparing the transcript expression values (TMM-normalized FPKM) for each pair of samples. (c) Transcript clusters, extracted from the hierarchical clustering using R. X axis: samples (DS: diauxic shift; HS: heat shock; Log: mid-log growth; Plat: plateau growth); Y axis: median-centered log2(FPKM). Gray lines: individual transcripts; blue line: average expression values per cluster. The number of transcripts in each cluster is shown in a left-hand corner of each plot.
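The transformation applied before clustering (log2 transform, then median-centering by transcript) can be sketched as below. The pseudocount is an assumption added to handle zero FPKM values and is not stated in the caption; the function name and input values are hypothetical.

```python
import math
from statistics import median


def median_center_log2(fpkm_rows, pseudocount=1.0):
    """Log2-transform each transcript's FPKM values, then subtract the row median.

    `fpkm_rows` holds one list per transcript, with one value per sample.
    """
    centered = []
    for row in fpkm_rows:
        logged = [math.log2(v + pseudocount) for v in row]
        row_median = median(logged)
        centered.append([v - row_median for v in logged])
    return centered


# One hypothetical transcript measured in four samples.
print(median_center_log2([[0, 1, 3, 7]]))  # [[-1.5, -0.5, 0.5, 1.5]]
```

Centering each row on its median means the heatmap colors reflect a transcript's relative change across samples rather than its absolute expression level.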

Cited by 1,767 PubMed Central articles