, 29 (7), 644-52

Full-length Transcriptome Assembly From RNA-Seq Data Without a Reference Genome

Manfred G Grabherr et al. Nat Biotechnol.

Abstract

Massively parallel sequencing of cDNA has enabled deep and efficient probing of transcriptomes. Current approaches for transcript reconstruction from such data often rely on aligning reads to a reference genome, and are thus unsuitable for samples with a partial or missing reference genome. Here we present the Trinity method for de novo assembly of full-length transcripts and evaluate it on samples from fission yeast, mouse and whitefly, whose reference genome is not yet available. By efficiently constructing and analyzing sets of de Bruijn graphs, Trinity fully reconstructs a large fraction of transcripts, including alternatively spliced isoforms and transcripts from recently duplicated genes. Compared with other de novo transcriptome assemblers, Trinity recovers more full-length transcripts across a broad range of expression levels, with a sensitivity similar to methods that rely on genome alignments. Our approach provides a unified solution for transcriptome reconstruction in any sample, especially in the absence of a reference genome.
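The role of the de Bruijn graph in the abstract can be illustrated with a minimal sketch. This is not Trinity's implementation: the function names, the toy sequences and the choice of k = 5 are all assumptions made for illustration, and the sketch ignores sequencing errors, read coverage and reverse complements. It only shows how two isoforms that share exons appear as two paths through one graph.

```python
from collections import defaultdict

def build_de_bruijn(reads, k):
    """Nodes are (k-1)-mers; every k-mer seen in a read adds one edge."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def enumerate_paths(graph):
    """Spell every source-to-sink path (assumes a small, acyclic graph)."""
    targets = {t for successors in graph.values() for t in successors}
    sources = [node for node in graph if node not in targets]
    paths = []
    stack = [(s, s) for s in sources]
    while stack:
        node, seq = stack.pop()
        successors = graph.get(node, ())
        if not successors:
            paths.append(seq)  # reached a sink: one candidate transcript
        for nxt in successors:
            stack.append((nxt, seq + nxt[-1]))
    return sorted(paths)

# Hypothetical toy isoforms: iso_b skips the middle "exon" of iso_a.
iso_a = "ATGGCGTCCATTAGGACTTGA"
iso_b = "ATGGCGTGACTTGA"

# Simulated error-free reads: every 8-base window of each isoform.
reads = [iso[i:i + 8] for iso in (iso_a, iso_b) for i in range(len(iso) - 7)]

graph = build_de_bruijn(reads, k=5)
paths = enumerate_paths(graph)
print(paths)
```

In this toy case the graph branches where the isoforms diverge and merges where they rejoin, so the two source-to-sink paths spell the two isoforms. Real data would need error trimming and read support to pick among paths, as the Butterfly step in Figure 1 describes.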

Conflict of interest statement

The authors declare no competing financial interest.

Figures

Figure 1. Overview of Trinity
(a) Inchworm assembles the read data set (short black line, top) by greedily searching for paths in a k-mer graph (middle), resulting in a collection of linear contigs (color lines, bottom), with each k-mer present only once in the contigs. (b) Chrysalis pools contigs if they share at least one k-1-mer and reads span the join, and builds individual de Bruijn graphs from each pool (colored lines). (c) Butterfly takes each de Bruijn graph from Chrysalis (top), and trims spurious edges and compacts linear paths (middle). It then reconciles the graph with reads (dashed colored arrows, bottom) and pairs (not shown), and outputs one linear sequence for each splice form and/or paralogous transcript reflected in the graph (bottom, colored sequences).
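Panels (a) and (b) can be sketched in a few lines, under loud assumptions. The code below is not the Inchworm or Chrysalis source: the function names, the tie-breaking rule, the toy sequences and k = 5 are hypothetical. The pooling step checks only for a shared (k-1)-mer and omits the requirement that reads span the join.

```python
from collections import Counter

def count_kmers(reads, k):
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def greedy_contigs(reads, k):
    """Greedy extension in the spirit of panel (a); each k-mer is used once."""
    counts = count_kmers(reads, k)
    used, contigs = set(), []

    def best_unused(candidates):
        free = [c for c in candidates if c in counts and c not in used]
        # Most abundant candidate wins; ties broken alphabetically (assumption).
        return max(free, key=lambda c: (counts[c], c)) if free else None

    # Seed with the most abundant unused k-mer first.
    for seed, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if seed in used:
            continue
        used.add(seed)
        contig = seed
        while True:  # extend to the right
            nxt = best_unused([contig[-(k - 1):] + b for b in "ACGT"])
            if nxt is None:
                break
            used.add(nxt)
            contig += nxt[-1]
        while True:  # extend to the left
            prev = best_unused([b + contig[:k - 1] for b in "ACGT"])
            if prev is None:
                break
            used.add(prev)
            contig = prev[0] + contig
        contigs.append(contig)
    return contigs

def pool_contigs(contigs, k):
    """Group contigs sharing at least one (k-1)-mer (read-span check omitted)."""
    parent = list(range(len(contigs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_owner = {}
    for idx, contig in enumerate(contigs):
        for i in range(len(contig) - k + 2):
            sub = contig[i:i + k - 1]
            if sub in first_owner:
                parent[find(idx)] = find(first_owner[sub])
            else:
                first_owner[sub] = idx
    pools = {}
    for idx, contig in enumerate(contigs):
        pools.setdefault(find(idx), []).append(contig)
    return sorted(pools.values())

# Hypothetical toy transcripts: iso_b is a rarer isoform of iso_a; iso_c is unrelated.
iso_a = "ATGGCGTCCATTAGGACTTGA"
iso_b = "ATGGCGTGACTTGA"
iso_c = "CAAGTCTACGCA"
reads = [iso_a] * 3 + [iso_b] + [iso_c] * 2

contigs = greedy_contigs(reads, k=5)
pools = pool_contigs(contigs, k=5)
print(contigs)
print(pools)
```

With these toy reads the dominant isoform is assembled in full, while the rarer isoform leaves only a short contig covering its unique junction, because the shared k-mers were already used. Pooling then places that fragment with the dominant isoform, giving one pool per locus from which a de Bruijn graph could be built.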
Figure 2. Trinity correctly reconstructs the majority of full-length transcripts in fission yeast and mouse
(a,c) Shown is the fraction of Oracle genes fully reconstructed in different expression quintiles (5% increments) in fission yeast (50M pairs assembly) (a) and the fraction of Oracle genes with at least one transcript fully reconstructed in different expression quintiles in mouse (53M pairs assembly) (c). Each bar represents a 5% quintile of read coverage for genes expressed. Bar height is the fraction of annotated genes in that quintile and among the Oracle set (grey) or the subset of the Oracle set that are fully reconstructed by Trinity (blue). For example, ~36% of the S. pombe transcripts at the bottom 5% of expression levels are fully reconstructed by Trinity; ~45% of the transcripts in this quintile are in the Oracle set. (b, d) Shown are the median values for coverage by length of reference transcripts by the longest corresponding Trinity-assembled transcript, according to expression quintiles in yeast (b) and mouse (d), depending on the number of read pairs that went into each assembly.
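The per-bin fractions plotted in panels (a) and (c) could be computed along the following lines. This is a sketch under assumptions: the function name, the input layout (one expression value and one "fully reconstructed" flag per gene) and the synthetic data are hypothetical and are not taken from the paper.

```python
def fraction_by_expression_bin(genes, n_bins=20):
    """genes: list of (expression, fully_reconstructed) pairs.

    Genes are ranked by expression and split into n_bins equal-sized bins
    (20 bins gives the 5% increments used in the figure). Returns the
    fraction of fully reconstructed genes per bin, lowest expression first.
    """
    ranked = sorted(genes, key=lambda g: g[0])
    n = len(ranked)
    fractions = []
    for b in range(n_bins):
        members = ranked[b * n // n_bins:(b + 1) * n // n_bins]
        if members:
            hits = sum(1 for _, ok in members if ok)
            fractions.append(hits / len(members))
        else:
            fractions.append(None)  # fewer genes than bins
    return fractions

# Synthetic example: 40 genes; low-expression genes are reconstructed
# only half the time, higher-expression genes always.
genes = [(float(i), i >= 10 or i % 2 == 0) for i in range(40)]
fractions = fraction_by_expression_bin(genes)
print(fractions)
```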
Figure 3. Trinity improves the yeast annotation
Shown are examples of Trinity assemblies (red) along with the corresponding annotated transcripts (blue) and underlying reads (grey) all aligned to the S. pombe genome (read alignment shown for graphical clarity; no alignments were used to generate the assemblies). (a) Trinity identifies a new multi-exonic transcript (left) and extends the 5′ and 3′ UTRs of the Coq9 gene (right). (b) Trinity extends the UTRs of two convergently transcribed and overlapping genes.
Figure 4. Trinity resolves closely paralogous genes
(a) Shown is the compacted component graph for two paralogous mouse genes, Ddx19a and Ddx19b (93% identity), highlighting the two paths (red and blue) chosen by Trinity out of the 64 possible paths in this portion alone. (b) Shown are the alignments between the transcripts represented by the red and blue paths in (a) and the paralogous genes Ddx19a and Ddx19b relative to the mouse reference genome (genome alignment shown for graphical clarity only; no alignments were used to generate the assemblies).
Figure 5. Comparison of Trinity to other mapping-first and assembly-first methods
(a,b) Evaluation based on number of full-length annotated transcripts reconstructed by each method in S. pombe (50M read pair assemblies) (a) and mouse (53M read pair assemblies) (b). Shown is the number of genes reconstructed in full length (blue) or as fusions of two full-length genes (green, yeast only) and the number of full length reconstructed transcript isoforms (red, mouse only) in each of four ‘assembly first’ (de novo) and two ‘mapping first’ approaches. (c,d) Evaluation based on the number of introns defined by the transcripts from each method for S. pombe (c) and mouse (d). Shown is the number of distinct introns consistent with the reference annotation (y axis) versus the number of uniquely predicted introns (x axis), based on mapping to the genome of the transcripts reconstructed by each of Trinity (red), Trans-ABySS (yellow), ABySS (blue), SOAPdenovo (green), Scripture (purple) and Cufflinks (grey). (e,f) Evaluation based on the number of splicing patterns (complete sets of introns in multi-intronic transcripts) defined by the transcripts from each method for S. pombe (e) and mouse (f). Shown are the numbers of distinct splicing patterns (y axis) consistent with the reference annotation versus the number of unique splicing patterns (x axis), for each method (methods are colored as above).
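The intron and splicing-pattern counts in panels (c) to (f) amount to set comparisons, which can be sketched as below. Several things here are assumptions: the function names, the tuple layout for an intron, the toy coordinates, and the reading of "uniquely predicted" as "absent from the reference annotation", which the caption does not spell out.

```python
def compare_to_reference(predicted, reference):
    """Return (count consistent with reference, count absent from reference)."""
    predicted, reference = set(predicted), set(reference)
    return len(predicted & reference), len(predicted - reference)

def all_introns(transcripts):
    """Distinct introns across all transcripts."""
    return {intron for introns in transcripts.values() for intron in introns}

def splicing_patterns(transcripts):
    """Complete intron sets of multi-intronic transcripts (two or more introns)."""
    return {frozenset(introns) for introns in transcripts.values()
            if len(introns) >= 2}

# Hypothetical introns as (chromosome, start, end, strand).
i1 = ("chr1", 100, 200, "+")
i2 = ("chr1", 300, 400, "+")
i3 = ("chr2", 50, 90, "-")
i4 = ("chr1", 300, 450, "+")  # alternative 3' end, not annotated

reference = {"t1": [i1, i2], "t2": [i3]}
assembled = {"a1": [i1, i2], "a2": [i1, i4], "a3": [i3]}

intron_counts = compare_to_reference(all_introns(assembled), all_introns(reference))
pattern_counts = compare_to_reference(splicing_patterns(assembled),
                                      splicing_patterns(reference))
print(intron_counts, pattern_counts)
```

Each method would contribute one such pair of counts per panel: the first value on the y axis and the second on the x axis.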
Figure 6. Trinity reconstructs polymorphic transcripts in whitefly
(a) Allelic variation evident from mapping RNA-Seq reads to a Trinity-reconstructed full-length whitefly transcript. Top: Shown is a single transcript (top, red bar), orthologous to the D. melanogaster Lamin gene, determined by grouping of allelic variant transcripts generated by Trinity. SNPs: yellow bars. Middle: Cumulative read coverage along the transcripts; colored bars: SNPs; bar height: relative proportions of SNP variants. Blue: C, red: T, orange: G, green: A. Bottom: Individual read coverage. (b) Example of two alternatively spliced transcripts resolved even in the absence of a reference genome. Top: Shown are two isoforms of an ELAV-like gene (top) reconstructed by Trinity (grey boxes, alternative exons). Exon structure is determined for visualization by the D. melanogaster ortholog. Bottom: shown is the protein sequence alignment of the two whitefly isoforms to orthologous proteins from other insects, confirming the splice variants (grey boxes). (c) Comparison of performance in de novo assembly of the whitefly transcriptome. For each of the methods, shown is the number of unique top-matching (blastx) uniref90 protein sequences aligned across the corresponding minimum percent protein length value at >= 80% (blue), >= 90% (green), >= 95% (orange) and 100% (red).
