PLoS Biol. 2010 May 18;8(5):e1000371.
doi: 10.1371/journal.pbio.1000371.

Most "Dark Matter" Transcripts Are Associated With Known Genes

Free PMC article

Harm van Bakel et al. PLoS Biol.


A series of reports over the last few years have indicated that a much larger portion of the mammalian genome is transcribed than can be accounted for by currently annotated genes, but the quantity and nature of these additional transcripts remain unclear. Here, we have used data from single- and paired-end RNA-Seq and tiling arrays to assess the quantity and composition of transcripts in PolyA+ RNA from human and mouse tissues. Relative to tiling arrays, RNA-Seq identifies many fewer transcribed regions ("seqfrags") outside known exons and ncRNAs. Most nonexonic seqfrags are in introns, raising the possibility that they are fragments of pre-mRNAs. The chromosomal locations of the majority of intergenic seqfrags in RNA-Seq data are near known genes, consistent with alternative cleavage and polyadenylation site usage, promoter- and terminator-associated transcripts, or new alternative exons; indeed, reads that bridge splice sites identified 4,544 new exons, affecting 3,554 genes. Most of the remaining seqfrags correspond to either single reads that display characteristics of random sampling from a low-level background or several thousand small transcripts (median length = 111 bp) present at higher levels, which also tend to display sequence conservation and originate from regions with open chromatin. We conclude that, while there are bona fide new intergenic transcripts, their number and abundance are generally low in comparison to known exons, and the genome is not as pervasively transcribed as previously reported.

Conflict of interest statement

The authors have declared that no competing interests exist.


Figure 1. Low precision for tiling arrays compared to RNA-Seq data.
(A) Precision-recall curves for detection of exons in human RefSeq gene annotations on tiling arrays. Transcribed genomic regions (transfrags) were selected based on a range of parameters that were applied before or after median smoothing with a bandwidth of 70 bp: max gap, the maximum distance between two positive probes; min run, the minimum size of a transcribed region. The log2 normalized intensity threshold used to select positive probes was varied between −1 and 2 to plot each line. (B) Precision-recall curves for the combined RNA-Seq data from three human brain samples, at different read depths (0.2 to 2.1 Gb). Transcribed regions (seqfrags) were identified on the basis of uniquely mapped reads, and the threshold for the minimal read count per seqfrag was varied between 1 and 100 to plot each line. (C) Comparison of RNA-Seq read counts and tiling array probe intensities for the pooled set of human brain RNA-Seq reads (three samples). The number of RNA-Seq reads overlapping each mapped probe coordinate was determined and used to draw a boxplot of the intensity distributions measured for probes overlapped by varying numbers of RNA-Seq reads, as indicated (gray boxes). The intensity distribution across all probes is shown in comparison (white box). Line graphs indicating the cumulative fraction of RNA-Seq read area (green) and read count (red) covered at each read coverage level are superimposed on the boxplot, with the scale shown on the right. (D) Kernel-density plot of probe intensities for high- and low-coverage probe groups from (C), as indicated.
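The transfrag selection described in (A) can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the base-level definition of precision and recall, and the choice to measure region size as the span between the first and last positive probe are all assumptions made for illustration, and median smoothing is omitted.

```python
def call_transfrags(positions, intensities, threshold, max_gap, min_run):
    """Group positive probes into transcribed regions (hypothetical helper).

    positions:   sorted probe coordinates in bp
    intensities: log2 normalized intensity for each probe
    A probe is positive when its intensity is >= threshold. Neighboring
    positive probes are merged while they are at most max_gap apart, and
    regions spanning fewer than min_run bp are discarded.
    """
    regions = []
    start = end = None
    for pos, value in zip(positions, intensities):
        if value < threshold:
            continue  # negative probe: never starts or extends a region
        if start is None:
            start = end = pos
        elif pos - end <= max_gap:
            end = pos  # close enough to the previous positive probe
        else:
            if end - start >= min_run:
                regions.append((start, end))
            start = end = pos
    if start is not None and end - start >= min_run:
        regions.append((start, end))
    return regions


def covered_bases(intervals):
    """Set of base positions covered by inclusive (start, end) intervals."""
    bases = set()
    for start, end in intervals:
        bases.update(range(start, end + 1))
    return bases


def precision_recall(regions, exons):
    """Base-level precision and recall (an assumed definition)."""
    called = covered_bases(regions)
    truth = covered_bases(exons)
    overlap = len(called & truth)
    precision = overlap / len(called) if called else 0.0
    recall = overlap / len(truth) if truth else 0.0
    return precision, recall
```

Sweeping `threshold` over a range such as −1 to 2 and recording one `(precision, recall)` pair per value would trace out one curve of the kind shown in the panel.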
Figure 2. RNA-Seq read mapping overview.
(A) Proportion of reads with a unique match in the genome mapping to known genes, mRNAs, and spliced ESTs. Reads were pooled across all human or mouse RNA-Seq samples and sequentially matched against a non-redundant set of known genes, mRNA, and spliced EST data. Any remaining reads were classified as “other.” (B) Same as in (A) but considering the total amount of transcribed genomic area, rather than read count. (C) The relationship between the RNA-Seq read depth and the transcribed area in the genome for human brain RNA-Seq reads, based on 50.2 million reads pooled from the three independent samples that were assayed separately. The total transcribed area is indicated for all reads, as well as those that map to known exons, known introns, and intergenic regions. (D) Extrapolation of transcribed genomic area at increasing read depths, based on the distribution of all reads in (C). The model fitted on the uniquely mapped reads is shown in the inset. (E, F) Cumulative fraction of seqfrags as a function of the number of reads mapped to each seqfrag in the combined set of human and mouse samples, respectively.
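As a rough illustration of how transcribed area relates to mapped reads, the sketch below merges read intervals into seqfrags and sums their lengths. It assumes that a seqfrag is a maximal run of overlapping or abutting uniquely mapped reads on one chromosome; the captions do not spell out the exact construction, and the function names are hypothetical.

```python
def build_seqfrags(reads):
    """Merge reads into seqfrags (assumed definition, see lead-in).

    reads: list of half-open (start, end) intervals on a single chromosome.
    Returns a list of (start, end, read_count) tuples.
    """
    seqfrags = []
    for start, end in sorted(reads):
        if seqfrags and start <= seqfrags[-1][1]:
            # Read overlaps or abuts the current seqfrag: extend it.
            prev_start, prev_end, count = seqfrags[-1]
            seqfrags[-1] = (prev_start, max(prev_end, end), count + 1)
        else:
            seqfrags.append((start, end, 1))
    return seqfrags


def transcribed_area(seqfrags, min_reads=1):
    """Total bp covered by seqfrags supported by at least min_reads reads."""
    return sum(end - start for start, end, count in seqfrags if count >= min_reads)
```

Calling `transcribed_area` on seqfrags built from progressively larger random subsets of the reads would give a depth-versus-area curve comparable in spirit to panel (C).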
Figure 3. Intergenic expression is positionally biased towards known genes.
(A) Relative enrichment of RNA-Seq read frequency in intergenic regions as a function of the distance to 5′ and 3′ ends of annotated genes in the human (red) and mouse genomes (green). The distribution in genomic DNA-Seq reads from HeLa cells is shown as a control (gray). All intergenic regions in the human and mouse genomes were aligned relative to the annotated transcription start (TSS) or termination (TTS) sites of flanking genes. The robust average number of reads per 10 million uniquely mapped reads across all samples was then determined in 1 kb segments (RPKB) from the TSS or TTS, up to a distance of 30 kb, and the relative enrichment ratio in each segment was calculated by dividing by the median RPKB at distances more than 30 kb away from genes (baseline). Robust averages were calculated after removing the top 0.5% outliers, to prevent very highly expressed regions from having a disproportionate effect. (B) Same plots as in (A) for the combined reads from total RNA samples taken from human brain tissue and a universal human reference sample, uniquely mapped to the sense (blue) or antisense strand (yellow) relative to the neighboring gene region. (C) Histogram showing the distribution of correlation coefficients (red) between the read coverage in intergenic seqfrags and the nearest neighboring gene, across 11 human RNA-Seq samples. Read coverage was calculated as the number of reads per base per 10 million RNA-Seq reads across seqfrags and exonic regions of neighboring genes. Correlation coefficients were only calculated if the number of reads mapping to seqfrags and neighboring genes was greater than 10 in at least five out of eleven samples. The background distribution of correlation coefficients between seqfrags and randomly selected genes that met these thresholds is shown in comparison (gray). (D) Boxplot showing the correlation between the read coverage of intergenic transcripts and closest neighboring genes (red) or random genes (gray) across 11 human RNA-Seq tissue samples, as a function of their distance. (E) Representative example of intergenic transcription directly adjacent to the 3′ end of FAM114A1. The region with significant correlation is indicated by a red box. Mapped read coverage for the PolyA+ (black) and total RNA (blue) samples was standardized on a sequencing depth of 10 million reads and plotted in graphs scaled from 1- to 25-fold coverage.
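The enrichment calculation in (A) can be sketched as follows. This is one reading of the caption rather than the published code: the function names are hypothetical, and the trimming step simply drops the highest values in each distance bin before averaging.

```python
import statistics


def rpkb(read_count, total_mapped_reads, segment_kb=1.0):
    """Reads per kb, scaled to 10 million uniquely mapped reads."""
    return read_count * 1e7 / total_mapped_reads / segment_kb


def robust_mean(values, trim_fraction=0.005):
    """Mean after removing the top trim_fraction of values (0.5% by default)."""
    ordered = sorted(values)
    n_drop = int(len(ordered) * trim_fraction)
    kept = ordered[:len(ordered) - n_drop] if n_drop else ordered
    return sum(kept) / len(kept)


def relative_enrichment(near_bins, far_values, trim_fraction=0.005):
    """Enrichment per distance bin relative to the far-from-gene baseline.

    near_bins:  one list of RPKB values per 1 kb distance bin (0 to 30 kb)
    far_values: RPKB values for segments more than 30 kb away from genes
    """
    baseline = statistics.median(far_values)
    return [robust_mean(values, trim_fraction) / baseline for values in near_bins]
```

A ratio near 1 in a given bin would indicate read density comparable to the distant intergenic baseline; larger ratios close to the TSS or TTS correspond to the positional bias the figure reports.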
Figure 4. Evidence for specific expression in intergenic regions.
Rootograms of the distribution of the total number of RNA-Seq reads per kb of trimmed intergenic sequence for the combined (A) human PolyA+, (B) mouse PolyA+, and (C) human total RNA sequence data (gray bars), in comparison to the expected random distribution for the same number of reads (red lines). Ten kb intergenic regions flanking known gene annotations were excluded from the analysis. (D, E, and F) Same as (A), (B), and (C), but considering only intergenic transcribed regions with single-read coverage (singletons). The derived random distribution was adjusted accordingly.
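A rootogram plots square-root frequencies so that the sparse tail of a count distribution stays visible next to the bulk. The sketch below shows one way to obtain an expected random distribution for comparison, assuming reads fall uniformly and independently across 1 kb windows so that per-window counts are Poisson distributed. The caption does not name the model used, so the Poisson choice and all function names here are assumptions.

```python
import math


def poisson_pmf(k, lam):
    """Probability of observing exactly k events with mean rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def expected_counts(total_reads, n_windows, max_k):
    """Expected number of windows holding 0..max_k reads under the assumed model."""
    lam = total_reads / n_windows  # mean reads per 1 kb window
    return [n_windows * poisson_pmf(k, lam) for k in range(max_k + 1)]


def rootogram_heights(counts):
    """Square-root transform of frequencies, as plotted in a rootogram."""
    return [math.sqrt(c) for c in counts]
```

Observed window counts that exceed the expected curve at high read numbers would point to regions with more reads than random sampling predicts, which is the comparison the panels are drawn to show.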
Figure 5. Seqfrags with read counts above background are conserved at the sequence level.
Distribution of maximum PhastCons conservation score measured across seqfrags mapping to trimmed intergenic regions in the pooled (A) human and (B) mouse RNA-Seq samples as a function of read coverage (red). PhastCons scores were obtained from the UCSC genome browser and reflect the degree of conservation in multiple alignments of the human and mouse genomes with 18 and 20 other mammalian species, respectively. Conservation scores obtained from a random shuffling of seqfrag positions within trimmed intergenic regions are shown in gray for comparison. (C and D) Bar plots indicating the PhastCons score distribution for seqfrags mapping to different genomic regions. The bars are color-coded according to the class of seqfrags (legend), with the score distribution for randomly mapped seqfrags shown in gray.
Figure 6. Conservation and usage of human TU exons.
(A) The distribution of PhastCons scores for novel exons in each category (darker bars), compared to the distribution of scores from the same set of exons after randomly reshuffling their positions in the genome (lighter bars). (B) Plot of the ratios between the read coverage of novel exons (calculated in RPB) and the genes they are associated with, either by overlap (sense or antisense) or as additions to known gene structures (5′ end, 3′ end, and internal). The ratios for predicted exons overlapping exons of known gene structures are shown in comparison.
Figure 7. Examples of identified TUs.
(A) Evidence for the presence of an alternative promoter at the human SLC41A1 gene. Splice junctions connecting to the alternative promoter region are indicated in red. Mapped RNA-Seq data for the UHR paired-end (PE) read sample is shown for reference (black). The PhastCons conservation track scores were based on multiple alignments of 28 vertebrates. (B) Protein-coding TU detected in an intergenic region on chromosome 17, with high similarity to the elongation factor Tu GTP binding domain. The two additional upstream transcribed regions may be part of the same transcript, though no junction sequences were detected. (C) Intergenic TU (red) detected on chromosome 15 based on junctions in the PE brain, PE UHR, and SE testes RNA-Seq samples.
Figure 8. Most intergenic transcripts are unspliced and associated with open chromatin.
(A) Relationship between read count and the fraction of seqfrags with at least one identified junction sequence for seqfrags in exonic (gray) or trimmed intergenic (red) regions. (B) Cluster of ubiquitously expressed seqfrags derived from uniquely mapped reads on chromosome 15. An additional track with multireads from SE testes RNA-Seq data (blue) shows that many of the uniquely mapped seqfrags are part of a larger, continuously transcribed region. (C) Digital DNase I hypersensitivity profiles in RA-differentiated SK-N-SH cells for 11,416 seqfrags (red) and 5,819 seqfrag clusters (green) expressed in human brain. Hypersensitivity is shown as the average density of in vivo cleavage fragment reads per kb (RPKB, normalized to 20 million reads) across all seqfrags or clusters, measured in 100 bp windows flanking the center position of each seqfrag or cluster up to a distance of 2 kb. The DNase I hypersensitivity at random positions in intergenic regions is shown as a control (gray). The box-and-whisker plots at the bottom of the graph indicate the median (box) and the 95th percentile (whiskers) of the seqfrag- (red) and seqfrag cluster size range (green).


