Nature. 2017 Mar 9;543(7644):199-204.
doi: 10.1038/nature21374. Epub 2017 Mar 1.

An Atlas of Human Long Non-Coding RNAs With Accurate 5' Ends

Free PMC article

Chung-Chau Hon et al. Nature.


Long non-coding RNAs (lncRNAs) are largely heterogeneous and functionally uncharacterized. Here, using FANTOM5 cap analysis of gene expression (CAGE) data, we integrate multiple transcript collections to generate a comprehensive atlas of 27,919 human lncRNA genes with high-confidence 5' ends and expression profiles across 1,829 samples from the major human primary cell types and tissues. Genomic and epigenomic classification of these lncRNAs reveals that most intergenic lncRNAs originate from enhancers rather than from promoters. Incorporating genetic and expression data, we show that lncRNAs overlapping trait-associated single nucleotide polymorphisms are specifically expressed in cell types relevant to the traits, implicating these lncRNAs in multiple diseases. We further demonstrate that lncRNAs overlapping expression quantitative trait loci (eQTL)-associated single nucleotide polymorphisms of messenger RNAs are co-expressed with the corresponding messenger RNAs, suggesting their potential roles in transcriptional regulation. Combining these findings with conservation data, we identify 19,175 potentially functional lncRNAs in the human genome.

Conflict of interest statement

The authors declare no competing financial interests. Readers are welcome to comment on the online version of the paper.


Extended Data Figure 1 | Building a 5′ complete lncRNA catalogue.
a, Integration of CAGE and transcript models. CAGE clusters were used to integrate transcript models from various sources and their 5′ completeness was assessed on the basis of TIEScore. b, Identification of lncRNAs. TIEScore identified 59,110 genes and coding potential assessment further identified 27,919 lncRNAs in FANTOM CAT at the robust TIEScore cutoff. c, Categorization of lncRNAs. LncRNAs were annotated according to their gene orientation (that is, genomic context) and DHS type (that is, epigenomic context) and then categorized into divergent p-lncRNAs (purple), intergenic p-lncRNAs (blue), e-lncRNAs (green) and other lncRNAs (grey). d, Overlaps between FANTOM CAT and other lncRNA catalogues. e, LncRNA gene models outside FANTOM CAT are 5′ incomplete. LncRNAs found commonly in both catalogues (grey), or only in FANTOM CAT (red), show stronger evidence of transcription initiation (DHS, H3K4me1, H3K4me3 and PolII ChIP-seq) and conservation (phastCons) than those found only in other lncRNA catalogues (blue, green or yellow).
Extended Data Figure 2 | FANTOM CAT is more 5′ complete than other lncRNA catalogues.
a, FANTOM CAT lncRNA TSS are well-supported. The 5′ ends of FANTOM CAT lncRNAs (first column) have stronger transcriptomic, epigenomic and genomic evidence of transcription initiation than the 5′ ends of lncRNA models in the Human BodyMap 2.0 (ref. 4), miTranscriptome and GENCODE release 25 (ref. 19) (second column). In b and c, the box plots show the median, quartiles and Tukey whiskers of the estimates of FDR of complete 5′ ends (b) and number of 5′ complete lncRNA genes (c) on the basis of ten sets of gold standard TSS and non-TSS regions (Methods). b, FDR of complete 5′ ends. c, Estimated number of 5′ complete lncRNA genes (total number of genes × [1 − FDR]). d, Validation rate of gene models using RAMPAGE. RAMPAGE data sets (n = 207, Methods) were used to validate the lncRNA transcripts in FANTOM CAT and other catalogues (left). Transcripts containing full consensus CDS (CCDS transcripts) were used as controls (right). The exon of a transcript is detected by RAMPAGE if it overlaps ≥3 RAMPAGE 3′ ends. Transcript detection rates of all catalogues were plotted (upper). About 95% of lncRNA transcripts in the robust FANTOM CAT can be detected, slightly more than for GENCODE release 25 (~92%). The TSS of a detected transcript is validated by RAMPAGE if it is located within the proximity of a RAMPAGE 5′ end (for example, from 0 to 500 bp, x axis, lower). At 100 bp, ~95% of lncRNA transcripts in the robust FANTOM CAT can be validated, versus ~85% for GENCODE release 25. We note that the percentages of CCDS transcripts in FANTOM CAT and GENCODE release 25 detected or validated by RAMPAGE are similar, with the robust and stringent FANTOM CAT catalogues performing slightly better.
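The two calculations named in panels c and d can be illustrated with a minimal Python sketch. It is not the authors' pipeline. The function names are hypothetical, positions are plain integers on a single chromosome, and strand is ignored for simplicity. A TSS counts as validated when it lies within a chosen distance of any RAMPAGE 5′ end, and the number of 5′ complete genes is the total multiplied by 1 − FDR.

```python
from bisect import bisect_left


def nearest_distance(position, sorted_sites):
    """Distance in bp from position to the closest site in a sorted list."""
    i = bisect_left(sorted_sites, position)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = sorted_sites[max(i - 1, 0):i + 1]
    return min(abs(position - s) for s in candidates)


def validation_rate(tss_positions, rampage_5p_ends, max_distance):
    """Fraction of TSSs lying within max_distance bp of any RAMPAGE 5' end."""
    sites = sorted(rampage_5p_ends)
    if not sites or not tss_positions:
        return 0.0
    hits = sum(nearest_distance(t, sites) <= max_distance for t in tss_positions)
    return hits / len(tss_positions)


def estimated_complete_genes(total_genes, fdr):
    """Panel c formula: total number of genes x (1 - FDR)."""
    return total_genes * (1.0 - fdr)
```

Calling `validation_rate` over a range of `max_distance` values (for example 0 to 500 bp) would trace a curve of the kind described for the lower part of panel d.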
Extended Data Figure 3 | Revision of lncRNA models in GENCODE.
a, An example of improved TSS annotation of a GENCODE release 25 lncRNA gene. The 5′ ends of GENCODE release 25 annotated lncRNA transcripts of TUG1 (ENSG00000253352) are distant from the region of strong CAGE signal, while FANTOM CAT added extra transcripts that accurately start from the proximal CAGE signal summit. b, An example of bridged gene models of GENCODE release 25 lncRNA genes. In GENCODE release 25, the locus was annotated with three short lncRNA genes; FANTOM CAT bridged these short lncRNA transcript models into a long transcript model (RP11-973H7.4, ENSG00000267654) starting from the proximal CAGE signal summit.
Extended Data Figure 4 | Heterogeneity among lncRNA gene categories.
a, Epigenomic features surrounding TSS. The y axis refers to the fraction of TIR overlaps with peaks of the corresponding epigenomic signal from the Roadmap Epigenome Consortium. b, Genomic features surrounding TSS. Sequence features conducive to generating longer transcripts are enrichment of 5′ splice site (5′ SS) and depletion of polyadenylation sites (PAS). Sequence features associated with transcription initiation include CpG islands, INR (initiator) motif and TATA box motif. c, Core promoter motifs. Grey dashed lines indicate whole-genome background.
Extended Data Figure 5 | Transposons at TIRs.
a, Percentages of genes with conserved and unconserved TIR (as defined in Fig. 1c) and their overlap with various classes of transposons. b, Enrichment of retrotransposons at unconserved TIR. The Venn diagrams show the overlap between unconserved TIR, DNA transposons and retrotransposons. Retrotransposons are significantly enriched in unconserved TIR of all gene classes (one-tailed Fisher’s exact test, P < 0.05).
Extended Data Figure 6 | Expression landscape of lncRNAs in primary cells.
a, Expression level and specificity. The abbreviation cpm denotes relative log expression (rle)-normalized counts per million. The maximum expression level (log2 cpm) and expression specificity (Chao-Shen’s corrected Shannon entropy) of genes among 69 primary cell facets were plotted. Box plots show the median (dashed lines), quartiles and Tukey whiskers. b, Percentage of genes within categories expressed within primary cell facets. The circles represent the mean among samples within a facet and the error bars represent 99.99% confidence intervals. Dashed lines represent the means among all samples. c, Number of lncRNA genes expressed within primary cell facets. Dashed line represents the mean among all samples. The x axis is sorted on the basis of number of lncRNA genes expressed. A gene is considered as ‘expressed’ when cpm ≥ 0.01.
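The ‘expressed’ rule in panel c (cpm ≥ 0.01) and the per-facet mean can be sketched as follows. This is an illustration under assumptions, not the authors' code: the input layout (a dictionary mapping each facet to a list of per-sample gene-to-cpm dictionaries) and the function names are hypothetical, and the confidence intervals shown in the figure are not computed here.

```python
from statistics import mean

EXPRESSION_CUTOFF_CPM = 0.01  # threshold stated in the legend


def count_expressed(sample_cpm):
    """Number of genes in one sample with cpm at or above the cutoff."""
    return sum(1 for cpm in sample_cpm.values() if cpm >= EXPRESSION_CUTOFF_CPM)


def mean_expressed_per_facet(facets):
    """Map each facet name to the mean expressed-gene count over its samples."""
    return {
        name: mean(count_expressed(sample) for sample in samples)
        for name, samples in facets.items()
    }
```

Sorting the resulting dictionary by value would reproduce the x-axis ordering described for panel c.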
Extended Data Figure 7 | Association of cell-type-enriched genes with trait-associated genes of different biological themes.
A detailed view of blocks from Fig. 2a. The dendrograms were coloured as in Fig. 2a. a, ‘Immune system’ cell types and ‘infection and immunity’ traits. b, ‘Hepato-intestinal system’ cell types and ‘hepatic function’ traits. c, ‘Pigmented cells’ cell types and ‘pigmentation’ traits. d, ‘Non-immune blood cells’ cell types and ‘blood homeostasis’ traits. e, ‘Cardiovascular system’ cell types and ‘cardiovascular function’ traits.
Extended Data Figure 8 | LncRNA AP001057.1 is associated with classical monocytes and implicated in immune diseases.
a, Genomic view of AP001057.1 (ENSG00000232124) in the ZENBU genome browser. The strongest TSS of AP001057.1 overlaps with an enhancer DHS. The locus overlaps with fine-mapped SNPs associated with Crohn’s disease and GWAS SNPs associated with coeliac disease and inflammatory bowel disease. b, AP001057.1 is associated with classical monocytes (CL:0000860). c, AP001057.1 is significantly upregulated in monocytes upon stimulation with various immunogenic agents (FDR < 0.05 in edgeR, highlighted in red and indicated with asterisks). Note: we performed differential expression analysis to identify lncRNAs that are dynamically regulated upon stimulation, infection or differentiation on the basis of 25 manually curated series of FANTOM5 samples (Supplementary Table 18 and Methods), and the results are available in Supplementary Table 19. Figures were captured (with slight modifications) from the online resource at http://fantom.gsc.riken.Jp/cat/v1/#/genes/ENSG00000232124.1.
Extended Data Figure 9 | Selective constraints and enrichment of GWAS trait and eQTL-associated SNPs at lncRNA loci.
a, Selective constraints between species (phastCons) and within human population (derived allele frequency). b, Enrichment of GWAS SNPs. Only lead GWAS SNPs were used (Methods). c, Enrichment of PICS fine-mapped SNPs in global (all versus all) or focused (immune versus immune) analysis (Methods). d, Enrichment of GTEx eQTL SNPs associated with expression of mRNAs. Circles represent means and the error bars represent their 99.99% confidence intervals.
Extended Data Figure 10 | Co-expression of various gene pairs linked by eQTL SNPs.
We searched for gene loci that overlap eQTL SNPs associated with expression variation of mRNAs (as identified by GTEx (ref. 16)). Gene loci overlapping these SNPs were then paired with the corresponding mRNA and their expression correlation across the FANTOM5 expression atlas was investigated. Rows compare the gene types overlapping the SNPs. a, mRNAs; b, all lncRNAs; c, divergent p-lncRNAs; d, intergenic p-lncRNAs; e, e-lncRNAs. Columns compare the relative orientation of the gene pairs and the position of the SNPs. The term ‘all’ refers to all orientations of the gene pairs and positions of the SNPs pooled. Gene pairs were binned on the basis of the number of SNPs linking the pair (bin = 5 SNPs). The data points represent the mean of absolute Spearman’s rho and the error bars represent its 99.99% confidence intervals. At each bin, the number of pairs plotted is the same for the three pair types as indicated.
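The summary statistic in this legend, the mean of absolute Spearman’s rho per SNP-count bin, can be sketched in plain Python. This is an illustration only: the function names and the input layout are hypothetical, the bin edges (1 to 5 SNPs in the first bin, 6 to 10 in the second, and so on) are an assumption about how "bin = 5 SNPs" was applied, and the confidence intervals are not computed.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # average of the 1-based ranks in the tie
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: constant profiles are treated as uncorrelated
    return cov / sqrt(var_x * var_y)


def mean_abs_rho_by_bin(pairs, bin_width=5):
    """pairs: iterable of (n_linking_snps, expr_a, expr_b) -> {bin_index: mean |rho|}."""
    bins = defaultdict(list)
    for n_snps, expr_a, expr_b in pairs:
        bins[(n_snps - 1) // bin_width].append(abs(spearman_rho(expr_a, expr_b)))
    return {b: mean(v) for b, v in sorted(bins.items())}
```

Taking the absolute value before averaging means that strongly negative and strongly positive correlations both count as co-expression, which matches the legend's use of absolute rho.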
Figure 1 | Conservation of lncRNAs.
a, Categories of lncRNAs. b, Rejected substitution (RS) scores. Per-nucleotide values of the highest scoring window (200 nt) were plotted. Box plots show the median (dashed lines), quartiles and Tukey whiskers. Circles indicate functional lncRNAs from lncRNAdb. The filled, half-filled and empty circles represent different TIR and exon conservation scenarios as in c. c, Percentages of genes (grey scale) defined to have conserved TIR, exon or both, based on GERP elements. d, Percentages of all orthologous human TSSs. e, Percentages of active orthologous human TSSs.
Figure 2 | Cell-type-specific lncRNAs implicated in GWAS traits.
a, Unsupervised clustering of cell types and traits based on the association of cell-type-enriched genes with trait-associated genes. All lncRNAs and all other genes were used. Only cell types and traits involved in significantly associated cell-type-trait pairs were plotted. Intensity represents the level of association measured as Z-score of the log-transformed FDR reciprocal in one-tailed Fisher’s exact test. Cell types and traits were clustered on the basis of the Z-score. Selected cell types and traits of six matching themes were colour-coded accordingly. Clusters for specific themes are highlighted in the dendrograms (Extended Data Fig. 7 for detailed views). b, Detailed view of the neural block, showing significant association of genes enriched in nervous system tissues and genes associated with neuropathy and behaviour traits. c, Contributions of gene categories within the neural block. Odds ratios were calculated on the basis of all genes, or other gene categories as indicated. d, Number of genes contributing to significantly associated cell-type-trait pairs.
Figure 3 | LncRNAs implicated in eQTL.
a, Rationale of the analysis. Expression correlation of lncRNA-mRNA pairs (b) binned on the basis of distances between the pair and (c) binned on the basis of the number of eQTL-associated SNPs linking the pair. Circles represent the mean of absolute Spearman’s rho and the error bars represent their 99.99% confidence intervals. Asterisks indicate that the absolute Spearman’s rho for the eQTL-linked pairs is significantly higher than that of non-linked, distance- and orientation-matched cis random pairs (paired Student’s t-test, P < 0.05). d, Co-expression of an eQTL-linked lncRNA-mRNA pair; cpm, counts per million.
Figure 4 | Functional evidence of human lncRNAs.
a, Venn diagram showing lncRNAs with conserved exon, conserved TIR, implicated in eQTL or implicated in GWAS traits. b, Enrichment of lncRNAs with conserved exon or TIR in lncRNAs implicated in eQTL or GWAS traits. ‘OR’ and ‘P’ refer to odds ratio and P value of one-tailed Fisher’s exact test. c, Level of conservation versus level of enrichment in lncRNAs implicated in eQTL or GWAS traits. Asterisks indicate lncRNAs at certain levels of conservation are significantly enriched in lncRNAs implicated in eQTL or GWAS traits (one-tailed Fisher’s exact test, P < 0.05).
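The ‘OR’ and ‘P’ values in panel b come from a one-tailed Fisher’s exact test on a 2×2 table. A minimal sketch of that test, using only the Python standard library, is given below. The function name and the table layout (a = conserved and implicated, b = conserved only, c = implicated only, d = neither) are illustrative assumptions, not taken from the paper.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-tailed Fisher's exact test for enrichment on the table [[a, b], [c, d]].

    Returns (odds_ratio, p_value). The p-value is the hypergeometric probability
    of observing a or more counts in the top-left cell with all margins fixed.
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    upper = min(row1, col1)  # largest value the top-left cell can take
    tail = sum(
        comb(row1, k) * comb(total - row1, col1 - k)
        for k in range(a, upper + 1)
    )
    p_value = tail / comb(total, col1)
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return odds_ratio, p_value
```

For example, `fisher_exact_greater(3, 1, 1, 3)` gives an odds ratio of 9 and a P value of 17/70, so a table that small would not pass the P < 0.05 threshold used in panel c.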
