Nature. 2014 Mar 27;507(7493):462-70.
doi: 10.1038/nature13182.

A Promoter-Level Mammalian Expression Atlas

FANTOM Consortium and the RIKEN PMI and CLST (DGT); Alistair R R Forrest, Hideya Kawaji, Michael Rehli, J Kenneth Baillie, Michiel J L de Hoon, Vanja Haberle, Timo Lassmann, Ivan V Kulakovskiy, Marina Lizio, Masayoshi Itoh, Robin Andersson, Christopher J Mungall, Terrence F Meehan, Sebastian Schmeier, Nicolas Bertin, Mette Jørgensen, Emmanuel Dimont, Erik Arner, Christian Schmidl, Ulf Schaefer, Yulia A Medvedeva, Charles Plessy, Morana Vitezic, Jessica Severin, Colin A Semple, Yuri Ishizu, Robert S Young, Margherita Francescatto, Intikhab Alam, Davide Albanese, Gabriel M Altschuler, Takahiro Arakawa, John A C Archer, Peter Arner, Magda Babina, Sarah Rennie, Piotr J Balwierz, Anthony G Beckhouse, Swati Pradhan-Bhatt, Judith A Blake, Antje Blumenthal, Beatrice Bodega, Alessandro Bonetti, James Briggs, Frank Brombacher, A Maxwell Burroughs, Andrea Califano, Carlo V Cannistraci, Daniel Carbajo, Yun Chen, Marco Chierici, Yari Ciani, Hans C Clevers, Emiliano Dalla, Carrie A Davis, Michael Detmar, Alexander D Diehl, Taeko Dohi, Finn Drabløs, Albert S B Edge, Matthias Edinger, Karl Ekwall, Mitsuhiro Endoh, Hideki Enomoto, Michela Fagiolini, Lynsey Fairbairn, Hai Fang, Mary C Farach-Carson, Geoffrey J Faulkner, Alexander V Favorov, Malcolm E Fisher, Martin C Frith, Rie Fujita, Shiro Fukuda, Cesare Furlanello, Masaaki Furino, Jun-ichi Furusawa, Teunis B Geijtenbeek, Andrew P Gibson, Thomas Gingeras, Daniel Goldowitz, Julian Gough, Sven Guhl, Reto Guler, Stefano Gustincich, Thomas J Ha, Masahide Hamaguchi, Mitsuko Hara, Matthias Harbers, Jayson Harshbarger, Akira Hasegawa, Yuki Hasegawa, Takehiro Hashimoto, Meenhard Herlyn, Kelly J Hitchens, Shannan J Ho Sui, Oliver M Hofmann, Ilka Hoof, Furni Hori, Lukasz Huminiecki, Kei Iida, Tomokatsu Ikawa, Boris R Jankovic, Hui Jia, Anagha Joshi, Giuseppe Jurman, Bogumil Kaczkowski, Chieko Kai, Kaoru Kaida, Ai Kaiho, Kazuhiro Kajiyama, Mutsumi Kanamori-Katayama, Artem S Kasianov, Takeya Kasukawa, Shintaro Katayama, Sachi Kato, Shuji Kawaguchi, Hiroshi Kawamoto, Yuki I Kawamura, Tsugumi Kawashima, Judith S Kempfle, Tony J Kenna, Juha Kere, Levon M Khachigian, Toshio Kitamura, S Peter Klinken, Alan J Knox, Miki Kojima, Soichi Kojima, Naoto Kondo, Haruhiko Koseki, Shigeo Koyasu, Sarah Krampitz, Atsutaka Kubosaki, Andrew T Kwon, Jeroen F J Laros, Weonju Lee, Andreas Lennartsson, Kang Li, Berit Lilje, Leonard Lipovich, Alan Mackay-Sim, Ri-ichiroh Manabe, Jessica C Mar, Benoit Marchand, Anthony Mathelier, Niklas Mejhert, Alison Meynert, Yosuke Mizuno, David A de Lima Morais, Hiromasa Morikawa, Mitsuru Morimoto, Kazuyo Moro, Efthymios Motakis, Hozumi Motohashi, Christine L Mummery, Mitsuyoshi Murata, Sayaka Nagao-Sato, Yutaka Nakachi, Fumio Nakahara, Toshiyuki Nakamura, Yukio Nakamura, Kenichi Nakazato, Erik van Nimwegen, Noriko Ninomiya, Hiromi Nishiyori, Shohei Noma, Tadasuke Noazaki, Soichi Ogishima, Naganari Ohkura, Hiroko Ohimiya, Hiroshi Ohno, Mitsuhiro Ohshima, Mariko Okada-Hatakeyama, Yasushi Okazaki, Valerio Orlando, Dmitry A Ovchinnikov, Arnab Pain, Robert Passier, Margaret Patrikakis, Helena Persson, Silvano Piazza, James G D Prendergast, Owen J L Rackham, Jordan A Ramilowski, Mamoon Rashid, Timothy Ravasi, Patrizia Rizzu, Marco Roncador, Sugata Roy, Morten B Rye, Eri Saijyo, Antti Sajantila, Akiko Saka, Shimon Sakaguchi, Mizuho Sakai, Hiroki Sato, Suzana Savvi, Alka Saxena, Claudio Schneider, Erik A Schultes, Gundula G Schulze-Tanzil, Anita Schwegmann, Thierry Sengstag, Guojun Sheng, Hisashi Shimoji, Yishai Shimoni, Jay W Shin, Christophe Simon, Daisuke Sugiyama, Takaai Sugiyama, Masanori Suzuki, Naoko Suzuki, Rolf K Swoboda, Peter A C 't Hoen, Michihira Tagami, Naoko Takahashi, Jun Takai, Hiroshi Tanaka, Hideki Tatsukawa, Zuotian Tatum, Mark Thompson, Hiroo Toyodo, Tetsuro Toyoda, Elvind Valen, Marc van de Wetering, Linda M van den Berg, Roberto Verado, Dipti Vijayan, Ilya E Vorontsov, Wyeth W Wasserman, Shoko Watanabe, Christine A Wells, Louise N Winteringham, Ernst Wolvetang, Emily J Wood, Yoko Yamaguchi, Masayuki Yamamoto, Misako Yoneda, Yohei Yonekura, Shigehiro Yoshida, Susan E Zabierowski, Peter G Zhang, Xiaobei Zhao, Silvia Zucchelli, Kim M Summers, Harukazu Suzuki, Carsten O Daub, Jun Kawai, Peter Heutink, Winston Hide, Tom C Freeman, Boris Lenhard, Vladimir B Bajic, Martin S Taylor, Vsevolod J Makeev, Albin Sandelin, David A Hume, Piero Carninci, Yoshihide Hayashizaki
Free PMC article

FANTOM Consortium and the RIKEN PMI and CLST (DGT) et al. Nature. 2014.

Abstract

Regulated transcription controls the diversity, developmental pathways and spatial organization of the hundreds of cell types that make up a mammal. Using single-molecule cDNA sequencing, we mapped transcription start sites (TSSs) and their usage in human and mouse primary cells, cell lines and tissues to produce a comprehensive overview of mammalian gene expression across the human body. We find that few genes are truly 'housekeeping', whereas many mammalian promoters are composite entities composed of several closely separated TSSs, with independent cell-type-specific expression profiles. TSSs specific to different cell types evolve at different rates, whereas promoters of broadly expressed genes are the most conserved. Promoter-based expression analysis reveals key transcription factors defining cell states and links them to binding-site motifs. The functions of identified novel transcripts can be predicted by coexpression and sample ontology enrichment analyses. The functional annotation of the mammalian genome 5 (FANTOM5) project provides comprehensive expression profiles and functional annotation of mammalian cell-type-specific transcriptomes with wide applications in biomedical research.

Conflict of interest statement

The authors declare no competing financial interests. Readers are welcome to comment on the online version of the paper.

Figures

Extended Data Figure 1. Decomposition-based peak identification (DPI)
a, Schematic representation of each step in the peak identification. The procedure starts from CAGE profiles at individual biological states (I), then defines tag clusters (consecutive genomic regions producing CAGE signals) over the CAGE profiles accumulated across all the states (II). Within each tag cluster, it infers up to five underlying signals (independent components) by independent component analysis (ICA) (III). It smooths each independent component and finds peaks where the signal is higher than the median (IV). Peaks along the individual components are finally merged if they overlap each other (V). b, c, Genomic view of actual examples (B4GALT1 locus) for human and mouse. CAGE profiles across the biological states (I) are shown as a greyscale plot, in which the x axis represents the genomic coordinates and individual rows represent individual biological states. Dark (or black) dots indicate frequent observation of transcription initiation (that is, larger numbers of CAGE read counts) and light (white) dots indicate less frequent initiation. The blue histogram at the top indicates the accumulated CAGE read counts, and the entire region shown represents a single tag cluster (II). The histograms below the greyscale plot indicate the independent components of the CAGE signals inferred by ICA (III), and the resulting CAGE peaks are shown as the blue bars closest to the bottom (V). The bottom track indicates a gene model in RefSeq. Overall, the figures indicate that RefSeq gene models define only one TSS in this locus; however, transcription starts from slightly different regions depending on the context, and the DPI method successfully captured the different initiation events. d, Breakdown of singleton and composite transcription initiation regions with homogeneous or heterogeneous expression patterns according to a likelihood ratio test (see Supplementary Methods).
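The non-ICA steps of the DPI legend (II, IV and V) can be illustrated with a minimal sketch. This is not the consortium's implementation: the function names, the moving-average smoother and its window size are assumptions made for illustration, and the independent components of step III are taken as given inputs rather than computed.

```python
from statistics import median


def tag_clusters(pooled):
    """Step II: half-open (start, end) runs of consecutive positions with signal."""
    clusters, start = [], None
    for i, count in enumerate(pooled):
        if count > 0 and start is None:
            start = i
        elif count == 0 and start is not None:
            clusters.append((start, i))
            start = None
    if start is not None:
        clusters.append((start, len(pooled)))
    return clusters


def smooth(signal, window=3):
    """Centred moving average (an assumed smoother; the legend does not name one)."""
    half = window // 2
    smoothed = []
    for i in range(len(signal)):
        chunk = signal[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def peaks_above_median(component, window=3):
    """Step IV: intervals where the smoothed component exceeds its own median."""
    smoothed = smooth(component, window)
    threshold = median(smoothed)
    return tag_clusters([1 if value > threshold else 0 for value in smoothed])


def merge_overlapping(intervals):
    """Step V: merge half-open intervals that overlap each other."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start < merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical pooled CAGE counts along a short stretch of genome.
clusters = tag_clusters([0, 3, 5, 2, 0, 0, 4, 1, 0])
```

In a fuller pipeline, `peaks_above_median` would be called once per independent component within a tag cluster and the resulting intervals passed together to `merge_overlapping`.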
Extended Data Figure 2. Broad and sharp promoters
DPI peaks from the permissive set were aggregated by grouping neighbouring peaks less than 100 bp apart. The cumulative distribution of CAGE signal along each region was calculated and the positions of the 10th and 90th percentiles were determined. a, Schematic representation of CAGE signal within a promoter region and calculation of interquantile width. Signal from CAGE transcription start sites (CTSS) is shown. The distance between the 10th and 90th percentile positions (interquantile width) was used as a measure of promoter width. b, Distribution of promoter interquantile width across all 988 human samples. Individual grey lines show the distribution in each sample and the average distribution is shown in yellow. For each sample only promoters with ≥ 5 TPM were selected. The distribution of interquantile widths was clearly bimodal and allowed us to set an empirical threshold at 10.5 bp that best separates sharp from broad promoters. c, Distribution of expression specificity. The distribution of log ratios of expression in individual samples against the median expression across all samples is shown separately for sharp and broad promoters. The solid line shows the average distribution for all samples and the semi-transparent band denotes the 99% confidence interval. The dashed line corresponds to the expected log ratio if all samples contributed equally to the total expression. d, Average frequency of AA/AT/TA/TT (WW) dinucleotides around the dominant TSS of sharp (red) and broad (blue) promoters across all human samples. Lines show the average signal and semi-transparent bands indicate the 99% confidence interval. A closer view of WW dinucleotide frequency displaying 10 bp periodicity is shown in the inset and indicates the likely position of the +1 nucleosome. For comparison, the signal aligned to a randomly chosen TSS in broad promoters is shown in orange. e, As in a but for promoters in CD14+ monocytes. H2A.Z signal (subtracted coverage = plus strand coverage − minus strand coverage) around sharp and broad promoters is shown in corresponding semi-transparent colours (data from ref. 51). The transition point in subtracted coverage from positive to negative values indicates the most likely position of the centre of the nucleosome (shown as a semi-transparent blue circle). f, As in b but for promoters in frontal lobe. H3K4me3 signal (subtracted coverage = plus strand coverage − minus strand coverage) around sharp and broad promoters is shown in corresponding semi-transparent colours (data from ref. 52).
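The interquantile width described in this legend can be sketched as follows. This is an illustrative reading of the caption rather than the authors' code: the function names and the input layout (a list of position and tag-count pairs for one promoter region) are assumptions, and the width is taken as the plain distance between the 10th and 90th percentile positions.

```python
def interquantile_width(ctss, lower=0.10, upper=0.90):
    """Distance between the positions where cumulative CAGE signal reaches
    the lower and upper quantiles.

    ctss: list of (position, tag_count) pairs within one promoter region.
    """
    ctss = sorted(ctss)
    total = sum(count for _, count in ctss)
    cumulative = 0.0
    pos_lower = pos_upper = None
    for position, count in ctss:
        cumulative += count
        fraction = cumulative / total
        if pos_lower is None and fraction >= lower:
            pos_lower = position
        if pos_upper is None and fraction >= upper:
            pos_upper = position
    return pos_upper - pos_lower


def promoter_shape(width, threshold=10.5):
    """Classify using the empirical 10.5 bp threshold quoted in the legend."""
    return "sharp" if width < threshold else "broad"


# Hypothetical promoters: one dominated by a single position, one dispersed.
sharp_like = [(100, 1), (101, 1), (102, 96), (103, 1), (104, 1)]
broad_like = [(100 + i, 1) for i in range(51)]
shapes = (promoter_shape(interquantile_width(sharp_like)),
          promoter_shape(interquantile_width(broad_like)))
```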
Extended Data Figure 3. Density plots of DPI peaks maximum and median expression
a, Distribution for all human robust peaks. b, Distribution for all mouse robust peaks. Fraction on the left of the vertical dashed line corresponds to peaks with non-ubiquitous (cell-type-restricted) expression patterns (median < 0.2 TPM). Fraction below the diagonal dashed line corresponds to ubiquitous-uniform (housekeeping) expression profiles (less than tenfold difference between maximum and median). Fraction in the top-middle corresponds to ubiquitous-non-uniform expression profiles (maximum > tenfold median). c–e, Distributions based on cell line, primary cell and tissue data, respectively. The mixture of cells in tissues may lead to overestimation of the fraction of ubiquitously expressed genes. f, Boxplot showing the number of peaks detected at ≥ 10 TPM in primary cells, cell lines or tissues. g, As in a but showing transcription factor p1 peaks only. h, Boxplot showing maximum expression of the main promoter for transcription factors or all coding genes. i, Density plots of human robust DPI peaks maximum and median expression for the main promoter of coding genes. j, As in d but showing the main promoter of transcription factors. Fraction on the left of the vertical dashed line corresponds to peaks with non-ubiquitous (cell-type-restricted) expression patterns (median < 0.2 TPM). Fraction below the diagonal dashed line corresponds to ubiquitous-uniform (housekeeping) expression profiles (less than tenfold difference between maximum and median). Fraction above the diagonal and to the right of the vertical dashed lines corresponds to ubiquitous-non-uniform expression profiles (maximum > tenfold median). k, Distribution for peaks with CpG island only (n = 55,897). l, Distribution for peaks with only a TATA motif (n = 3,933). m, Distribution for peaks with both CpG islands and TATA box motifs (n = 834). n, Distribution for DPI peaks with neither a TATA motif nor CpG island (n = 124,152). Fraction on the left of the vertical dashed line corresponds to peaks with non-ubiquitous (cell-type-restricted) expression patterns (median < 0.2 TPM). Fraction below the diagonal dashed line corresponds to ubiquitous-uniform (housekeeping) expression profiles (less than tenfold difference between maximum and median). Fraction above the diagonal and to the right of the vertical dashed lines corresponds to ubiquitous-non-uniform expression profiles (maximum > tenfold median).
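The three expression classes used in these density plots follow directly from two thresholds, which can be written as a small sketch. The function name and input format are hypothetical; the legend does not say how a maximum of exactly tenfold the median is handled, so assigning that case to the non-uniform class is an assumption.

```python
from statistics import median


def classify_peak(tpm_values, median_cutoff=0.2, fold_cutoff=10.0):
    """Assign a peak to one of the three classes described in the legend.

    tpm_values: expression of one peak across samples, in TPM.
    """
    peak_median = median(tpm_values)
    peak_max = max(tpm_values)
    if peak_median < median_cutoff:
        return "non-ubiquitous"            # cell-type-restricted
    if peak_max < fold_cutoff * peak_median:
        return "ubiquitous-uniform"        # housekeeping
    return "ubiquitous-non-uniform"        # boundary case assigned here by assumption


# Hypothetical expression vectors for three peaks.
labels = [
    classify_peak([0.0, 0.0, 0.0, 50.0, 0.1]),
    classify_peak([5.0, 6.0, 7.0, 8.0, 9.0]),
    classify_peak([1.0, 1.0, 1.0, 1.0, 100.0]),
]
```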
Extended Data Figure 4. Cross-species projected super-clusters
a, The number of mouse and human TSSs (both permissive and robust) per projected super-cluster. b, Same data as presented in panel a, with the y axis on a log scale. There is a slight tendency for more human TSSs per super-cluster than mouse TSSs. c, The number of human and mouse TSSs per projected super-cluster; the density of data points is indicated by the log-scaled colour gradient shown on the right. Most super-clusters contain ≤ 4 DPI-defined TSSs in both species. d, Evaluating the conservation of TSS annotation between species. Projected super-clusters are annotated by the most functional contributing TSS from each species (see Methods). Grey shading in the margins summarizes the proportion of super-clusters with each category of annotation in both mouse (y axis) and human (x axis). Numbers and volumes of circles represent counts of projected super-clusters; for example, there are 34,868 super-clusters in which ≥ 1 human and ≥ 1 mouse component TSS are annotated as protein coding and 719 super-clusters in which the human TSSs are unannotated and at least one of the mouse TSSs is annotated as the 5′ end of a non-coding transcript.
Extended Data Figure 5. De novo derived, cell-state-specific motif signatures
a–c, The de novo motif discovery tools DMF, HOMER, ChIPMunk and ScanAll were applied to detect sequence motifs enriched in the vicinity of sample-specific peaks (a), yielding 8,699 de novo motifs (b). The coverage of known motif space by the de novo motifs was evaluated by comparing them to the SWISSREGULON, HOCOMOCO, TRANSFAC, HOMER, JASPAR, and ENCODE LEXICON motif collections. c, The remaining 1,221 de novo motifs that were not similar to known motifs were then clustered using MACRO-APE, resulting in 169 unique novel motifs. d, Known motifs from the HOMER database were annotated and counted around cell-type-specific TSSs (−300 to +50 bp) associated with CpG islands (CGI) or non-CGI regions. e–g, RNA Pol II ChIP-seq signal and motif finding in ‘housekeeping gene’ promoters with different absolute expression levels. Human housekeeping gene promoters were defined as (log10(max + 0.1) − log10(median + 0.1) ≤ 1). The resulting clusters were then extended by −300 and +50 bp. Overlapping extended clusters were removed by keeping only those with the highest expression. e, Extended clusters were then split into 5 equal-sized bins with decreasing absolute expression. f, RNA Pol II occupancy at binned clusters in ENCODE cell lines (highly expressed genes show the highest occupancy, but even bin5 clusters showing very low tag counts are still highly occupied). g, Bubble plot representation comparing known motif enrichments in bin1 (high expression) and bin5 (low expression) extended CAGE clusters. The bubble plots encode two quantitative parameters per motif: the difference in motif occurrence between bin1 (x axis) and bin5 (y axis), and the adjusted P values for enrichment (bubble diameter). Colouring indicates significantly differentially distributed motifs (5% FDR). The right panel additionally summarizes the fraction of clusters in each bin that contain the indicated motifs along with the Benjamini–Hochberg adjusted hypergeometric P value for differential enrichment.
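The housekeeping criterion and the five-way expression binning in panels e–g can be sketched as below. The function names and the (name, expression) input layout are hypothetical, and the way leftover clusters are distributed when the count is not divisible by five is an assumption; the extension by −300/+50 bp and the overlap filtering are not reproduced here.

```python
import math


def is_housekeeping(max_tpm, median_tpm):
    """Criterion from the legend: log10(max + 0.1) - log10(median + 0.1) <= 1."""
    return math.log10(max_tpm + 0.1) - math.log10(median_tpm + 0.1) <= 1


def expression_bins(clusters, n_bins=5):
    """Split clusters into n_bins near-equal bins of decreasing expression.

    clusters: list of (name, expression) pairs. Returns a list of lists,
    with the most highly expressed clusters in the first bin.
    """
    ranked = sorted(clusters, key=lambda cluster: cluster[1], reverse=True)
    size, remainder = divmod(len(ranked), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        # Assumption: any remainder is spread over the first bins.
        end = start + size + (1 if i < remainder else 0)
        bins.append(ranked[start:end])
        start = end
    return bins


# Hypothetical clusters with made-up expression values.
example = [("c%d" % i, float(i)) for i in range(1, 11)]
bins = expression_bins(example)
```

Here `bins[0]` plays the role of bin1 (high expression) and `bins[4]` the role of bin5 (low expression).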
Extended Data Figure 6. Features of cell-type-specific promoters
a, The distribution of expression log ratios of all individual samples against the median of all samples is shown separately for CGI-associated and non-CGI-associated CAGE clusters. The dashed line corresponds to an expected log ratio if all samples contribute equally to the total expression. b, Histograms for genomic distance distributions of HepG2 DNase I hypersensitivity, H3K4me3, H2A.Z, POL2, P300, GABP, YY1, HNF4A, FOXA1 and FOXA2 ChIP-seq tag counts centred across CGI-associated and non-CGI-associated CAGE clusters (separated according to expression specificities) across a 2 kilobase (kb) genomic region. Expression specificity bins are colour-coded (as indicated in the DNase I panel) with blue representing the highest degree of specificity. Numbers of regions in bins are given in the GABP panel (CGI no. / nCGI no., colour coding as above). c, Histograms for genomic distance distributions of ChIP-seq-derived sequence motifs for GABP, YY1, HNF4A, FOXA1 and FOXA2 (corresponding to the samples in the lower panel of c) centred across CGI-associated and non-CGI-associated CAGE clusters (separated according to expression specificities) across a 2 kb genomic region. Motifs are shown on top. The percentage of promoters overlapping with ChIP-seq peaks (b) or consensus sequences (c) for transcription factors binding the highest specificity clusters (HNF4A, FOXA2, TCF7L2) is also given in blue. d, Plots showing mean expression specificity (high values indicate more constrained expression over cells, see the accompanying manuscript) in enhancers close to RefSeq promoters as a function of promoter CpG content and three classes of promoter expression specificity.
Extended Data Figure 7. Extended features of cell-type-specific promoters
a, Distribution of global expression specificity estimated using primary cells, cell lines or tissues only. b, Distribution of expression specificity for HepG2, GM12878, HeLaS3, K562 and CD14+ monocytes. The distribution of expression log ratios of all individual samples against the median of all samples is shown separately for CGI-associated and non-CGI-associated CAGE clusters; the dashed line corresponds to the expected log ratio if all samples contribute equally to the total expression. c, Histograms for genomic distance distributions of K562 DNase I hypersensitivity, H3K4me3, H2A.Z, POL2, P300 and GATA1 ChIP-seq tag counts centred across CGI-associated and non-CGI-associated CAGE clusters (separated according to expression specificities) across a 2 kb genomic region. Expression specificity bins are colour-coded with blue representing the highest degree of specificity. d, DNase I hypersensitivity, H3K4me3, H2A.Z, POL2, P300 and IRF4 in GM12878. e, DNase I hypersensitivity, H3K4me3 and H2A.Z in HeLaS3. f, DNase I hypersensitivity, H3K4me3, H2A.Z, PU.1 and CEBPB in CD14+ monocytes.
Extended Data Figure 8. Transcription factor promoter expression profile clustering
a, Biolayout visualization of transcription factor coexpression in human primary cells (3,775 nodes, 54,892 edges, r > 0.70, MCLi = 2.2). b, Hierarchical coexpression clustering and heatmap of ETS family transcription factors across the entire human collection (only promoter 1 (p1) data shown).
Extended Data Figure 9. Collapsed coexpression network for mouse coexpression groups
One node is one group of promoters. Derived from expression profiles of 116,277 promoters across 402 primary cell types, tissues and cell lines (r > 0.75, MCLi = 2.2). For display, each group of promoters is collapsed into a sphere, the radius of which is proportional to the cube root of the number of promoters in that group. Edges indicate r > 0.6 between the average expression profiles of each cluster. Colours indicate loosely-associated collections of coexpression groups (MCLi = 1.2). Labels show representative descriptions of the dominant cell type in coexpression groups in each region of the network, and a selection of highly-enriched pathways (FDR < 10⁻⁴) from KEGG (K), WikiPathways (W), Netpath (N) and Reactome (R).
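The display rules for the collapsed network (sphere radius proportional to the cube root of group size, edges where the average profiles correlate at r > 0.6) can be sketched as follows. This is only an illustration of the legend: the function names, the `scale` factor and the toy data are hypothetical, and the MCL clustering that produces the groups is not shown.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is flat)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    sxx = sum(v * v for v in dx)
    syy = sum(v * v for v in dy)
    if sxx == 0 or syy == 0:
        return 0.0
    return sum(a * b for a, b in zip(dx, dy)) / sqrt(sxx * syy)


def mean_profile(profiles):
    """Average expression profile of the promoters in one coexpression group."""
    return [sum(column) / len(profiles) for column in zip(*profiles)]


def node_radius(n_promoters, scale=1.0):
    """Radius proportional to the cube root of the group size."""
    return scale * n_promoters ** (1.0 / 3.0)


def collapsed_edges(groups, r_cutoff=0.6):
    """Edges between groups whose average profiles correlate above the cutoff.

    groups: dict mapping group name to a list of promoter expression profiles.
    """
    means = {name: mean_profile(profiles) for name, profiles in groups.items()}
    edges = []
    for a, b in combinations(sorted(means), 2):
        r = pearson(means[a], means[b])
        if r > r_cutoff:
            edges.append((a, b, r))
    return edges


# Hypothetical groups, each holding promoter profiles over three samples.
groups = {
    "A": [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]],
    "B": [[1.0, 2.0, 3.5]],
    "C": [[3.0, 2.0, 1.0]],
}
edges = collapsed_edges(groups)
```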
Extended Data Figure 10. Annotated expression profiles of alternative promoters
Overlay of coexpression groups enriched for genes involved in the KEGG pathway for influenza A pathogenesis (hsa:05164; FDR < 0.1, n > 2). a, Collapsed coexpression network showing 5 groups enriched for influenza pathogenesis genes: C0 (blue), C26 (purple), C61 (yellow), C187 (green) and C413 (red). b, Excerpt from KEGG pathway diagram showing positions of genes in each coexpression group (background colours as in a). Pathway entities that map to two coexpression groups have the background colour of the smaller group, and the text/border colour of the larger group. Details and promoter-level displays (edges indicate r > 0.75) for two coexpression groups are displayed with transcripts mapping to the KEGG pathway highlighted (inset). In this example the KEGG pathway for influenza A pathogenesis (hsa:05164) was strikingly over-represented in one small coexpression group in particular (C413, P value < 10⁻¹¹, FDR = 4.5 × 10⁻¹⁰). Of 19 promoters in coexpression group 413, eight were present in the KEGG pathway, including RIG-I (DDX58), the gene encoding the receptor for the mitochondrial antiviral signalling pathway. Four of the remaining genes (TRIM21, TRIM22, RTP4 and XAF1) were found to be key host determinants of influenza virus replication in a high-throughput short interfering RNA (siRNA) screen, whereas another, PLSCR1, is required for a normal interferon response to influenza A. The top five transcription factor expression profiles most correlated with C413 were IRF7, IRF9, STAT1, SP100 and ZNFX1, and from motif enrichment analysis, the most frequent motifs found in promoters of cluster C413 were potential IRF-binding motifs. c, p1@IRF9 and p2@IRF9 expression ranked by the ubiquitously expressed p1@IRF9 promoter. d, As in a but ranked by expression of p2@IRF9. e, f, Similar to a and b but showing expression of p1@TRMT5 (housekeeping profile) and p2@TRMT5 (expressed in pathogen-challenged monocytes). g, Histogram showing the number of different coexpression clusters (see Fig. 4) in which named genes with alternative promoters participate. The majority of genes with alternative promoters participate in more than one cluster; 17 genes participate in more than 10 different clusters and are not shown on this graph.
Extended Data Figure 11. Sample ontology enrichment analysis (SOEA)
Expression profile-sample ontology associations were tested by Mann–Whitney rank sum test to identify cell, disease or anatomical ontology terms over-represented in ranked lists of samples expressing each peak. a, p1@CXCL6 enriched in vascular associated smooth muscle cells. b, p5@ST8SIA3 enriched in brain tissues. c, Novel peak enriched in mast cells. d, p1@KIAA0125 enriched in myeloid leukaemia. e, p1@BRI3 enriched in myeloid leukaemia. f, p1@BDNF enriched in fibroblasts. g, Novel peak enriched in leukocytes. h, Novel peak enriched in classical monocytes. i, j, Venn diagrams showing the degree of overlap between peaks associated with known genes (blue), cell ontology enriched (yellow), Uberon anatomical ontology enriched (green) and disease ontology enriched (red). i, At a threshold of 10⁻²⁰ (Mann–Whitney rank sum test), 64% (59,835 out of 93,558) of the expression profiles of human known transcripts and 74% (67,810 out of 91,269) of the expression profiles for novel transcripts show enrichment for one or more sample ontologies. j, Mouse sample ontology enrichment at the same 10⁻²⁰ threshold: 30% (18,273 out of 61,134) of known and 47% (26,176 out of 55,143) of novel expression profiles are enriched.
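A rank sum test of this kind can be sketched in a few lines. This is a simplified illustration, not the SOEA implementation: the function names are hypothetical, the test is one-sided (are samples carrying the ontology term ranked higher by expression?), and the P value uses a normal approximation without tie or continuity correction, which is an assumption.

```python
import math


def average_ranks(values):
    """1-based ranks in ascending order, with ties given their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def rank_sum_enrichment(expression, in_term):
    """Return (U, one-sided P) for term samples being ranked above the rest.

    expression: expression of one peak in each sample.
    in_term: parallel list of booleans, True if the sample has the ontology term.
    """
    n1 = sum(1 for flag in in_term if flag)
    n2 = len(in_term) - n1
    if n1 == 0 or n2 == 0:
        raise ValueError("both groups need at least one sample")
    ranks = average_ranks(expression)
    rank_sum = sum(rank for rank, flag in zip(ranks, in_term) if flag)
    u = rank_sum - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # no tie correction (assumption)
    z = (u - mean_u) / sd_u
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return u, p_value


# Hypothetical peak: high in the three samples annotated with the term.
expression = [9.0, 8.0, 7.0, 1.0, 2.0, 3.0, 0.5, 0.2]
in_term = [True, True, True, False, False, False, False, False]
u_stat, p_value = rank_sum_enrichment(expression, in_term)
```

With only eight samples the approximation is crude; thresholds as stringent as the one in the legend presuppose the hundreds of samples profiled in the atlas.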
Extended Data Figure 12. Sample-to-sample correlation graph
821 nodes and 21,821 edges are shown (r > 0.75). a, Samples are coloured by sample type (primary cell, cell line or tissue). Note the separation of cell lines and primary cells. b, As in a, except that major subgroups are coloured and labelled separately.
Figure 1. Promoter discovery and definition in FANTOM5
a, Samples profiled in FANTOM5. b, Reproducible cell-type-specific CAGE patterns observed for the 266 base CpG island associated B4GALT1 locus transcription initiation region hg19:chr9:33167138.33167403. CAGE profiles for CD4+ T cells (blue), CD14+ monocytes (gold), aortic smooth muscle cells (green) and the adrenal cortex adenocarcinoma cell line SW-13 (red) are shown. A combined pooled profile showing TSS distribution across the entire human collection is shown in black. Values on the y axis correspond to maximum normalized TPM for a single base in each track. c, Decomposition-based peak identification (DPI) finds 6 differentially used peaks within this composite transcription initiation region (note: peaks are labelled from p1@B4GALT1 with most tag support through to p7@B4GALT1 with the least tag support; p4@B4GALT1 is not shown and is in the 3′ UTR of the locus at position hg19::chr9:33111241.33111254−). Note in particular one large broad region on the left used in all samples and a sharp peak to the right, preferentially used in the aortic smooth muscle cells. d, Venn diagram showing DPI defined peaks expressed at ≥10 TPM in primary cells (red), tissues (blue) and cell lines (green). e, Fraction of unannotated peaks observed in subsets of d. P, primary cells, T, tissues, C, cell lines, PT, TC, PC and PTC correspond to peaks found in multiple sample types, for example, PT, found in primary cells and tissue samples.
Figure 2. Cell-type-restricted and housekeeping transcripts encoded in the mammalian genome
a, Density plot summarizing the distribution of relative log expression (RLE) normalized maximum and median TPM expression values for the 185K robustly detected human peaks identified by FANTOM5 (colour bar on the right indicates relative density). Box and whisker plots above and to the right show the distribution of median and maximum values in the data set (box shows the interquartile range). Promoters of named genes are highlighted to show extremes of expression level and expression breadth; note that the alternative promoters of IRF9 and TRMT5 have different maximums and breadths of expression (see Extended Data Fig. 10). Fraction on the left of the red vertical dashed line corresponds to peaks detected in less than 50% of samples with non-ubiquitous (cell-type-restricted) expression patterns (median < 0.2 TPM). Fraction below the red diagonal dashed line corresponds to ubiquitous-uniform (housekeeping) expression profiles (maximum < 10× median). Fraction above the diagonal and to the right of the vertical dashed lines corresponds to ubiquitous-non-uniform expression profiles (maximum > 10× median). b, Box and whisker plots showing the distribution of expression levels for the same peaks as in a across the 889 samples (box shows the interquartile range).
Figure 3. TSS conservation as a function of expression properties and functional annotation
a, b, Human robust TSS coordinates were projected through EPO12 whole genome multiple sequence alignments (Supplementary Methods). The y-axis values show the fraction of human TSSs that align to an orthologous position in the indicated species. The x axis shows the relative divergence of macaque, dog and mouse genomes as the substitution rate at fourfold degenerate sites in protein coding sequence. The TSS locations were genome permuted (Supplementary Methods) and then projected through EPO12 alignments to give the null expectation (dashed blue line). The 95% confidence intervals of 1,000 samples of 1,000 TSSs are shown (blue shading). a, TSSs mapped to the 5′ ends of protein coding and non-coding transcripts are labelled (C and N, respectively); those that do not map to a known transcript 5′ end are shown as the ‘anonymous’ category. With the exception of anonymous, all robust TSSs represented in both panels are associated with the 5′ ends of previously annotated transcripts. Non-ubiquitous (cell-type-restricted), ubiquitous-uniform (housekeeping) and non-uniform-ubiquitous were defined as in Fig. 2. Ultra-housekeeping TSSs were defined as those with less than fivefold difference between maximum and median. The category top 1000 UDE represents the 1,000 ubiquitous TSSs that are most differentially expressed. There are 1,016 ultra-housekeeping TSSs and 276 ubiquitous-uniform non-coding TSSs; all other categories contain over 2,000 TSSs. b, Same axes as panel a showing TSSs with expression that is biased towards a single expression facet (larger mutually exclusive grouping of the primary cell and tissue samples based on the sample ontologies CO and UBERON, defined in ref. 4). Only expression facets with greater than 250 enriched TSSs are shown. For clarity, only a subset of expression facets is coloured and labelled.
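One plausible reading of the null band (95% confidence intervals from 1,000 samples of 1,000 permuted TSSs) is a percentile interval over repeated subsamples, sketched below. The legend does not state the exact procedure, so sampling with replacement, the percentile method and all names here are assumptions; the input is a list of booleans saying whether each permuted TSS aligned to the other species.

```python
import random


def null_confidence_interval(aligned_flags, n_samples=1000, sample_size=1000, seed=0):
    """Percentile 95% interval of the aligned fraction over repeated subsamples.

    aligned_flags: one boolean per permuted TSS, True if it projected
    to an orthologous position.
    """
    rng = random.Random(seed)
    fractions = []
    for _ in range(n_samples):
        # Assumption: subsamples are drawn with replacement.
        sample = rng.choices(aligned_flags, k=sample_size)
        fractions.append(sum(sample) / sample_size)
    fractions.sort()
    lower = fractions[int(0.025 * n_samples)]
    upper = fractions[int(0.975 * n_samples) - 1]
    return lower, upper


# Hypothetical permuted set in which 30% of TSSs align.
flags = [True] * 300 + [False] * 700
lower, upper = null_confidence_interval(flags)
```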
Figure 4. Coexpression clustering of human promoters in FANTOM5
Collapsed coexpression network derived from 4,882 coexpression groups (one node is one group of promoters; 4,664 groups are shown here) derived from expression profiles of 124,090 promoters across all primary cell types, tissues and cell lines (visualized using Biolayout Express3D (ref. 45), r > 0.75, MCLi = 2.2). For display, each group of promoters is collapsed into a sphere, the radius of which is proportional to the cube root of the number of promoters in that group. Edges indicate r > 0.6 between the average expression profiles of each cluster. Colours indicate loosely-associated collections of coexpression groups (MCLi = 1.2). Labels show representative descriptions of the dominant cell type in coexpression groups in each region of the network, and a selection of highly-enriched pathways (FDR < 10⁻⁴) from KEGG (K), WikiPathways (W), Netpath (N) and Reactome (R). Promoters and genes in the coexpression groups are available online at http://fantom.gsc.riken.jp/5/data/.
