EMBO J. 2014 May 2;33(9):981-93.
doi: 10.1002/embj.201488411. Epub 2014 Apr 4.

Identification of Small ORFs in Vertebrates Using Ribosome Footprinting and Evolutionary Conservation

Free PMC article

Ariel A Bazzini et al. EMBO J. 2014.


Identification of the coding elements in the genome is a fundamental step to understanding the building blocks of living systems. Short peptides (< 100 aa) have emerged as important regulators of development and physiology, but their identification has been limited by their size. We have leveraged the periodicity of ribosome movement on the mRNA to define actively translated ORFs by ribosome footprinting. This approach identifies several hundred translated small ORFs in zebrafish and human. Computational prediction of small ORFs from codon conservation patterns corroborates and extends these findings and identifies conserved sequences in zebrafish and human, suggesting functional peptide products (micropeptides). These results identify micropeptide-encoding genes in vertebrates, providing an entry point to define their function in vivo.


Figure 1. Ribosome profiling in zebrafish

A Schematic representation of ribosome profiling: 28 to 29-nt-long ribosome-protected fragments (RPFs) are generated from nuclease digestion, where the P-site of the ribosome is in position 13.

B Developmental stages at which ribosome profiling was performed.

C Subcodon position of the ribosome footprints (position 13) for the RPF and input reads. Plot shows the proportion of RPFs or input reads aligned to the coding sequence of RefSeq genes at each position relative to the codon. Input reads were obtained after poly-(A) fractionation and random fragmentation of the naked RNA.

D RPFs and input reads mapped to a composite RefSeq transcript. RPFs mainly map to the CDS with a 3-nucleotide periodicity. RPF reads are colored as in (C) based on the position with respect to the frame of the CDS. Input reads map to both the UTRs and CDS (gray).

E Subcodon profile plot showing RPF and input reads aligned to actinb1. Reads are colored based on the frame (1, 2 or 3) position relative to the transcript (Michel et al, 2012). All putative ORFs (distal AUG-Stop) were also colored for each respective frame (blue, pink and green boxes). Note that most of the RPFs from the annotated ORF match the color of the box, consistent with a strong in-frame distribution of reads within individual transcripts.
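
The subcodon assignment described in this caption can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes reads are given as 0-based transcript coordinates of their 5' ends, takes the P-site to be position 13 of the read (offset 12, from the schematic above), and uses a hypothetical function name `frame_proportions`.

```python
from collections import Counter

# Position 13 of the read, expressed as a 0-based offset from the 5' end.
P_SITE_OFFSET = 12


def frame_proportions(read_starts, cds_start, cds_end, offset=P_SITE_OFFSET):
    """Return the fraction of reads whose P-site falls on codon position 1, 2, 3.

    read_starts: 0-based transcript coordinates of each read's 5' end.
    cds_start, cds_end: 0-based, half-open CDS interval on the same transcript.
    Reads whose P-site lies outside the CDS are ignored.
    """
    counts = Counter()
    for start in read_starts:
        p_site = start + offset
        if cds_start <= p_site < cds_end:
            counts[(p_site - cds_start) % 3] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0, 0.0, 0.0]
    return [counts[frame] / total for frame in range(3)]
```

For ribosome-protected fragments, most of the mass would be expected at the first codon position, whereas randomly fragmented input reads should be spread roughly evenly across the three positions.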

Figure 2. Defining actively translated regions by ribosome profiling

A Workflow to define the ORFscore: Top diagram represents a transcript, below solid bars represent all possible ORFs (Distal AUG-Stop) identified in each frame (+1, +2, +3). The RPF distribution in each frame is compared to an equally sized uniform distribution using a modified chi-squared statistic (see Materials and Methods). The resulting ORFscore is assigned a negative value when the distribution of RPFs is inconsistent with the frame of the CDS.

B Coverage is determined by measuring the proportion of in-frame CDS positions with ≥ 1 reads.
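
The two quantities in this caption can be illustrated with a short sketch. The exact statistic is defined in the Materials and Methods, which are not reproduced here, so the form used below is an assumption: the chi-squared statistic of the three frame counts against a uniform expectation, log2-transformed after adding one, and negated when frame 1 is not the most populated frame. The function names `orf_score` and `coverage` are hypothetical.

```python
import math


def orf_score(frame_counts):
    """Signed, log-scaled chi-squared score for RPF counts in frames 1, 2, 3.

    ASSUMED form (not taken from the text above):
        log2(sum((F_i - mean)^2 / mean) + 1),
    negated when frame 1 has fewer reads than frame 2 or frame 3.
    """
    f1, f2, f3 = frame_counts
    mean = (f1 + f2 + f3) / 3  # expected count per frame if reads were uniform
    if mean == 0:
        return 0.0
    chi2 = sum((f - mean) ** 2 / mean for f in frame_counts)
    score = math.log2(chi2 + 1)
    return -score if (f1 < f2 or f1 < f3) else score


def coverage(p_sites, orf_length):
    """Fraction of in-frame positions (0, 3, 6, ...) carrying at least one read.

    p_sites: P-site positions relative to the first nucleotide of the ORF.
    orf_length: ORF length in nucleotides.
    """
    in_frame = range(0, orf_length, 3)
    if len(in_frame) == 0:
        return 0.0
    covered = {p for p in p_sites if 0 <= p < orf_length and p % 3 == 0}
    return len(covered) / len(in_frame)
```

Under these assumptions, an ORF with all reads in frame 1 gets a positive score, the same reads shifted to frame 2 give the same magnitude with a negative sign, and evenly distributed reads score zero.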

Figure 3. ORFscore discriminates translated from non-translated regions
A–D Scatterplot of the ORFscore and coverage for all ORFs (A), the subset of ORFs with the highest ORFscore per transcript (B) and short (20–100 aa) annotated CDS (D). Relative density plots (scaled to the maximum value for each group) of the ORFscore and coverage are shown for each ORF type. Note the separation between annotated ORFs from the rest of the ORFs, even for short (20–100 aa) annotated CDSs. (C) Color code used to label different ORF types found in RefSeq protein-coding transcripts: annotated CDS (green), 5′UTR ORFs (purple), 3′UTR ORFs (red) and ORFs overlapping the annotated CDS (orange).
E Bar plots representing the number of ORFs identified on the basis of their ORFscore and coverage and defined as translated for each ORF type as in (C). Among all putative ORFs, the distribution of annotated ORFs was significantly different from the overall set (P = 2.2e-16, chi-squared test) with long and short CDS showing the highest fold-change enrichment in translated ORFs compared to other ORF types.
Figure 4. Identification of small coding ORFs (smORFs) in non-coding RNAs
A Scatterplot of ORFscore and coverage for the ORF with highest ORFscore per transcript. Shown are annotated short ORF (20–100 aa) (green), annotated lincRNA and “processed transcripts” from Ensembl (orange), non-coding RNAs described by Ulitsky et al (2011) (set 1, dark blue) and by Pauli et al (2012) (set 2, light blue) and ORFs in annotated 3′UTR used as negative control (red). Note that several ORFs in non-coding annotated transcripts score at comparable levels to annotated CDSs. Inset shows the scatter plot for annotated smORFs and 3′UTR ORFs. Relative density plots (scaled to the maximum value for each group) of the ORFscore and coverage are shown for each ORF type.
B Subcodon profile plot showing a known non-coding RNA, cyrano, depleted of ribosome footprints.
C Stacked plot showing the proportion of genes in which a translated ORF was defined by ORFscore and 10% coverage (*, stringent) or only ORFscore (**, permissive) and transcripts with low ORFscore (undetermined). The number of transcripts in each fraction is indicated.
D Pie chart of BLASTp results against several organisms for the 241 newly defined translated regions, collapsed on amino acid sequence.
E Bar plot showing the number of unique novel smORFs and Ensembl-predicted smORFs (≤ 100 aa), defined by ORFscore and 10% coverage (*, stringent and predicted) or only ORFscore (**, permissive).
F Bar plot displaying the number of novel and Ensembl-predicted smORFs identified by tandem mass spectrometry (MS-MS).
G Box plot representing the size distribution of the ORFs defined by ORFscore and MS-MS.
H Bar plot showing the number of genes with translated ORFs in the 5′ or 3′ UTR defined by ORFscore or detected by MS-MS.
I, J Subcodon profile plots showing individual examples of identified smORFs: Ribosome profiling data show the translated ORF and fragmentation spectra identifying the encoded peptides.
K Heat-map showing dynamic expression of novel smORF-containing genes during zebrafish embryogenesis (n = 190).
Figure 5. Computational identification of evolutionarily conserved smORFs (MicPDP)
A Number of smORFs detected within putative non-coding RNA transcripts in zebrafish and human.
B, C Scatterplot of ORFscore and phyloCSF score for 686 zebrafish and 45,079 human smORFs with sufficient alignment coverage. The predictions of the two methods have small but significant overlap (light blue dots; P < 2e-22 and P < 6.3e-9, respectively, Fisher's exact test), and zebrafish experimental and computational results are correlated (Spearman's ρ = 0.49, P < 4e-42).
D Scatterplot of ORFscore and coverage for 2,000 randomly selected human Ensembl-annotated coding ORFs (green), 2,000 ORFs in the 3′UTR and the set of coding ORFs from human lincRNAs as defined by ORFscore (blue, best ORFscore per unique genomic locus).
E Subcodon profile plot, showing a smORF in the human predicted non-coding RNA ENST00000426713 (LINC00116-002) that presented high phyloCSF score and ORFscore.
