Genome Biol. 2015 Sep 14;16:179.
doi: 10.1186/s13059-015-0742-x.

Extensive identification and analysis of conserved small ORFs in animals

Sebastian D Mackowiak et al. Genome Biol. 2015.

Abstract

Background: There is increasing evidence that transcripts or transcript regions annotated as non-coding can harbor functional short open reading frames (sORFs). Loss-of-function experiments have identified essential developmental or physiological roles for a few of the encoded peptides (micropeptides), but genome-wide experimental or computational identification of functional sORFs remains challenging.

Results: Here, we expand our previously developed method and present results of an integrated computational pipeline for the identification of conserved sORFs in human, mouse, zebrafish, fruit fly, and the nematode C. elegans. Isolating specific conservation signatures indicative of purifying selection on amino acid (rather than nucleotide) sequence, we identify about 2,000 novel small ORFs located in the untranslated regions of canonical mRNAs or on transcripts annotated as non-coding. Predicted sORFs show stronger conservation signatures than those identified in previous studies and are sometimes conserved over large evolutionary distances. The encoded peptides have little homology to known proteins and are enriched in disordered regions and short linear interaction motifs. Published ribosome profiling data indicate translation of more than 100 novel sORFs, and mass spectrometry data provide evidence for more than 70 novel candidates.

Conclusions: Taken together, we identify hundreds of previously unknown conserved sORFs in major model organisms. Our computational analyses and integration with experimental data show that these sORFs are expressed, often translated, and sometimes widely conserved, in some cases even between vertebrates and invertebrates. We thus provide an integrated resource of putatively functional micropeptides for functional validation in vivo.

Figures

Fig. 1
Identification of conserved sORFs in five animals. a Overview of the pipeline. (1) Annotated transcripts are searched for ORFs and specific conservation features are extracted from the multiple species alignment (2). (3) An SVM classifier is used to predict coding sORFs (≤100 aa) with high specificity and sensitivity (b). (4) sORFs overlapping with larger predicted sORFs or with conserved annotated coding exons are removed (c). d Distribution of predicted sORFs in different regions of the transcriptome. e Length distribution of predicted sORFs
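
Step (1) of the pipeline, searching transcripts for short ORFs, can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `find_sorfs`, the ATG-only start codon rule, and the choice to report only the most upstream in-frame ATG per stop codon are assumptions made for illustration. Only the 100 aa cutoff comes from the caption.

```python
# Hypothetical sketch of an ORF scan; not the pipeline used in the paper.
STOP_CODONS = {"TAA", "TAG", "TGA"}
MAX_AA = 100  # sORF length cutoff stated in the caption (<= 100 aa)


def find_sorfs(transcript, max_aa=MAX_AA, min_aa=1):
    """Return sorted (start, end, frame) tuples for ATG..stop ORFs.

    Coordinates are 0-based and end-exclusive, and include the stop codon.
    Assumption: only ATG starts are considered, and for each stop codon
    only the most upstream in-frame ATG is reported.
    """
    seq = transcript.upper().replace("U", "T")
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                n_aa = (i - start) // 3  # residues, stop codon excluded
                if min_aa <= n_aa <= max_aa:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)


if __name__ == "__main__":
    print(find_sorfs("CCATGAAATGGTAAGG"))
```

A real pipeline would additionally map each ORF back to genomic coordinates so that conservation features can be read from the multiple species alignment; that step is omitted here.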
Fig. 2
Predicted sORFs are under purifying selection and often widely conserved. a Adjusted phyloCSF scores for predicted sORFs are higher than those from control sORFs matched by their nucleotide conservation level (phastCons). b The dN/dS ratio of SNPs for novel predicted sORFs is smaller than for control ORFs in non-coding regions of the transcriptome, but larger than for annotated sORFs. c Percentage of sORFs conserved in ancestral species as inferred from the multiple species alignment. Numbers for informative ancestors are indicated (for example, the ancestors of primates, placental mammals, and jawed vertebrates for H. sapiens). Symbols mark different reference species as in (d). d Homology clustering of predicted sORFs in different species; only clusters with at least one non-annotated member and members from more than one species are shown, with multiplicity indicated. ***P < 0.001; **P < 0.01; *P < 0.05; Mann-Whitney tests in a, reciprocal χ² tests in b
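
The dN/dS comparison in panel b rests on classifying each SNP inside an ORF as synonymous or non-synonymous. The sketch below shows only that classification step, using the standard genetic code. The function names and the toy ORF are hypothetical, and the raw counts it returns are not a dN/dS ratio: a proper ratio also normalizes by the number of synonymous and non-synonymous sites, which is left out here.

```python
# Hypothetical sketch: classify SNPs in an ORF using the standard genetic code.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_snp(orf, pos, alt):
    """Classify a SNP at 0-based ORF position `pos` with alternative base `alt`."""
    codon_start = pos - pos % 3
    ref_codon = orf[codon_start:codon_start + 3].upper()
    offset = pos - codon_start
    alt_codon = ref_codon[:offset] + alt.upper() + ref_codon[offset + 1:]
    same_residue = CODON_TABLE[ref_codon] == CODON_TABLE[alt_codon]
    return "synonymous" if same_residue else "nonsynonymous"


def count_snps(orf, snps):
    """Count synonymous and non-synonymous SNPs given (pos, alt) pairs."""
    counts = {"synonymous": 0, "nonsynonymous": 0}
    for pos, alt in snps:
        counts[classify_snp(orf, pos, alt)] += 1
    return counts


if __name__ == "__main__":
    toy_orf = "ATGGCTTAA"  # Met-Ala-stop, made up for illustration
    print(count_snps(toy_orf, [(5, "C"), (3, "A"), (4, "A")]))
```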
Fig. 3
Predicted sORFs are under stronger selection than those found in other studies. Previous results obtained by ribosome profiling (a-c), mass spectrometry (d-f) or computationally (g-o) are compared with respect to their adjusted phyloCSF scores and the dN/dS ratio as indicated in the scheme (top left). For each publication analyzing sORFs in different organisms and genomic regions, the numbers of predicted sORFs that are also predicted here (before overlap filter) or at least analyzed, respectively, are given. phyloCSF scores and dN/dS ratios are compared for the sORFs that are predicted either here or in another study but not in both. tw: this work. ***P < 0.001; **P < 0.01; *P < 0.05, using Mann-Whitney (phyloCSF scores) and reciprocal χ² tests (dN/dS), respectively
Fig. 4
Properties of encoded peptide sequences. a Only a small fraction of novel peptides has significant homology to known longer proteins. b Novel predicted peptides are more disordered than annotated short proteins or conceptual products from length-matched control ORFs in non-coding regions, and they also have a higher density of linear peptide motifs (c). d Some novel sORFs are predicted to contain signal peptide sequences, but not consistently more than expected. ***P <0.001; **P <0.01; *P <0.05, Mann-Whitney tests in b and c, binomial test in d
Fig. 5
dORFs (sORFs in 3′UTRs) are not explained by stop-codon read-through or alternative terminal exons. Results are shown for H. sapiens. a The step in the phastCons conservation track near the stop codon of the upstream CDS is only slightly less pronounced than for CDSs without a downstream conserved sORF. b The dORFs are closer to the CDS than control sORFs, but they are not more often in the same frame (c), and they have a similarly high number of intervening in-frame stop codons (d). e The step in the phastCons conservation track near the start of predicted dORFs is more pronounced than in other dORFs. f Even before applying the overlap filter, very few predicted dORFs overlap with annotated coding exons. ***P < 0.001; **P < 0.01; *P < 0.05; n.s. not significant. Mann-Whitney tests in a, d, and e, Kolmogorov-Smirnov test in b, χ² test in c, binomial test in f
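
The three quantities compared in panels b to d (distance to the CDS, shared reading frame, intervening in-frame stop codons) can be computed per dORF roughly as follows. This is a sketch under assumptions: the function name, the coordinate convention, and the decision to count stop codons in the frame of the upstream CDS are illustrative choices, not taken from the paper.

```python
# Hypothetical sketch of the per-dORF quantities discussed in the caption.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def dorf_context(transcript, cds_end, dorf_start):
    """Describe a dORF relative to the upstream CDS.

    cds_end:    0-based index just past the CDS stop codon.
    dorf_start: 0-based index of the first base of the dORF start codon.
    Assumption: "in-frame" stops are counted in the frame of the CDS,
    since these are the stops a read-through ribosome would encounter.
    """
    seq = transcript.upper().replace("U", "T")
    distance = dorf_start - cds_end
    same_frame = distance % 3 == 0
    # Walk codons in the CDS frame that end at or before the dORF start.
    intervening_stops = sum(
        seq[i:i + 3] in STOP_CODONS
        for i in range(cds_end, dorf_start - 2, 3)
    )
    return {
        "distance": distance,
        "same_frame": same_frame,
        "intervening_stops": intervening_stops,
    }


if __name__ == "__main__":
    # Toy transcript: CDS + 9 nt spacer containing one TAG + dORF.
    toy = "ATGAAATAA" + "GGGTAGCCC" + "ATGTTTTGA"
    print(dorf_context(toy, cds_end=9, dorf_start=18))
```

A dORF in the same frame as the CDS with no intervening stop codon would be compatible with read-through; the caption reports that predicted dORFs are not enriched for this configuration.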
Fig. 6
Experimental evidence supports translation of predicted sORFs and protein expression. a Translation is detected using the ORFscore method [10] on published ribosome profiling data. The Kolmogorov-Smirnov D-statistic is used to assess the performance of the dataset by comparing annotated sORFs to the negative control (dark gray). Length-matched non-conserved sORFs from non-coding transcriptome regions are included for comparison (light gray). ***P < 0.001; **P < 0.01; *P < 0.05 (Mann-Whitney test). b Peptide expression of many predicted sORFs is confirmed by mining in-house and published mass spectrometry datasets from cell lines and model organisms
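
The two-sample Kolmogorov-Smirnov D-statistic mentioned in panel a is the largest vertical distance between two empirical cumulative distribution functions. A minimal sketch follows. The score lists are made-up placeholders, not data from the paper, and the ORFscore computation itself is not reproduced here.

```python
# Hypothetical sketch: two-sample Kolmogorov-Smirnov D-statistic.
from bisect import bisect_right


def ks_d_statistic(sample_a, sample_b):
    """Return the maximum absolute difference between the two empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # The empirical CDFs only change at observed values, so checking
    # those points is sufficient to find the maximum difference.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


if __name__ == "__main__":
    annotated_scores = [4.1, 5.3, 6.0, 7.2]   # placeholder values
    control_scores = [-1.0, 0.2, 0.5, 4.5]    # placeholder values
    print(ks_d_statistic(annotated_scores, control_scores))
```

A D-statistic near 1 would indicate that a dataset separates annotated sORFs from the negative control almost completely, while a value near 0 would indicate little separation.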

Comment in

  • Finding smORFs: getting closer.
    Couso JP. Genome Biol. 2015 Sep 14;16(1):189. doi: 10.1186/s13059-015-0765-3. PMID: 26364669. Free PMC article.

References

    1. ENCODE Project Consortium. Birney E, Stamatoyannopoulos JA, Dutta A, Guigo R, Gingeras TR, et al. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature. 2007;447:799–816. doi: 10.1038/nature05874.
    2. Ulitsky I, Bartel DP. lincRNAs: Genomics, evolution, and mechanisms. Cell. 2013;154:26–46. doi: 10.1016/j.cell.2013.06.020.
    3. Engreitz JM, Pandya-Jones A, McDonel P, Shishkin A, Sirokman K, Surka C, et al. The Xist lncRNA exploits three-dimensional genome architecture to spread across the X chromosome. Science. 2013;341:1237973. doi: 10.1126/science.1237973.
    4. Cesana M, Cacchiarelli D, Legnini I, Santini T, Sthandier O, Chinappi M, et al. A long noncoding RNA controls muscle differentiation by functioning as a competing endogenous RNA. Cell. 2011;147:358–69. doi: 10.1016/j.cell.2011.09.028.
    5. Bassett AR, Akhtar A, Barlow DP, Bird AP, Brockdorff N, Duboule D, et al. Considerations when investigating lncRNA function in vivo. Elife. 2014;3:e03058. doi: 10.7554/eLife.03058.
