Prediction of protein-coding small ORFs in multi-species using integrated sequence-derived features and the random forest model

Methods. 2023 Feb:210:10-19. doi: 10.1016/j.ymeth.2022.12.003. Epub 2023 Jan 5.

Abstract

Proteins encoded by small open reading frames (sORFs) can act as functional elements with important roles in vivo. Such sORFs also form a potential pool for de novo gene birth, driving evolutionary innovation and species diversity. Their theoretical and experimental identification has therefore become a critical issue. Here, we propose a method for predicting protein-coding sORFs that relies solely on integrated sequence-derived features. Its prediction performance is better than or comparable to that of nine other prevalent methods, indicating that it can serve as a relatively reliable research tool for predicting protein-coding sORFs. The method also lets users estimate the potential expression of a queried sORF, as shown by a correlation analysis between our estimated coding possibility and the codon adaptation index (CAI). Using the same features, we show that the sequence features of protein-coding sORFs differ significantly between the two domains, eukaryotes and prokaryotes, which implies that cross-domain prediction may be relatively hard. We therefore developed domain-specific models, which allow users to predict protein-coding sORFs in both eukaryotes and prokaryotes. Finally, we developed a web server to facilitate research in this field; it is freely available at http://guolab.whu.edu.cn/codingCapacity/index.html.
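
The abstract does not state how the CAI values used in the correlation analysis were computed. As an illustration only, the sketch below follows the standard definition of the index: the geometric mean, over the codons of a sequence, of each codon's relative adaptiveness (its frequency in a reference gene set divided by the frequency of the most used synonymous codon). The function names, the reference sequences, and the pseudocount of 0.5 are assumptions made for this example and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(
    zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO)
)


def codons_of(seq):
    """Split a nucleotide sequence into complete in-frame codons."""
    seq = seq.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def relative_adaptiveness(reference_seqs, pseudocount=0.5):
    """Weight of each codon relative to its most used synonymous codon.

    The pseudocount (an assumption of this sketch) keeps codons that never
    occur in the reference set from receiving a weight of zero.
    """
    counts = Counter(
        c for s in reference_seqs for c in codons_of(s) if c in CODON_TABLE
    )
    synonyms = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        synonyms[aa].append(codon)

    weights = {}
    for aa, group in synonyms.items():
        # Stop codons and single-codon amino acids (Met, Trp) carry no
        # information about codon preference, so they are skipped.
        if aa == "*" or len(group) == 1:
            continue
        best = max(counts[c] + pseudocount for c in group)
        for c in group:
            weights[c] = (counts[c] + pseudocount) / best
    return weights


def cai(seq, weights):
    """Codon adaptation index: geometric mean of the codon weights."""
    logs = [math.log(weights[c]) for c in codons_of(seq) if c in weights]
    if not logs:
        raise ValueError("sequence contains no scorable codons")
    return math.exp(sum(logs) / len(logs))


# Hypothetical reference set; real use would take highly expressed genes
# from the species of interest.
reference = ["ATGGCTGCTGCTGCCTAA"]
weights = relative_adaptiveness(reference)
score = cai("ATGGCTGCCTAA", weights)
```

With per-sORF CAI values in hand, a correlation analysis of the kind the abstract describes would compare them against the model's output for the same sORFs, for instance with a Pearson or Spearman coefficient; which coefficient the authors used is not stated here.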

Keywords: Protein-coding sORFs; Random forest-based model; Sequence-derived features.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Open Reading Frames / genetics
  • Random Forest*