BAIUCAS: a novel BLAST-based algorithm for the identification of upstream open reading frames with conserved amino acid sequences and its application to the Arabidopsis thaliana genome

Bioinformatics. 2012 Sep 1;28(17):2231-41. doi: 10.1093/bioinformatics/bts303. Epub 2012 May 21.

Abstract

Motivation: Upstream open reading frames (uORFs) are often found in the 5'-untranslated regions of eukaryotic messenger RNAs. Some uORFs have been shown to encode functional peptides involved in the translational regulation of the downstream main ORFs. Comparative genomic approaches have been used in genome-wide searches for uORFs encoding bioactive peptides. By comparing uORF sequences between a few selected species or among a small group of species, uORFs with conserved amino acid sequences (UCASs) have been identified in plants, mammals and insects. The regulatory regions within uORF-encoded peptides that are involved in translational control are typically 10-20 amino acids long, and detection of homology between such short regions depends largely on the selection of species for comparison. To maximize the chances of identifying UCASs with short conserved regions, we devised a novel algorithm that performs homology searches among a large number of species and automatically selects uORFs conserved in a wide range of species.
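To make the notion of a uORF concrete, the following is a minimal Python sketch of how candidate uORFs might be enumerated from a 5'-untranslated region and translated into peptides before any homology search. It is an illustration only and is not the BAIUCAS implementation: the function name `find_uorfs`, the `min_codons` threshold and the requirement that the stop codon lie inside the 5'-UTR are all assumptions made for this example.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def find_uorfs(utr5, min_codons=2):
    """Return (start, end, peptide) for every ATG...stop ORF inside utr5.

    Hypothetical helper: coordinates are 0-based, `end` is exclusive and
    includes the stop codon. ORFs that run off the end of the 5'-UTR
    (i.e. overlap the main ORF) are ignored in this simplified sketch.
    """
    seq = utr5.upper().replace("U", "T")  # accept RNA or DNA input
    uorfs = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        peptide = []
        # Walk codon by codon in the reading frame set by this ATG.
        for pos in range(start, len(seq) - 2, 3):
            aa = CODON_TABLE.get(seq[pos:pos + 3], "X")  # X = ambiguous codon
            if aa == "*":
                if len(peptide) >= min_codons:
                    uorfs.append((start, pos + 3, "".join(peptide)))
                break
            peptide.append(aa)
    return uorfs


if __name__ == "__main__":
    for start, end, peptide in find_uorfs("GGATGAAATTTTAGCC"):
        print(start, end, peptide)
```

The peptides returned by such a step would then serve as queries for a protein-level homology search. Note that in-frame ATG codons inside a longer uORF are reported as separate, nested candidates; whether to keep only the longest one is a design choice left open here.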

Results: In this study, we developed the BAIUCAS (BLAST-based algorithm for identification of UCASs) method and identified 18 novel Arabidopsis uORFs whose amino acid sequences are conserved across diverse eudicot species, including uORFs that were not found in previous comparative genomic studies because of low sequence conservation among species. BAIUCAS is therefore a powerful method for the identification of UCASs, and it is particularly useful for the detection of uORFs with a small number of conserved amino acid residues.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • 5' Untranslated Regions
  • Algorithms*
  • Amino Acid Sequence
  • Amino Acids / genetics
  • Arabidopsis / genetics*
  • Base Sequence
  • Conserved Sequence*
  • Genome, Plant*
  • Genomics / methods
  • Open Reading Frames*
  • RNA, Messenger / genetics
  • Regulatory Sequences, Nucleic Acid
  • Sequence Homology
  • Species Specificity

Substances

  • 5' Untranslated Regions
  • Amino Acids
  • RNA, Messenger