µProteInS-a proteogenomics pipeline for finding novel bacterial microproteins encoded by small ORFs

Eduardo Vieira de Souza; Pedro Ferrari Dalberto; Vinicius Pellisoli Machado; Adriana Canedo; Alan Saghatelian; Pablo Machado; Luiz Augusto Basso; Cristiano Valim Bizarro

doi:10.1093/bioinformatics/btac115

Bioinformatics. 2022 Apr 28;38(9):2612-2614. doi: 10.1093/bioinformatics/btac115.

Authors

Eduardo Vieira de Souza¹,²,³, Pedro Ferrari Dalberto¹, Vinicius Pellisoli Machado¹, Adriana Canedo¹, Alan Saghatelian³, Pablo Machado¹,²,⁴, Luiz Augusto Basso¹,²,⁴, Cristiano Valim Bizarro¹,²

Affiliations

¹ Instituto Nacional de Ciência e Tecnologia em Tuberculose, Centro de Pesquisas em Biologia Molecular e Funcional (CPBMF), Pontifícia Universidade Católica do Rio Grande do Sul (PUCRS), 90619-900, Partenon, Porto Alegre, Brazil.
² Programa de Pós-Graduação em Biologia Celular e Molecular, Escola de Ciências da Saúde e da Vida, PUCRS, Partenon, Porto Alegre, Brazil.
³ Clayton Foundation Laboratories for Peptide Biology, Salk Institute for Biological Studies, La Jolla, CA, USA.
⁴ Programa de Pós-Graduação em Medicina e Ciências da Saúde, Escola de Medicina, PUCRS, Partenon, Porto Alegre, Brazil.

PMID: 35188179
DOI: 10.1093/bioinformatics/btac115

Abstract

Summary: Genome annotation pipelines traditionally exclude open reading frames (ORFs) shorter than 100 codons to avoid false identifications. However, studies have shown that these short ORFs may encode functional microproteins with meaningful biological roles. We developed µProteInS, a proteogenomics pipeline that combines genomics, transcriptomics and proteomics to identify novel microproteins in bacteria. The pipeline employs a model to filter out low-confidence spectra, which avoids the need to inspect mass spectrometry data manually. It also overcomes a shortcoming of traditional approaches, which usually exclude overlapping genes, leaderless transcripts and non-conserved sequences. These characteristics are common among small ORFs (smORFs) and hamper their identification.
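To make the 100-codon cutoff concrete, here is a minimal Python sketch that scans both strands of a DNA sequence for ORFs below that length. It is an illustration only and is not taken from µProteInS: the function names, the `Orf` record, the choice of ATG/GTG/TTG as start codons, and the rule of keeping the most upstream start for each stop codon are all assumptions made for this example.

```python
from typing import Iterator, NamedTuple

# Assumption: start codons commonly used in bacteria; the real pipeline may differ.
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}
MAX_CODONS = 100  # ORFs shorter than this are the "small ORFs" of the abstract

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


class Orf(NamedTuple):
    """Hypothetical record for one ORF found on a scanned strand."""
    strand: str  # "+" or "-"
    start: int   # 0-based offset of the start codon on the scanned strand
    end: int     # exclusive end offset, stop codon included
    codons: int  # number of codons from start to stop, stop excluded


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase ACGT sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def _scan_strand(seq: str, strand: str, min_codons: int) -> Iterator[Orf]:
    """Yield small ORFs in the three reading frames of one strand."""
    for frame in range(3):
        start = None  # offset of the most upstream start codon seen so far
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i
            elif start is not None and codon in STOP_CODONS:
                codons = (i - start) // 3
                if min_codons <= codons < MAX_CODONS:
                    yield Orf(strand, start, i + 3, codons)
                start = None  # look for the next ORF in this frame


def find_small_orfs(seq: str, min_codons: int = 2) -> Iterator[Orf]:
    """Yield ORFs shorter than MAX_CODONS on both strands of seq."""
    seq = seq.upper()
    yield from _scan_strand(seq, "+", min_codons)
    yield from _scan_strand(reverse_complement(seq), "-", min_codons)


if __name__ == "__main__":
    demo = "CC" + "ATG" + "AAA" * 3 + "TAA" + "GG"
    for orf in find_small_orfs(demo):
        print(orf)
```

A scan like this produces many candidates that are never translated, which is why the abstract stresses combining the genomic search with transcriptomic and proteomic evidence rather than relying on sequence alone.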

Availability and implementation: µProteInS is implemented in Python 3.8 within an Ubuntu 20.04 environment. It is open-source software distributed under the GNU General Public License v3 and is available as a command-line tool. It can be downloaded from https://github.com/Eduardo-vsouza/uproteins and either installed from source or run as a Docker image.

Supplementary information: Supplementary data are available at Bioinformatics online.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Bacteria / genetics
Genomics / methods
Open Reading Frames
Proteogenomics* / methods
Software