moPepGen: Rapid and Comprehensive Proteoform Identification

Chenghao Zhu; Lydia Y Liu; Takafumi N Yamaguchi; Helen Zhu; Rupert Hugh-White; Julie Livingstone; Yash Patel; Thomas Kislinger; Paul C Boutros

doi:10.1101/2024.03.28.587261

bioRxiv [Preprint]. 2024 Mar 31:2024.03.28.587261. doi: 10.1101/2024.03.28.587261.

Authors

Chenghao Zhu^{1,2,3,4}, Lydia Y Liu^{1,2,5,6,7}, Takafumi N Yamaguchi^{1,2,3}, Helen Zhu^{5,6,7}, Rupert Hugh-White^{1,2,3}, Julie Livingstone^{1,2,3}, Yash Patel^{1,2,3}, Thomas Kislinger^{5,6}, Paul C Boutros^{1,2,3,4,5}

Affiliations

¹ Department of Human Genetics, University of California, Los Angeles, CA, USA.
² Jonsson Comprehensive Cancer Center, University of California, Los Angeles, CA, USA.
³ Institute for Precision Health, University of California, Los Angeles, CA, USA.
⁴ Department of Urology, University of California, Los Angeles, CA, USA.
⁵ Department of Medical Biophysics, University of Toronto, Toronto, Canada.
⁶ Princess Margaret Cancer Centre, University Health Network, Toronto, Canada.
⁷ Vector Institute for Artificial Intelligence, Toronto, Canada.

Abstract

Gene expression is a multi-step transformation of biological information from its storage form (DNA) into functional forms (protein and some RNAs). Regulatory activities at each step of this transformation multiply a single gene into a myriad of proteoforms. Proteogenomics is the study of how genomic and transcriptomic variation creates this proteoform diversity, and is limited by the challenges of modeling the complexities of gene expression. We therefore created moPepGen, a graph-based algorithm that comprehensively enumerates proteoforms in linear time. moPepGen works with multiple technologies, in multiple species and on all types of genetic and transcriptomic data. In human cancer proteomes, it detects and quantifies previously unobserved noncanonical peptides arising from germline and somatic genomic variants, noncoding open reading frames, RNA fusions and RNA circularization. By enabling efficient identification and quantitation of previously hidden proteins in both existing and new proteomic data, moPepGen facilitates all proteogenomics applications. It is available at: https://github.com/uclahs-cds/package-moPepGen.

Publication types

Preprint
