Biosurfer for systematic tracking of regulatory mechanisms leading to protein isoform diversity

bioRxiv [Preprint]. 2024 Mar 17:2024.03.15.585320. doi: 10.1101/2024.03.15.585320.

Abstract

Long-read RNA sequencing has shed light on transcriptomic complexity, but questions remain about the functionality of downstream protein products. We introduce Biosurfer, a computational approach for comparing protein isoforms while systematically tracking the transcriptional, splicing, and translational variations that underlie differences in the sequences of the protein products. Using Biosurfer, we analyzed the differences in 32,799 pairs of GENCODE-annotated protein isoforms, finding that a majority (70%) of variable N-termini are due to alternative transcription start sites, while only 9% arise from 5' UTR alternative splicing. Biosurfer's detailed tracking of nucleotide-to-residue relationships helped reveal an uncommonly tracked source of single amino acid residue changes: codons that are split at junctions. For 17% of internal sequence changes, such split codon patterns lead to single residue differences, termed "ragged codons". Of variable C-termini, 72% involve splice- or intron retention-induced reading frameshifts. We found an unusual pattern of reading frame changes, in which the first frameshift is closely followed by a distinct second frameshift that restores the original frame, which we term a "snapback" frameshift. We analyzed the long-read RNA-seq-predicted proteome of a human cell line and found trends similar to those in our GENCODE analysis, with the exception of a higher proportion of isoforms predicted to undergo nonsense-mediated decay. Biosurfer's comprehensive characterization of long-read RNA-seq datasets should accelerate insights into the functional roles of protein isoforms, providing mechanistic explanations of the origins of the proteomic diversity driven by alternative splicing. Biosurfer is available as a Python package at https://github.com/sheynkman-lab/biosurfer.
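The two patterns named in the abstract, "ragged codons" and "snapback" frameshifts, can be illustrated with a minimal Python sketch. This is not Biosurfer's API: the function names (`translate`, `frame_track`, `find_snapbacks`), the `max_gap` threshold of 60 nucleotides, and the toy exon sequences are all hypothetical and chosen only for illustration.

```python
from itertools import product

# Standard genetic code, built in TCAG order ("*" marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(cds):
    """Translate a coding sequence codon by codon, ignoring trailing bases."""
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, usable, 3))


# Ragged codon (toy data, hypothetical): exon1 ends two bases into a codon,
# so the residue at the junction depends on which exon comes next.
exon1, exon2, exon3 = "ATGGCTAA", "GCCTGA", "TTGGTAA"
included = translate(exon1 + exon2 + exon3)  # exon2 retained
skipped = translate(exon1 + exon3)           # exon2 skipped, frame preserved
print(included)  # MAKPDW*
print(skipped)   # MANW*  (junction codon AAT encodes N, not K or D)


def frame_track(events):
    """Return (position, cumulative frame offset mod 3) after each event.

    `events` is a position-sorted list of (position, length_delta) tuples,
    where length_delta is the number of nucleotides gained (+) or lost (-)
    relative to the reference isoform.
    """
    track, shift = [], 0
    for position, delta in events:
        shift = (shift + delta) % 3
        track.append((position, shift))
    return track


def find_snapbacks(events, max_gap=60):
    """Find consecutive event pairs where the first leaves the reference
    frame and the second restores it within `max_gap` nucleotides.
    The threshold is an assumption made for this sketch."""
    track = frame_track(events)
    hits = []
    for i in range(1, len(track)):
        before = track[i - 2][1] if i >= 2 else 0
        (pos_a, shift_a), (pos_b, shift_b) = track[i - 1], track[i]
        if before == 0 and shift_a != 0 and shift_b == 0 and pos_b - pos_a <= max_gap:
            hits.append((pos_a, pos_b))
    return hits


print(find_snapbacks([(120, 1), (150, -1), (400, 3)]))  # [(120, 150)]
```

In the toy example, skipping the middle exon removes residues and also changes the single residue encoded at the junction, which is the kind of difference the abstract calls a ragged codon. The snapback helper only tracks frame offsets; a real analysis would also need to map events to transcript coordinates and check for premature stop codons between the two shifts.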

Keywords: Alternative splicing; GENCODE; LRS Special Issue; Transcription Initiation Site (TIS); alternative transcriptional start site (altTSS); intron retention; long-read sequencing; nonsense mediated decay (NMD); open reading frame (ORF); poison exon; protein isoforms; protein sequence; reading frame shift; sequence alignment; transcriptional start site (TSS); transcriptional termination site (TTS).

Publication types

  • Preprint