J Proteome Res. 2018 Nov 2;17(11):3681-3692.
doi: 10.1021/acs.jproteome.8b00295. Epub 2018 Oct 19.

ProteomeGenerator: A Framework for Comprehensive Proteomics Based on De Novo Transcriptome Assembly and High-Accuracy Peptide Mass Spectral Matching

Free PMC article

Paolo Cifani et al. J Proteome Res.

Abstract

Modern mass spectrometry now permits genome-scale and quantitative measurements of biological proteomes. However, analysis of specific specimens is currently hindered by the incomplete representation of biological variability of protein sequences in canonical reference proteomes and the technical demands for their construction. Here, we report ProteomeGenerator, a framework for de novo and reference-assisted proteogenomic database construction and analysis based on sample-specific transcriptome sequencing and high-accuracy mass spectrometry proteomics. This enables the assembly of proteomes encoded by actively transcribed genes, including sample-specific protein isoforms resulting from non-canonical mRNA transcription, splicing, or editing. To improve the accuracy of protein isoform identification in non-canonical proteomes, ProteomeGenerator relies on statistical target-decoy database matching calibrated using sample-specific controls. Its current implementation includes automatic integration with MaxQuant mass spectrometry proteomics algorithms. We applied this method for the proteogenomic analysis of splicing factor SRSF2 mutant leukemia cells, demonstrating high-confidence identification of non-canonical protein isoforms arising from alternative transcriptional start sites, intron retention, and cryptic exon splicing as well as improved accuracy of genome-scale proteome discovery. Additionally, we report proteogenomic performance metrics for current state-of-the-art implementations of SEQUEST HT, MaxQuant, Byonic, and PEAKS mass spectral analysis algorithms. Finally, ProteomeGenerator is implemented as a Snakemake workflow within a Singularity container for one-step installation in diverse computing environments, thereby enabling open, scalable, and facile discovery of sample-specific, non-canonical, and neomorphic biological proteomes.

Keywords: de novo database construction; peptide fractionation; peptide–spectral matching; protein isoform analysis; proteogenomics; scoring function; transcriptomics.
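
The statistical target-decoy matching mentioned in the abstract can be illustrated with a short sketch. This is not the ProteomeGenerator or MaxQuant implementation; it assumes the simplest common estimate, in which the false discovery rate at a score cutoff is approximated by the ratio of decoy to target matches at or above that cutoff. The function name `fdr_threshold` and the `(score, is_decoy)` input format are hypothetical, and tied scores are not treated specially.

```python
from typing import List, Tuple


def fdr_threshold(psms: List[Tuple[float, bool]], max_fdr: float = 0.01) -> float:
    """Return the lowest score cutoff whose estimated FDR stays within max_fdr.

    psms: (score, is_decoy) pairs, where a higher score means a better match.
    The FDR at a cutoff is estimated as decoys / targets among matches scoring
    at or above the cutoff (a simplifying assumption, not the paper's method).
    Returns float("inf") when no cutoff satisfies the limit.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = 0
    decoys = 0
    cutoff = float("inf")
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Accept this score as the cutoff if the running estimate is acceptable.
        if targets > 0 and decoys / targets <= max_fdr:
            cutoff = score
    return cutoff


if __name__ == "__main__":
    example = [(10.0, False), (9.0, False), (8.0, True), (7.0, False)]
    print(fdr_threshold(example, max_fdr=0.01))
```

Matches from the target database scoring at or above the returned cutoff would then be reported. Calibration against sample-specific controls, as described in the abstract, is a further step not shown here.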

Conflict of interest statement

The authors declare no competing financial interest.

Figures

Figure 1.
ProteomeGenerator overview. Transcriptomes and proteomes from the same biological sample are analyzed in parallel by high-coverage Illumina sequencing and high-resolution, high-accuracy mass spectrometry, respectively. ProteomeGenerator assembles FASTQ-formatted mRNA sequencing reads into predicted transcripts, identifies reading frames and isoforms, and produces FASTA-formatted proteogenomic (PGX) databases containing canonical and non-canonical expressed protein isoforms for subsequent mass spectrometry searches.
Figure 2.
Schema for the ProteomeGenerator Snakemake workflow. Sequencing reads are aligned using STAR, followed by de novo or reference-assisted assembly into transcriptomes using StringTie and processing to identify reading frames and protein isoforms. The resulting protein database is set as the target for peptide–mass spectral matching using MaxQuant.
Figure 3.
Comparison of the canonical and proteogenomic protein databases, displaying the number of (A) protein entries and (B) theoretical tryptic peptides amenable to mass spectrometry analysis, specific to UniProt, specific to PGX, or shared by both.
Figure 4.
Sensitivity and specificity of mass spectrometry search algorithms. (A, B) Comparison of unique theoretical peptides in the experimental PGX proteome, canonical UniProt, and bacterial proteomes used as negative controls. (C) Sensitivity of tested algorithms expressed as the number of identified peptides. (D) Specificity of tested algorithms evaluated from the fraction of peptide–spectrum matches mapped to the negative controls. The PSM fraction mapped to A. loki is reported both in absolute terms (black) and normalized to take into account the relative sizes of the human and archaebacterial proteomes (shown in gray). Normalization was performed by multiplying the number of human peptides by the ratio of the A. loki and H. sapiens databases, expressed as the number of tryptic peptides.
Figure 5.
Accurate proteome discovery using statistical target–decoy matching with spectral calibration. (A) Number of peptides identified (FDR < 0.01) based on matching spectra from K052 proteome against proteogenomic (PGX, red) and canonical (UniProt, gray) databases. (B) Overlap between the peptides identified in PGX (red) and UniProt (gray) databases. (C) Comparison of PEAKS scores for peptides identified in both PGX and UniProt databases. (D) PEAKS score distribution for peptides identified exclusively in PGX (red) and UniProt (gray) databases. (E) For peptides exclusively identified against the PGX database, PEAKS score distributions for peptides not mapping in UniProt (red) or present in the canonical database (gray). Boxes delimit the 25th and 75th percentiles, the middle line corresponds to the median, and whiskers correspond to the 5th and 95th percentiles. (F) PEAKS score distributions for peptides identified exclusively in PGX but also mapping in UniProt (gray) or exclusively mapping in the PGX database (red).
Figure 6.
Identification of non-canonical protein isoforms using ProteomeGenerator. (A) Genome tracks of non-canonical APEH isoform generation via inclusion of an intronic sequence normally spliced out of the canonical APEH isoform. (B) The K052-specific isoform of APEH contains a novel N-terminal sequence, with the splicing junction encompassed by peptide AGPDPGVSPAQVLLSEPEEAAALYR. Residues 35–276 of the protein sequence defined by ProteomeGenerator are identical to residues 4–245 of the canonical UniProt protein sequence. (C) Fragmentation spectrum of the peptide encompassing the novel splice junction, with diagnostic fragment ions and amino acid residues labeled. Italicized ion labels indicate ions with relative intensity above 25% of the maximum. Asterisks denote internal ions. (D) Peptide abundance based on total fragment ion current for all identified APEH peptides (red: peptides from K052-specific sequence; gray: peptides from canonical sequence).
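
Figures 3 and 4 compare databases by their theoretical tryptic peptides. A minimal sketch of such a comparison follows. It assumes the simplified trypsin rule (cleavage after K or R unless the next residue is P), no missed cleavages, and an illustrative 7 to 30 residue length window. The function names and the length limits are hypothetical and are not taken from the paper.

```python
import re
from typing import Dict, Iterable, Set


def tryptic_peptides(protein: str, min_len: int = 7, max_len: int = 30) -> Set[str]:
    """In silico digest: cleave after K or R, except when followed by P."""
    # The zero-width split keeps the K/R residue at the end of each peptide.
    pieces = re.split(r"(?<=[KR])(?!P)", protein)
    return {p for p in pieces if min_len <= len(p) <= max_len}


def compare_databases(
    pgx: Iterable[str],
    uniprot: Iterable[str],
    min_len: int = 7,
    max_len: int = 30,
) -> Dict[str, Set[str]]:
    """Partition theoretical peptides into PGX-only, UniProt-only, and shared."""
    pgx_peps: Set[str] = set()
    for seq in pgx:
        pgx_peps |= tryptic_peptides(seq, min_len, max_len)
    uni_peps: Set[str] = set()
    for seq in uniprot:
        uni_peps |= tryptic_peptides(seq, min_len, max_len)
    return {
        "pgx_only": pgx_peps - uni_peps,
        "uniprot_only": uni_peps - pgx_peps,
        "shared": pgx_peps & uni_peps,
    }


if __name__ == "__main__":
    # Toy sequences, not real proteins.
    result = compare_databases(["AAAKGGGK"], ["GGGKCCCR"], min_len=1)
    for name, peptides in result.items():
        print(name, sorted(peptides))
```

Counting the members of each set gives the kind of per-database totals shown in Figure 3B. A real analysis would also need to consider missed cleavages and isobaric residues such as leucine and isoleucine.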
