ProteomeGenerator: A Framework for Comprehensive Proteomics Based on de Novo Transcriptome Assembly and High-Accuracy Peptide Mass Spectral Matching

J Proteome Res. 2018 Nov 2;17(11):3681-3692. doi: 10.1021/acs.jproteome.8b00295. Epub 2018 Oct 19.


Modern mass spectrometry now permits genome-scale and quantitative measurements of biological proteomes. However, analysis of specific specimens is currently hindered by the incomplete representation of biological variability of protein sequences in canonical reference proteomes and the technical demands for their construction. Here, we report ProteomeGenerator, a framework for de novo and reference-assisted proteogenomic database construction and analysis based on sample-specific transcriptome sequencing and high-accuracy mass spectrometry proteomics. This enables the assembly of proteomes encoded by actively transcribed genes, including sample-specific protein isoforms resulting from non-canonical mRNA transcription, splicing, or editing. To improve the accuracy of protein isoform identification in non-canonical proteomes, ProteomeGenerator relies on statistical target-decoy database matching calibrated using sample-specific controls. Its current implementation includes automatic integration with MaxQuant mass spectrometry proteomics algorithms. We applied this method for the proteogenomic analysis of splicing factor SRSF2 mutant leukemia cells, demonstrating high-confidence identification of non-canonical protein isoforms arising from alternative transcriptional start sites, intron retention, and cryptic exon splicing as well as improved accuracy of genome-scale proteome discovery. Additionally, we report proteogenomic performance metrics for current state-of-the-art implementations of SEQUEST HT, MaxQuant, Byonic, and PEAKS mass spectral analysis algorithms. Finally, ProteomeGenerator is implemented as a Snakemake workflow within a Singularity container for one-step installation in diverse computing environments, thereby enabling open, scalable, and facile discovery of sample-specific, non-canonical, and neomorphic biological proteomes.
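The target-decoy matching mentioned above can be illustrated with a minimal sketch. It assumes the generic textbook estimate, in which the false discovery rate at a score threshold is approximated by the ratio of decoy to target matches at or above that threshold. It is not ProteomeGenerator's sample-specific calibration or MaxQuant's implementation; the function names, scores, and the 1% cutoff are hypothetical and chosen only for illustration.

```python
from typing import List, Optional, Tuple

# Each peptide-spectrum match (PSM) is (score, is_decoy). Hypothetical data.
PSM = Tuple[float, bool]


def target_decoy_fdr(psms: List[PSM], threshold: float) -> float:
    """Estimate FDR at a score threshold as (#decoys / #targets) at or above it."""
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    if targets == 0:
        # No target matches pass: treat any surviving decoy as FDR = 1.
        return 1.0 if decoys else 0.0
    return min(1.0, decoys / targets)


def lowest_threshold_at_fdr(psms: List[PSM], max_fdr: float = 0.01) -> Optional[float]:
    """Return the lowest observed score whose estimated FDR is <= max_fdr."""
    for threshold in sorted({score for score, _ in psms}):
        if target_decoy_fdr(psms, threshold) <= max_fdr:
            return threshold
    return None


# Illustrative scores only; not taken from the paper.
psms = [
    (95.0, False),
    (90.0, False),
    (85.0, True),
    (80.0, False),
    (70.0, True),
    (60.0, False),
]

fdr_at_80 = target_decoy_fdr(psms, 80.0)      # 1 decoy / 3 targets
cutoff = lowest_threshold_at_fdr(psms, 0.01)  # first threshold with no decoys above it
```

Because the estimated FDR is not guaranteed to fall monotonically as the threshold rises, production tools typically convert these estimates to q-values; this sketch omits that step for brevity.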

Keywords: de novo database construction; peptide fractionation; peptide-spectral matching; protein isoform analysis; proteogenomics; scoring function; transcriptomics.

MeSH terms

  • Algorithms*
  • Alternative Splicing
  • Amino Acid Sequence
  • Cell Line, Tumor
  • Humans
  • Leukocytes / metabolism
  • Leukocytes / pathology
  • Mass Spectrometry / statistics & numerical data
  • Molecular Sequence Annotation
  • Mutation
  • Peptide Mapping / statistics & numerical data
  • Peptides / chemistry*
  • Peptides / classification
  • Peptides / isolation & purification
  • Proteogenomics / methods
  • Proteogenomics / statistics & numerical data
  • Proteome
  • Proteomics / methods*
  • RNA, Messenger / genetics*
  • RNA, Messenger / metabolism
  • Serine-Arginine Splicing Factors / genetics
  • Serine-Arginine Splicing Factors / metabolism
  • Software*
  • Transcriptome*


Substances

  • Peptides
  • Proteome
  • RNA, Messenger
  • SRSF2 protein, human
  • Serine-Arginine Splicing Factors