Machine Learning Strategy That Leverages Large Data sets to Boost Statistical Power in Small-Scale Experiments

J Proteome Res. 2020 Mar 6;19(3):1267-1274. doi: 10.1021/acs.jproteome.9b00780. Epub 2020 Feb 17.


Machine learning methods have proven invaluable for increasing the sensitivity of peptide detection in proteomics experiments. Most modern tools, such as Percolator and PeptideProphet, use semisupervised algorithms to learn models directly from the data sets that they analyze. Although these methods are effective for many proteomics experiments, we suspected that they may be suboptimal for smaller-scale experiments. In this work, we found that the power and consistency of Percolator results declined as the size of the experiment decreased. As an alternative, we propose a different operating mode for Percolator: learn a model with Percolator from a large data set and use the learned model to evaluate the small-scale experiment. We call this a "static modeling" approach, in contrast to Percolator's usual "dynamic modeling" approach, in which a model is trained anew for each data set. We applied static modeling in two settings: small, gel-based experiments and single-cell proteomics. In both cases, static models increased the yield of detected peptides and eliminated the model-induced variability of the standard dynamic approach. These results suggest that static models are a powerful tool for bringing the full benefits of Percolator and other semisupervised algorithms to small-scale experiments.
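
The contrast between the two operating modes can be illustrated with a minimal Python sketch. It is not Percolator's implementation. It assumes synthetic two-feature peptide-spectrum matches (PSMs), a logistic-regression scorer standing in for Percolator's support vector machine, and a simple decoys/targets estimate of the false discovery rate (FDR). All function names, parameters, and data below are hypothetical, and the sketch omits the cross-validation a real tool would need when it scores the same data it trained on.

```python
import math
import random


def make_psms(n, rng):
    """Simulate n PSMs as (features, is_decoy) pairs (hypothetical toy data)."""
    psms = []
    for _ in range(n):
        is_decoy = rng.random() < 0.5
        # Assumption: half of the target PSMs are correct, and only those
        # have shifted features; decoys and incorrect targets look alike.
        correct = (not is_decoy) and rng.random() < 0.5
        mean = 2.0 if correct else 0.0
        psms.append(([rng.gauss(mean, 1.0), rng.gauss(mean, 1.0)], is_decoy))
    return psms


def score(weights, feats):
    """Linear score: bias plus weighted sum of the features."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], feats))


def n_accepted(psms, weights, fdr):
    """Largest number of targets whose estimated FDR (decoys / targets) <= fdr."""
    ranked = sorted(psms, key=lambda p: score(weights, p[0]), reverse=True)
    targets = decoys = best = 0
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
            if decoys / targets <= fdr:
                best = targets
    return best


def train(psms, fdr=0.05, rounds=3, epochs=50, lr=0.1):
    """Semisupervised loop: confident targets vs. all decoys, refit, repeat."""
    weights = [0.0, 1.0, 0.0]  # start by ranking on the first feature alone
    for _ in range(rounds):
        k = n_accepted(psms, weights, fdr)
        ranked_targets = sorted((p for p in psms if not p[1]),
                                key=lambda p: score(weights, p[0]), reverse=True)
        positives = [p[0] for p in ranked_targets[:k]]
        if not positives:
            break  # too little data to find confident positives
        data = [(x, 1.0) for x in positives]
        data += [(p[0], 0.0) for p in psms if p[1]]
        for _ in range(epochs):  # batch gradient descent on the logistic loss
            grad = [0.0] * len(weights)
            for x, y in data:
                err = 1.0 / (1.0 + math.exp(-score(weights, x))) - y
                grad[0] += err
                for j, xj in enumerate(x):
                    grad[j + 1] += err * xj
            weights = [w - lr * g / len(data) for w, g in zip(weights, grad)]
    return weights


rng = random.Random(0)
large = make_psms(2000, rng)  # stands in for a large reference data set
small = make_psms(100, rng)   # stands in for a small-scale experiment

static_weights = train(large)   # learned once, then reused unchanged
dynamic_weights = train(small)  # relearned from the small experiment itself

print("static :", n_accepted(small, static_weights, 0.05))
print("dynamic:", n_accepted(small, dynamic_weights, 0.05))
```

In this sketch the static weights do not depend on the small experiment at all, so repeated analyses of it cannot vary with the training step. The dynamic weights depend on whichever confident targets the small data set happens to yield.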

Keywords: SVM; bioinformatics; confidence estimation; machine learning; peptide identification; Percolator; proteomics; single-cell mass spectrometry; support vector machine; tandem mass spectrometry.

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Algorithms
  • Databases, Protein
  • Machine Learning
  • Proteomics
  • Software*
  • Tandem Mass Spectrometry*