Scientific workflow optimization for improved peptide and protein identification

BMC Bioinformatics. 2015 Sep 3;16(1):284. doi: 10.1186/s12859-015-0714-x.

Abstract

Background: Peptide-spectrum matching is a common step in most data processing workflows for mass spectrometry-based proteomics. Many algorithms and software packages, both free and commercial, have been developed to address this task. However, these algorithms typically require the user to select instrument- and sample-dependent parameters, such as mass measurement error tolerances and number of missed enzymatic cleavages. In order to select the best algorithm and parameter set for a particular dataset, in-depth knowledge about the data as well as the algorithms themselves is needed. Most researchers therefore tend to use default parameters, which are not necessarily optimal.
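As a rough illustration of the two parameters named above, the following Python sketch shows how a mass measurement error tolerance (here expressed in ppm) and a missed-cleavage limit affect which candidate peptides a search can consider. The function names and the simplified trypsin rule (cleave after K or R unless followed by P) are assumptions made for illustration and are not taken from the article or from any particular search engine.

```python
import re


def ppm_error(observed_mz, theoretical_mz):
    """Relative mass measurement error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def within_tolerance(observed_mz, theoretical_mz, tol_ppm):
    """True if the observed m/z lies within +/- tol_ppm of the theoretical m/z."""
    return abs(ppm_error(observed_mz, theoretical_mz)) <= tol_ppm


def tryptic_peptides(sequence, max_missed_cleavages):
    """In-silico digest with a simplified trypsin rule (assumption):
    cleave C-terminal to K or R unless the next residue is P."""
    sites = [m.end() for m in re.finditer(r"[KR](?!P)", sequence)]
    # Fragment boundaries: start, every internal cleavage site, end.
    bounds = [0] + [s for s in sites if s < len(sequence)] + [len(sequence)]
    peptides = []
    for i in range(len(bounds) - 1):
        # Spanning k extra boundaries corresponds to k missed cleavages.
        last = min(i + 2 + max_missed_cleavages, len(bounds))
        for j in range(i + 1, last):
            peptides.append(sequence[bounds[i]:bounds[j]])
    return peptides
```

Raising either parameter enlarges the set of candidate peptides per spectrum, which is one reason why the best setting depends on the instrument and the sample.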

Results: We have applied a new optimization framework for the Taverna scientific workflow management system (http://ms-utils.org/Taverna_Optimization.pdf) to find the best combination of parameters for a given scientific workflow to perform peptide-spectrum matching. The optimizations themselves are non-trivial, as demonstrated by several phenomena that can be observed when allowing for larger mass measurement errors in sequence database searches. On-the-fly parameter optimization embedded in scientific workflow management systems enables experts and non-experts alike to extract the maximum amount of information from the data. The same workflows could be used for exploring the parameter space and comparing algorithms, not only for peptide-spectrum matching, but also for other tasks, such as retention time prediction.
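The idea of searching for the best parameter combination can be sketched in a few lines of Python. This is a minimal sketch that uses an exhaustive grid search, which is a deliberately simple stand-in and not necessarily the search strategy used by the Taverna optimization framework. The names `grid_search`, `toy_workflow`, `precursor_tol_ppm` and `max_missed_cleavages` are hypothetical, and the toy objective merely stands in for a real workflow run that would return, for example, the number of confident peptide-spectrum matches.

```python
from itertools import product


def grid_search(run_workflow, parameter_grid):
    """Evaluate every parameter combination and return the best one.

    run_workflow: callable taking the parameters as keyword arguments and
                  returning a score to maximize (hypothetical interface).
    parameter_grid: dict mapping parameter name -> list of candidate values.
    """
    names = list(parameter_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(parameter_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = run_workflow(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_workflow(precursor_tol_ppm, max_missed_cleavages):
    """Toy objective (assumption): peaks at 10 ppm and 2 missed cleavages."""
    return -abs(precursor_tol_ppm - 10) - abs(max_missed_cleavages - 2)


if __name__ == "__main__":
    grid = {
        "precursor_tol_ppm": [5, 10, 20, 50],
        "max_missed_cleavages": [0, 1, 2, 3],
    }
    print(grid_search(toy_workflow, grid))
```

An exhaustive grid becomes expensive quickly as parameters are added, since every combination requires a full workflow run; that cost is one motivation for embedding a dedicated optimizer in the workflow system.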

Conclusion: Using the optimization framework, we were able to learn both how the data were acquired and how the explored algorithms behave. We observed that many ammonia-loss b-ion spectra were identified as peptides with N-terminal pyroglutamate and a large precursor mass measurement error. These insights could be gained only because the optimization framework explored mass measurement error tolerances beyond the commonly used range.

MeSH terms

  • Algorithms*
  • Computational Biology / methods
  • Computational Biology / standards*
  • Databases, Protein
  • Humans
  • Mass Spectrometry
  • Models, Statistical
  • Peptide Fragments / analysis*
  • Peptide Fragments / chemistry
  • Programming Languages
  • Proteins / analysis*
  • Proteins / chemistry
  • Proteomics / methods*
  • Software*
  • User-Computer Interface
  • Workflow*

Substances

  • Peptide Fragments
  • Proteins