Recognizing millions of consistently unidentified spectra across hundreds of shotgun proteomics datasets

Johannes Griss; Yasset Perez-Riverol; Steve Lewis; David L Tabb; José A Dianes; Noemi Del-Toro; Marc Rurik; Mathias W Walzer; Oliver Kohlbacher; Henning Hermjakob; Rui Wang; Juan Antonio Vizcaíno

doi:10.1038/nmeth.3902

Nat Methods. 2016 Aug;13(8):651-656. doi: 10.1038/nmeth.3902. Epub 2016 Jun 27.

Affiliations

¹ Division of Immunology, Allergy and Infectious Diseases, Department of Dermatology, Medical University of Vienna, Austria; European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom.
² European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom.
³ Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville.
⁴ Dept. of Computer Science, University of Tübingen, Germany; Center for Bioinformatics, University of Tübingen, Germany.
⁵ Dept. of Computer Science, University of Tübingen, Germany; Center for Bioinformatics, University of Tübingen, Germany; Quantitative Biology Center, University of Tübingen, Germany; Max Planck Institute for Developmental Biology, Germany.
⁶ European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom; National Center for Protein Sciences, Beijing, China.

Abstract

Mass spectrometry (MS) is the main technology used in proteomics approaches. However, on average 75% of spectra analysed in an MS experiment remain unidentified. We propose to use spectrum clustering at a large scale to shed light on these unidentified spectra. The PRoteomics IDEntifications database (PRIDE) Archive is one of the largest public MS proteomics data repositories worldwide. By clustering all tandem MS spectra publicly available in PRIDE Archive, drawn from hundreds of datasets, we were able to consistently characterize three distinct groups of spectra: (1) incorrectly identified spectra, (2) spectra correctly identified but below the set scoring threshold, and (3) truly unidentified spectra. Using a multitude of complementary analysis approaches, we were able to identify less than 20% of the consistently unidentified spectra. The complete spectrum clustering results are available through the new version of the PRIDE Cluster resource (http://www.ebi.ac.uk/pride/cluster). This resource is intended, among other aims, to encourage and simplify further investigation into these unidentified spectra.
