A statistical basis for testing the significance of mass spectrometric protein identification results

J Eriksson; B T Chait; D Fenyö

doi:10.1021/ac990792j

Anal Chem. 2000 Mar 1;72(5):999-1005. doi: 10.1021/ac990792j.

Authors

J Eriksson¹, B T Chait, D Fenyö

Affiliation

¹ The Rockefeller University, New York, New York 10021, USA.

PMID: 10739204
DOI: 10.1021/ac990792j

Abstract

A method for testing the significance of mass spectrometric (MS) protein identification results is presented. MS proteolytic peptide mapping and genome database searching provide a rapid, sensitive, and potentially accurate means for identifying proteins. Database search algorithms detect matches between proteolytic peptide masses from an MS peptide map and theoretical proteolytic peptide masses of the proteins in a genome database. The number of matching masses is used to compute a score, S, for each protein, and the protein that yields the best score is taken as the identification result. There is a risk of obtaining a false result, because masses determined by MS are not unique; i.e., each mass in a peptide map can randomly match one or several proteins in a genome database. A false result is obtained when the score, S, due to random matching cannot be discerned from the score due to matching with a real protein in the sample. We therefore introduce the frequency function, f(S), for false (random) identification results as a basis for testing at what significance level, α, one can reject a null hypothesis, H0: "the result is false". The significance is tested by comparing an experimental score, S(E), with a critical score, S(C), required for a significant result at the level α. If S(E) ≥ S(C), H0 is rejected. f(S) and S(C) were obtained by simulations utilizing random tryptic peptide maps generated from a genome database. The critical score, S(C), was studied as a function of the number of masses in the peptide map, the mass accuracy, the degree of incomplete enzymatic cleavage, the protein mass range, and the size of the genome. With S(C) known for a variety of experimental constraints, significance testing can be fully automated and integrated with database searching software used for protein identification.
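
The testing procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and everything in it is an assumption made for illustration: the toy database of uniformly distributed masses (make_toy_database), the use of the raw match count as the score S, the way random peptide maps are drawn from the pooled database masses, and all function names and parameter values. Only the overall logic follows the abstract: collect best scores from random (false) searches to estimate f(S), derive the critical score S(C) for a chosen α, and reject H0 when S(E) ≥ S(C).

```python
import random


def make_toy_database(n_proteins, n_peptides, rng):
    """Hypothetical stand-in for a genome database: each protein is a list
    of peptide masses drawn uniformly from 800-3000 Da (an assumption)."""
    return [[rng.uniform(800.0, 3000.0) for _ in range(n_peptides)]
            for _ in range(n_proteins)]


def count_matches(map_masses, protein_masses, tol):
    """Score S, taken here as the number of masses in the peptide map that
    lie within tol (Da) of any theoretical mass of the protein."""
    return sum(any(abs(m - t) <= tol for t in protein_masses)
               for m in map_masses)


def best_score(map_masses, database, tol):
    """Best score over all proteins in the database."""
    return max(count_matches(map_masses, p, tol) for p in database)


def simulate_false_scores(database, n_masses, tol, n_trials, rng):
    """Sample of f(S): best scores obtained for random peptide maps.
    Each random map is drawn from the pooled masses of all proteins."""
    pool = [m for protein in database for m in protein]
    return [best_score(rng.sample(pool, n_masses), database, tol)
            for _ in range(n_trials)]


def critical_score(false_scores, alpha):
    """Smallest score S(C) such that the fraction of false scores that are
    >= S(C) does not exceed alpha."""
    n = len(false_scores)
    for s_c in sorted(set(false_scores)):
        if sum(s >= s_c for s in false_scores) / n <= alpha:
            return s_c
    # No observed score is rare enough; require more than the maximum.
    return max(false_scores) + 1


def is_significant(s_e, s_c):
    """Reject H0 ("the result is false") when S(E) >= S(C)."""
    return s_e >= s_c


if __name__ == "__main__":
    rng = random.Random(0)
    database = make_toy_database(n_proteins=100, n_peptides=15, rng=rng)
    false_scores = simulate_false_scores(database, n_masses=10, tol=0.5,
                                         n_trials=200, rng=rng)
    s_c = critical_score(false_scores, alpha=0.05)
    print("critical score S(C):", s_c)
    print("S(E) = 8 significant:", is_significant(8, s_c))
```

In this sketch, rerunning simulate_false_scores with a different number of masses, mass tolerance, or database size changes the resulting S(C), which mirrors the dependencies on experimental constraints that the abstract says were studied.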

Publication types

Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, P.H.S.

MeSH terms

Genome
Hydrolysis
Mass Spectrometry / standards*
Molecular Weight
Peptide Mapping
Proteins / chemistry*
Proteins / metabolism
Reproducibility of Results
Trypsin / metabolism

Substances

Proteins
Trypsin

Grants and funding

RR00862/RR/NCRR NIH HHS/United States