Estimating and improving protein interaction error rates

Proc IEEE Comput Syst Bioinform Conf. 2004:216-23. doi: 10.1109/csb.2004.1332435.


High-throughput protein interaction data sets have proven to be notoriously noisy. Although it is possible to focus on interactions with higher reliability by using only those that are backed up by two or more lines of evidence, this approach invariably throws out the majority of available data. The data could be used more effectively by incorporating the probabilities associated with all available interactions into the analysis. We present a novel method for estimating error rates associated with specific protein interaction data sets, as well as with individual interactions given the data sets in which they appear. As a by-product, the method also yields an estimate of the total number of protein interactions in yeast. Certain types of false-positive results can be identified and removed, resulting in a significant improvement in the quality of the data set. For co-purification data sets, we show how to reach a tradeoff between the "spoke" and "matrix" representations of interactions within co-purified groups of proteins to achieve an optimal false-positive error rate.
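The two representations named in the abstract can be illustrated with a minimal sketch. It assumes the usual definitions: the spoke model pairs the bait only with each co-purified protein, while the matrix model pairs every protein in the purified group with every other. The function names and the protein labels "A" to "D" are hypothetical, and the sketch does not reproduce the tradeoff procedure of the paper; it only shows how the candidate interaction sets differ.

```python
from itertools import combinations


def spoke_pairs(bait, preys):
    """Spoke model: the bait is paired with each co-purified prey."""
    return {frozenset((bait, prey)) for prey in preys if prey != bait}


def matrix_pairs(bait, preys):
    """Matrix model: every pair of proteins in the purified group."""
    members = sorted({bait, *preys})
    return {frozenset(pair) for pair in combinations(members, 2)}


# Hypothetical purification: bait "A" pulls down "B", "C" and "D".
group_bait = "A"
group_preys = ["B", "C", "D"]

spoke = spoke_pairs(group_bait, group_preys)
matrix = matrix_pairs(group_bait, group_preys)

# Pairs proposed only by the matrix model (prey-prey pairs).
matrix_only = matrix - spoke
```

For a group of n proteins the spoke model proposes n - 1 pairs and the matrix model n(n - 1)/2, so the matrix model covers more candidate interactions at the cost of admitting prey-prey pairs that were never directly tested.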

Publication types

  • Comparative Study
  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Algorithms*
  • Computer Simulation
  • Data Interpretation, Statistical
  • Gene Expression Profiling / methods*
  • Models, Biological*
  • Models, Statistical
  • Protein Interaction Mapping / methods*
  • Proteins / metabolism*
  • Reproducibility of Results
  • Sensitivity and Specificity
  • Two-Hybrid System Techniques


Substances

  • Proteins