Highly accurate and consistent method for prediction of helix and strand content from primary protein sequences

Artif Intell Med. 2005 Sep-Oct;35(1-2):19-35. doi: 10.1016/j.artmed.2005.02.006.

Abstract

Objective: Prediction of protein secondary structure is one of the interesting computational topics in bioinformatics. More than 30 years of research have been devoted to it, yet reliable prediction methods remain out of reach. A critical piece of information for accurate secondary structure prediction is the helix and strand content of a given protein sequence, so the ability to predict the content of these two structures accurately has good potential to improve secondary structure prediction itself. Most existing methods predict the content from the composition vector, on the underlying assumption that this vector provides a functional mapping between the primary sequence and the helix/strand content. While this holds for small sets of proteins, we show that for larger protein sets such mappings are inconsistent, i.e., the same composition vector corresponds to different contents. We therefore propose a method for predicting helix/strand content from primary protein sequences that is fundamentally different from currently available methods.
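The inconsistency described above can be illustrated with a minimal sketch. It assumes the composition vector is simply the fraction of each of the 20 standard amino acids in the sequence; the function name `composition_vector` and the two toy sequences are hypothetical and are not taken from the paper's dataset. Because the vector ignores residue order, two sequences with the same residues in different positions are indistinguishable, even though their structures could differ.

```python
from collections import Counter

# The 20 standard amino acids, in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_vector(sequence):
    """Return the fraction of each standard amino acid in the sequence.

    Assumed definition for illustration: count of each residue type
    divided by the sequence length.
    """
    if not sequence:
        raise ValueError("sequence must not be empty")
    counts = Counter(sequence)
    length = len(sequence)
    return [counts[aa] / length for aa in AMINO_ACIDS]


# Two hypothetical toy sequences with identical residue counts
# but a different residue order.
seq_a = "AAAAKKKKLLLL"
seq_b = "AKLAKLAKLAKL"

# The composition vector cannot tell them apart.
same_vector = composition_vector(seq_a) == composition_vector(seq_b)
print(same_vector)
```

Any predictor that takes only this vector as input must assign both sequences the same helix/strand content, which is the kind of collision the abstract reports for larger protein sets.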

Methods and material: Our method is accurate and uses a novel approach to extract information from the primary sequence based on the composition moment vector, a measure that captures both the composition of a given primary sequence and the positions of the amino acids within it. We show that, in contrast to the composition vector, it provides a functional mapping between the primary sequence and the helix/strand content.
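The abstract does not give the formula for the composition moment vector, so the following is only a sketch of how a measure could combine composition with position. It assumes a hypothetical definition: for each amino acid, the sum of its 1-based positions raised to a chosen order, divided by the sequence length raised to that order plus one. The function name, the `order` parameter, and the normalization are all assumptions made for illustration and may differ from the definition used in the paper.

```python
# The 20 standard amino acids, in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_moment_vector(sequence, order=1):
    """Return a position-aware composition measure (assumed definition).

    For each amino acid, sum position**order over its occurrences
    (positions are 1-based) and divide by len(sequence)**(order + 1).
    With order=0 this reduces to the plain composition fractions.
    """
    length = len(sequence)
    if length == 0:
        raise ValueError("sequence must not be empty")
    totals = {aa: 0.0 for aa in AMINO_ACIDS}
    for position, aa in enumerate(sequence, start=1):
        if aa in totals:  # skip non-standard residue codes
            totals[aa] += position ** order
    scale = length ** (order + 1)
    return [totals[aa] / scale for aa in AMINO_ACIDS]


# Two hypothetical toy sequences with the same residue counts
# but a different residue order.
seq_a = "AAAAKKKKLLLL"
seq_b = "AKLAKLAKLAKL"

# Order 0 ignores position, so the vectors coincide.
same_at_order_0 = (
    composition_moment_vector(seq_a, order=0)
    == composition_moment_vector(seq_b, order=0)
)

# Order 1 weights each residue by its position, so the vectors differ.
same_at_order_1 = (
    composition_moment_vector(seq_a, order=1)
    == composition_moment_vector(seq_b, order=1)
)
print(same_at_order_0, same_at_order_1)
```

Under this assumed definition, sequences that collide on composition alone become distinguishable once position is taken into account, which is the property the abstract attributes to the new measure.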

Results: The method was validated with a set of benchmarks on a large dataset of over 11,000 protein sequences from the Protein Data Bank. Predictions made by a neural network had an average accuracy of 91.5% for the helix content and 94.5% for the strand content. We also show that using the new measure reduces error rates by about 40% compared with the composition vector.

Conclusions: The developed method is considerably more accurate than existing methods, as shown on a large body of proteins; other reported results, by contrast, often target small sets of specific protein types, such as globular proteins.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Amino Acid Sequence
  • Databases, Protein
  • Linear Models
  • Models, Molecular*
  • Neural Networks, Computer
  • Protein Conformation
  • Protein Structure, Secondary*
  • Sequence Alignment / methods
  • Sequence Analysis, Protein / methods*