Strategies for high-throughput comparative modeling: applications to leverage analysis in structural genomics and protein family organization

Proteins. 2007 Mar 1;66(4):766-77. doi: 10.1002/prot.21191.

Abstract

The technological breakthroughs in structural genomics were designed to facilitate the solution of a sufficient number of structures, so that as many protein sequences as possible can be structurally characterized with the aid of comparative modeling. The leverage of a solved structure is the number and quality of the models that can be produced using the structure as a template for modeling and may be viewed as the "currency" with which the success of a structural genomics endeavor can be measured. Moreover, the models obtained in this way should be valuable to all biologists. To this end, at the Northeast Structural Genomics Consortium (NESG), a modular computational pipeline for automated high-throughput leverage analysis was devised and used to assess the leverage of the 186 unique NESG structures solved during the first phase of the Protein Structure Initiative (January 2000 to July 2005). Here, the results of this analysis are presented. The number of sequences in the nonredundant protein sequence database covered by quality models produced by the pipeline is approximately 39,000, so that the average leverage is approximately 210 models per structure. Interestingly, only 7,900 of these models fulfill the stringent modeling criterion of being at least 30% sequence-identical to the corresponding NESG structures. This study shows how high-throughput modeling increases the efficiency of structure determination efforts by providing enhanced coverage of protein structure space. In addition, the approach is useful in refining the boundaries of structural domains within larger protein sequences, subclassifying sequence-diverse protein families, and defining structure-based strategies specific to a particular family.
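
The leverage figures quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is only an illustration: the three counts are the approximate values stated in the abstract, and the function and variable names are hypothetical and are not taken from the NESG pipeline.

```python
def average_leverage(n_models: int, n_structures: int) -> float:
    """Mean number of models per solved structure (hypothetical helper)."""
    if n_structures <= 0:
        raise ValueError("n_structures must be positive")
    return n_models / n_structures


# Approximate counts as stated in the abstract.
N_STRUCTURES = 186          # unique NESG structures, PSI phase one
N_QUALITY_MODELS = 39_000   # sequences covered by quality models
N_HIGH_IDENTITY = 7_900     # models with >= 30% identity to the template

# Average leverage over all quality models.
avg_all = average_leverage(N_QUALITY_MODELS, N_STRUCTURES)

# Average leverage if only the stringent >= 30% identity models are counted.
avg_strict = average_leverage(N_HIGH_IDENTITY, N_STRUCTURES)

# Share of quality models that meet the stringent criterion.
strict_fraction = N_HIGH_IDENTITY / N_QUALITY_MODELS

print(f"average leverage, all quality models: {avg_all:.0f}")
print(f"average leverage, >=30% identity only: {avg_strict:.0f}")
print(f"fraction meeting >=30% identity: {strict_fraction:.1%}")
```

Because the input counts are rounded, the results are approximate as well. They are consistent with the roughly 210 models per structure reported above, and they indicate that about one fifth of the quality models meet the 30% identity criterion.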

Publication types

  • Comparative Study
  • Research Support, N.I.H., Extramural

MeSH terms

  • Computational Biology / statistics & numerical data
  • Crystallography, X-Ray
  • Genomics
  • Models, Biological*
  • Nuclear Magnetic Resonance, Biomolecular
  • Protein Conformation
  • Proteins / chemistry
  • Proteins / classification
  • Proteins / genetics*
  • Proteins / metabolism*

Substances

  • Proteins