Protein design involves searching a vast space for sequences that are compatible with a defined structure. This can pose significant computational challenges. Cluster expansion is a technique that can accelerate the evaluation of protein energies by generating a simple functional relationship between sequence and energy. The method consists of several steps. First, for a given protein structure, a training set of sequences with known energies is generated. Next, this training set is used to expand energy as a function of clusters consisting of single residues, residue pairs, and higher order terms, if required. The accuracy of the sequence-based expansion is monitored and improved using cross-validation testing and iterative inclusion of additional clusters. As a trade-off for evaluation speed, the cluster-expansion approximation causes prediction errors, which can be reduced by including more training sequences, including higher order terms in the expansion, and/or reducing the sequence space described by the cluster expansion. This article analyzes the sources of error and introduces a method whereby accuracy can be improved by judiciously reducing the described sequence space. The method is applied to describe the sequence-stability relationship for several protein structures: coiled-coil dimers and trimers, a PDZ domain, and T4 lysozyme as examples with computationally derived energies, and SH3 domains in amphiphysin-1 and endophilin-1 as examples where the expanded pseudo-energies are obtained from experiments. Our open-source software package Cluster Expansion Version 1.0 allows users to expand their own energy function of interest and thereby apply cluster expansion to custom problems in protein design.
© 2010 Wiley Periodicals, Inc.