Motivation: Granzyme B (GrB) and caspases cleave specific protein substrates to induce apoptosis in virally infected and neoplastic cells. While substrates for both types of proteases have been determined experimentally, there are many more yet to be discovered in humans and other metazoans. Here, we present a bioinformatics method based on support vector machine (SVM) learning that identifies sequence and structural features important for protease recognition of substrate peptides and then uses these features to predict novel substrates. Our approach can act as a convenient hypothesis generator, guiding future experiments by high-confidence identification of peptide-protein partners.
Results: The method is benchmarked on the known substrates of both protease types, including our literature-curated GrB substrate set (GrBah). On these benchmark sets, the method outperforms a number of other methods that consider sequence only, achieving a true positive rate (TPR) of 0.87 at a false positive rate (FPR) of 0.13 for caspase substrates, and a TPR of 0.79 at an FPR of 0.21 for GrB substrates. The method is then applied to approximately 25 000 proteins in the human proteome to generate a ranked list of predicted substrates of each protease type. Two of these predictions, AIF-1 and SMN1, were selected for further experimental analysis, and each was validated as a GrB substrate.
Availability: All predictions for both protease types are publicly available at http://salilab.org/peptide. A web server at the same site allows a user to train new SVM models and make predictions for any protein that recognizes specific oligopeptide ligands.
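
For readers unfamiliar with the benchmark metrics quoted in the Results, the following is a minimal sketch of how a TPR and FPR could be computed from classifier scores at a fixed decision threshold. It is not the authors' evaluation code: the function name `tpr_fpr`, the threshold of 0.0, and the labels and scores are all hypothetical, chosen only to illustrate the definitions.

```python
def tpr_fpr(labels, scores, threshold=0.0):
    """Return (TPR, FPR) for binary labels and real-valued scores.

    labels: 1 for a known substrate peptide, 0 for a non-substrate.
    scores: classifier outputs (e.g. SVM decision values); a peptide is
            predicted to be a substrate when score >= threshold.
    The threshold of 0.0 is an assumption made for this illustration.
    """
    tp = fp = fn = tn = 0
    for label, score in zip(labels, scores):
        predicted_positive = score >= threshold
        if label and predicted_positive:
            tp += 1
        elif label:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    # TPR = TP / (TP + FN); FPR = FP / (FP + TN)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


# Hypothetical toy data: four substrates followed by four non-substrates.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [1.2, 0.4, -0.3, 0.9, -1.1, 0.2, -0.7, -0.5]
tpr, fpr = tpr_fpr(labels, scores)
print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}")
```

Sweeping the threshold across the range of scores traces out the trade-off between the two rates, so a reported TPR/FPR pair corresponds to one chosen operating point.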