Comparative Study
2008 Dec 15;24(24):2857-64.
doi: 10.1093/bioinformatics/btn546. Epub 2008 Oct 20.

Prediction of Kinase-Specific Phosphorylation Sites Using Conditional Random Fields

Free PMC article
Thanh Hai Dang et al. Bioinformatics.

Abstract

Motivation: Phosphorylation is a crucial post-translational protein modification mechanism with important regulatory functions in biological systems. It is catalyzed by a group of enzymes called kinases, each of which recognizes certain target sites in its substrate proteins. Several authors have built computational models trained from sets of experimentally validated phosphorylation sites to predict these target sites for each given kinase. All of these models suffer from certain limitations, such as the fact that they do not take into account the dependencies between amino acid motifs within protein sequences in a global fashion.

Results: We propose a novel approach to predict phosphorylation sites from the protein sequence. The method uses a positive dataset to train a conditional random field (CRF) model. The negative training dataset is used to specify the decision threshold corresponding to a desired false positive rate. Application of the method to experimentally verified benchmark phosphorylation data (Phospho.ELM) shows that it performs well compared to existing methods for most kinases. To our knowledge, this is the first report of the use of CRFs to predict post-translational modification sites in protein sequences.
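
The thresholding step described above can be illustrated with a minimal sketch. It assumes that a trained model assigns each candidate site a numeric score and that a site is predicted as phosphorylated when its score is strictly above the threshold. The function names `threshold_for_fpr` and `observed_fpr` and the example scores are hypothetical; the abstract does not specify the exact procedure used in CRPhos.

```python
import math


def threshold_for_fpr(negative_scores, target_fpr):
    """Pick a threshold so that at most `target_fpr` of the negatives score strictly above it.

    Assumption: higher scores mean "more likely a phosphorylation site".
    """
    if not negative_scores:
        raise ValueError("negative_scores must not be empty")
    if not 0.0 <= target_fpr <= 1.0:
        raise ValueError("target_fpr must lie in [0, 1]")
    ranked = sorted(negative_scores, reverse=True)
    # Number of negatives that may be (wrongly) predicted positive.
    allowed = math.floor(target_fpr * len(ranked))
    if allowed >= len(ranked):
        return float("-inf")  # every negative may be called positive
    # The (allowed + 1)-th highest negative score becomes the threshold;
    # only scores strictly above it are predicted positive.
    return ranked[allowed]


def observed_fpr(negative_scores, threshold):
    """Fraction of negatives whose score is strictly above the threshold."""
    return sum(s > threshold for s in negative_scores) / len(negative_scores)


if __name__ == "__main__":
    # Hypothetical scores for ten known non-phosphorylated sites.
    negatives = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
    t = threshold_for_fpr(negatives, target_fpr=0.2)
    print(t, observed_fpr(negatives, t))
```

In this sketch the negatives never influence the model itself, only the cut-off, which mirrors the division of roles between the positive and negative datasets stated in the abstract. Tied scores make the observed false positive rate fall below the target rather than above it.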

Availability: The source code of the implementation, called CRPhos, is available from http://www.ptools.ua.ac.be/CRPhos/

Figures

Fig. 1.
Method for transforming an amino acid sequence to a data object of the central amino acid.
Fig. 2.
Relation between expected and observed specificity values of the obtained predictor. All lines are generated using linear regression.
Fig. 3.
ROC curves of our method for some well-studied kinases, using 10-fold cross-validation (CRPhos). CRF* stands for the equivalent curve for a CRF model learned from both the positive and negative training dataset. For comparison, corresponding performance measures reported in literature are shown: PPSP (Xue et al., 2006), Scansite (Obenauer et al., 2003), NetPhosK (Blom et al., 2004), KinasePhos 1.0 (Huang et al., 2005a), KinasePhos 2.0 (Wong et al., 2007), GPS (Zhou et al., 2004) and PredPhospho (Kim et al., 2004).
Fig. 4.
Performance of CRPhos on the testing dataset created according to the scheme in Wan et al. (2008). The data remaining after removing this testing data from Phospho.ELM v.07 were used to train CRPhos. The performance measures of other existing methods, reported by Wan et al. (2008), are shown for comparison.

References

    1. Blom N, et al. Sequence and structure-based prediction of eukaryotic protein phosphorylation sites. J. Mol. Biol. 1999;294:1351–1362.
    2. Blom N, et al. Prediction of post-translational glycosylation and phosphorylation of proteins from the amino acid sequence. Proteomics. 2004;4:1633–1649.
    3. Boeckmann B, et al. The Swiss-Prot protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 2003;31:365–370.
    4. De Bie T, et al. Kernel-based data fusion for gene prioritization. Bioinformatics. 2007;23:i125–i132.
    5. Diella F, et al. Phospho.ELM: a database of experimentally verified phosphorylation sites in eukaryotic proteins. BMC Bioinformatics. 2004;5:79.