Background: High-accuracy prediction tools are essential in the post-genomic era to define organellar proteomes in their full complexity. We recently applied a discriminative machine learning approach to predict plant proteins carrying peroxisome targeting signal type 1 (PTS1) from genome sequences. For Arabidopsis thaliana, 392 gene models were predicted to be peroxisome-targeted. The predictions were extensively tested in vivo, resulting in a high experimental verification rate for Arabidopsis proteins not previously known to be peroxisomal.
Results: In this study, we experimentally validated the predictions in greater depth by focusing on the most challenging Arabidopsis proteins, those with unknown non-canonical PTS1 tripeptides and prediction scores close to the threshold. By in vivo subcellular targeting analysis, three novel PTS1 tripeptides (QRL>, SQM>, and SDL>) and two novel tripeptide residues (Q at position -3 and D at position -2) were identified. To understand why these proteins, among many Arabidopsis proteins carrying the same C-terminal tripeptides, were specifically predicted as peroxisomal, the residues upstream of the PTS1 tripeptide were computationally permuted and the changes in prediction scores were analyzed. The newly identified Arabidopsis proteins were found to contain four to five amino acid residues with high predicted targeting-enhancing properties at positions -4 to -12, upstream of the non-canonical PTS1 tripeptide. The identity of the predicted targeting-enhancing residues was unexpectedly diverse, comprising, besides basic residues, proline, hydroxylated (Ser, Thr), hydrophobic (Ala, Val), and even acidic residues.
Conclusions: Our computational and experimental analyses demonstrate that the plant PTS1 tripeptide motif is more diverse than previously thought, including an increasing number of non-canonical sequences and allowed residues. Specific targeting-enhancing elements can be predicted for particular sequences of interest and are far more diverse in amino acid composition and positioning than previously assumed. Machine learning methods are becoming indispensable for predicting which specific proteins, among numerous candidates carrying the same non-canonical PTS1 tripeptide, contain sufficient enhancer elements in terms of number, positioning, and total strength to cause peroxisome targeting.
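
The upstream-residue analysis described in the Results can be illustrated with a minimal sketch. It assumes a single-residue substitution scan, which is only one possible reading of "computationally permuted" and not necessarily the procedure used in the study. The scoring function `toy_score`, the helper `substitution_scan`, and the example sequence are all hypothetical placeholders; a real analysis would call the trained PTS1 predictor instead.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_score(seq):
    """Hypothetical stand-in for a trained PTS1 predictor (NOT the published model).

    Rewards basic residues (K, R) and proline at positions -12 to -4.
    """
    upstream = seq[-12:-3]  # positions -12 .. -4; the tripeptide is -3 .. -1
    basic = sum(1.0 for aa in upstream if aa in "KR")
    proline = sum(0.5 for aa in upstream if aa == "P")
    return basic + proline


def substitution_scan(seq, score_fn, first=-12, last=-4):
    """Substitute each upstream residue and record the change in score.

    Returns a dict mapping (position, new_residue) to
    score(mutant) - score(wild type). The C-terminal tripeptide is never
    modified because only positions first..last are scanned.
    """
    base = score_fn(seq)
    deltas = {}
    for pos in range(first, last + 1):
        idx = len(seq) + pos  # convert C-terminal position to a 0-based index
        if idx < 0:
            continue  # sequence is shorter than the scanned window
        for aa in AMINO_ACIDS:
            if aa == seq[idx]:
                continue  # skip the native residue
            mutant = seq[:idx] + aa + seq[idx + 1:]
            deltas[(pos, aa)] = score_fn(mutant) - base
    return deltas


# Invented example sequence ending in the non-canonical tripeptide QRL>.
EXAMPLE_SEQ = "MAAAAKPSRVAAQRL"

if __name__ == "__main__":
    deltas = substitution_scan(EXAMPLE_SEQ, toy_score)
    # Native residues whose replacement lowers the score are candidate enhancers.
    losses = sorted({pos for (pos, _), d in deltas.items() if d < 0})
    print("positions with candidate enhancing residues:", losses)
```

A strongly negative score change marks a position where the native residue contributes to the predicted targeting, while a positive change marks a substitution that would strengthen it. How such changes relate to the prediction threshold depends on the actual model.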