P-glycoprotein substrate models using support vector machines based on a comprehensive data set

J Chem Inf Model. 2011 Jun 27;51(6):1447-56. doi: 10.1021/ci2001583. Epub 2011 Jun 3.


P-glycoprotein (P-gp) is one of the major ABC transporters and is involved in many essential processes, such as lipid and steroid transport across cell membranes, as well as in the uptake of drugs such as HIV protease and reverse transcriptase inhibitors. Despite its importance, reliable models for predicting P-gp substrates are scarce. In this study, we built several computational models to predict whether a compound is a P-gp substrate, based on the largest data set published to date, which comprises 332 distinct structures. Each molecule is represented by ADRIANA.Code, MOE, and ECFP_4 fingerprint descriptors. The models are computed with a support vector machine on a training set of 131 substrates and 81 nonsubstrates and are evaluated by 5-fold, 10-fold, and leave-one-out (LOO) cross-validation. The best model gives a Matthews correlation coefficient of 0.73 and a prediction accuracy of 0.88 on the test set. Examination of the model based on ECFP_4 fingerprints revealed several substructures that may help separate P-gp substrates from nonsubstrates, such as the nitrile and sulfoxide functional groups, which occur more frequently in nonsubstrates than in substrates. In addition, structural isomerism in sugars was found to result in remarkable differences in the likelihood that a compound is a P-gp substrate.
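The two quality measures quoted in the abstract, prediction accuracy and the Matthews correlation coefficient (MCC), can both be computed from the four cells of a binary confusion matrix. The following is a minimal sketch in Python using the standard definitions; the function names and the example counts are illustrative assumptions and are not taken from the study.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary classifier.

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).
    It ranges from -1 (total disagreement) to +1 (perfect prediction).
    By convention, 0.0 is returned when the denominator is zero.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Hypothetical confusion-matrix counts, NOT the paper's results:
    # substrates are the positive class, nonsubstrates the negative class.
    tp, tn, fp, fn = 50, 40, 5, 5
    print("accuracy:", accuracy(tp, tn, fp, fn))
    print("MCC:", matthews_corrcoef(tp, tn, fp, fn))
```

Unlike accuracy, the MCC uses all four cells of the confusion matrix, which makes it less sensitive to class imbalance; this matters for a data set such as the one described here, where substrates outnumber nonsubstrates.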

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • ATP Binding Cassette Transporter, Subfamily B, Member 1 / metabolism*
  • Artificial Intelligence*
  • Computational Biology / methods*
  • Reproducibility of Results

