P-glycoprotein (P-gp) actively transports a wide variety of chemically diverse compounds out of cells. It is closely associated with the ADMET properties of drugs and drug candidates and, moreover, plays a major role in the multidrug resistance (MDR) phenomenon, which leads to the failure of chemotherapy in cancer treatment. Therefore, recognizing potential P-gp substrates at the early stages of the drug discovery process is important. Here, we compiled an extensive data set containing 423 P-gp substrates and 399 nonsubstrates, which is the largest P-gp substrate/nonsubstrate data set yet published. Comparison of the distributions of eight important physicochemical properties for the substrates and nonsubstrates reveals that molecular weight and molecular solubility are the informative attributes differentiating P-gp substrates from nonsubstrates. Examination of the distributions of the eight physicochemical properties for 735 P-gp inhibitors and 423 substrates shows that inhibitors are significantly more hydrophobic than substrates, while substrates tend to have more H-bond donors than inhibitors. Classification models based on simple molecular properties, topological descriptors, and molecular fingerprints were then developed using the naive Bayesian classification technique. The best naive Bayesian classifier yields a Matthews correlation coefficient of 0.824 and a prediction accuracy of 91.2% for the training set in 5-fold cross-validation, and a Matthews correlation coefficient of 0.667 and a prediction accuracy of 83.5% for the test set containing 200 molecules. Analysis of the important structural fragments given by the Bayesian classifier shows that H-bond acceptors arranged in distinct spatial patterns, together with molecular flexibility, are essential for P-gp substrate-likeness, which affords a deeper understanding of the molecular basis of substrate/P-gp interaction.
Finally, the reasons for mispredictions are discussed. The presented classifier can serve as a reliable virtual screening tool for identifying potential substrates of P-gp.
Keywords: ADME; ADMET; P-glycoprotein; fingerprint; naive Bayesian classification; substrates.
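For readers unfamiliar with the two performance measures quoted above, the following minimal sketch shows how prediction accuracy and the Matthews correlation coefficient are obtained from the four cells of a binary confusion matrix (substrate = positive class). The function names and the example counts are hypothetical and chosen only for illustration; they are not the confusion matrix of the reported classifier.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all molecules that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix.

    Ranges from -1 (total disagreement) through 0 (no better than chance)
    to +1 (perfect prediction). Returns 0.0 when the denominator is zero,
    a common convention for degenerate matrices.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


# Hypothetical counts for a 200-molecule test set (illustrative only).
tp, tn, fp, fn = 90, 80, 20, 10
print(f"accuracy = {accuracy(tp, tn, fp, fn):.3f}")
print(f"MCC      = {mcc(tp, tn, fp, fn):.3f}")
```

Unlike accuracy, the Matthews correlation coefficient uses all four cells of the confusion matrix, so it remains informative when the two classes are not equally well predicted.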