Background: Epidermal Growth Factor Receptor (EGFR) is a well-characterized cancer drug target. In the past, several QSAR models have been developed for predicting inhibition activity of molecules against EGFR. These models are useful to a limited set of molecules for a particular class like quinazoline-derivatives. In this study, an attempt has been made to develop prediction models on a large set of molecules (~3500 molecules) that include diverse scaffolds like quinazoline, pyrimidine, quinoline and indole.
Results: We train, test and validate our classification models on a dataset called EGFR10 that contains 508 inhibitors (having inhibition activity IC50 less than 10 nM) and 2997 non-inhibitors. Our Random forest based model achieved maximum MCC 0.49 with accuracy 83.7% on a validation set using 881 PubChem fingerprints. In this study, frequency-based feature selection technique has been used to identify best fingerprints. It was observed that PubChem fingerprints FP380 (C(~O) (~O)), FP579 (O = C-C-C-C), FP388 (C(:C) (:N) (:N)) and FP 816 (ClC1CC(Br)CCC1) are more frequent in the inhibitors in comparison to non-inhibitors. In addition, we created different datasets namely EGFR100 containing inhibitors having IC50 < 100 nM and EGFR1000 containing inhibitors having IC50 < 1000 nM. We trained, test and validate our models on datasets EGFR100 and EGFR1000 datasets and achieved and maximum MCC 0.58 and 0.71 respectively. In addition, models were developed for predicting quinazoline and pyrimidine based EGFR inhibitors.
Conclusions: In summary, models have been developed on a large set of molecules of various classes for discriminating EGFR inhibitors and non-inhibitors. These highly accurate prediction models can be used to design and discover novel EGFR inhibitors. In order to provide service to the scientific community, a web server/standalone EGFRpred also has been developed ( http://crdd.osdd.net/oscadd/egfrpred/ ).