Prediction of aqueous solubility based on large datasets using several QSPR models utilizing topological structure representation

Chem Biodivers. 2004 Nov;1(11):1829-41. doi: 10.1002/cbdv.200490137.


Several QSPR models were developed for predicting intrinsic aqueous solubility, S(o). A data set of 5,964 neutral compounds was sub-divided into two classes, aromatic and non-aromatic compounds. Three models were created with different methods on both data sets: two regression models (multiple linear regression and partial least squares) and an artificial neural network model. These models were based on 3343 aromatic and 1674 non-aromatic compounds for training sets; 938 compounds were used in external validation testing. The range in -log S(o) is -1.6 to 10. Topological structure descriptors were used with all models. A genetic algorithm was used for descriptor selection for regression models. For the artificial neural network (ANN) model, descriptor selection was done with a backward elimination process. All models performed well with r2 values ranging 0.72 to 0.84 in external validation testing. The mean absolute errors in validation ranged from 0.44 to 0.80 for the classes of compounds for all the models. These statistical results indicate a sound ANN model. Furthermore, in a comparison with eight other available models, based on predictions using a validation test set (442 compounds), the artificial neural network model presented in this work (CSLogWS) was clearly superior based on both the mean absolute error and the percentage of residuals less than one log unit. In the ANN model both E-State and hydrogen E-State descriptors were found to be important.

MeSH terms

  • Databases, Factual*
  • Models, Molecular*
  • Molecular Structure
  • Predictive Value of Tests
  • Quantitative Structure-Activity Relationship*
  • Solubility
  • Water / chemistry*


  • Water