Analysing and Navigating Natural Products Space for Generating Small, Diverse, But Representative Chemical Libraries

Biotechnol J. 2018 Jan;13(1). doi: 10.1002/biot.201700503. Epub 2017 Dec 6.


Armed with the digital availability of two natural products libraries, amounting to some 195 885 molecular entities, we ask how best to sample from them to maximize their "representativeness" in smaller and more usable libraries of 96, 384, 1152, and 1920 molecules. The term "representativeness" is intended to include diversity, but for numerical reasons (and to retain a realistic chance of performing a QSAR analysis) it is necessary to focus on areas of chemical space that are more highly populated. Encoding chemical structures as fingerprints using the RDKit "patterned" algorithm, we first assess the granularity of the natural products space using a simple clustering algorithm, showing that there are major regions of "denseness" but also a great many very sparsely populated areas. We then apply a "hybrid" hierarchical K-means clustering algorithm to the data to produce more statistically robust clusters from which representative and appropriate numbers of samples may be chosen. There is necessarily again a trade-off between cluster size and cluster number, but within these constraints, libraries containing 384 or 1152 molecules can be found that come from clusters representing some 18% and 30% of the whole chemical space, with cluster sizes of, respectively, 50 and 27 or above, just about sufficient to perform a QSAR. By using the online availability of molecules via the Molport system, we are also able to construct (and, for the first time, provide the contents of) a small virtual library of available molecules that provides effective coverage of the chemical space described. Consistent with this, the average molecular similarity of the contents of the libraries developed is considerably lower than that of the original libraries. The suggested libraries may be of use in molecular or phenotypic screening, including for determining possible transporter substrates.
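The sampling idea in the abstract (keep only well-populated clusters, take a representative from each, then check that the resulting library is less self-similar than the source) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The fingerprints are toy sets of on-bit indices standing in for RDKit pattern fingerprints, the cluster labels are assumed to come from some earlier clustering step, and choosing the medoid as the representative is an assumption made here for illustration. The function names are hypothetical.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def mean_pairwise_similarity(fps):
    """Average Tanimoto similarity over all distinct pairs of fingerprints."""
    pairs = list(combinations(fps, 2))
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


def pick_representatives(fps, labels, min_cluster_size):
    """Return one medoid index per cluster with at least min_cluster_size members.

    Assumption: the medoid (the member most similar to its cluster mates)
    is taken as the representative; the source does not specify this rule.
    """
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    chosen = []
    for label in sorted(clusters):
        members = clusters[label]
        if len(members) < min_cluster_size:
            continue  # sparsely populated region: skipped
        medoid = max(
            members,
            key=lambda i: sum(tanimoto(fps[i], fps[j]) for j in members),
        )
        chosen.append(medoid)
    return chosen


# Toy data: two dense clusters and one singleton (all values are made up).
fps = [
    {1, 2, 3, 4}, {1, 2, 3}, {1, 2, 3, 5},        # cluster 0
    {10, 11, 12}, {10, 11, 12, 13}, {10, 11, 12, 14},  # cluster 1
    {20, 21},                                      # cluster 2 (singleton)
]
labels = [0, 0, 0, 1, 1, 1, 2]

library = pick_representatives(fps, labels, min_cluster_size=3)
source_similarity = mean_pairwise_similarity(fps)
library_similarity = mean_pairwise_similarity([fps[i] for i in library])
```

In this toy case the singleton cluster is dropped, one molecule is drawn from each dense cluster, and the library's mean pairwise similarity falls below that of the source set. This mirrors, in miniature, the trade-off between cluster size and cluster number described above.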

Keywords: cheminformatics; drug transporters; encodings; endogenites; maximum common substructure; metabolomics.

MeSH terms

  • Algorithms
  • Biological Products / chemistry*
  • Biological Products / classification
  • Drug Discovery
  • Models, Molecular
  • Molecular Structure*
  • Quantitative Structure-Activity Relationship
  • Small Molecule Libraries / chemistry*
  • Small Molecule Libraries / classification
  • Small Molecule Libraries / therapeutic use


Substances

  • Biological Products
  • Small Molecule Libraries