Mapping of the Available Chemical Space versus the Chemical Universe of Lead-Like Compounds

ChemMedChem. 2018 Mar 20;13(6):540-554. doi: 10.1002/cmdc.201700561. Epub 2018 Jan 29.


This is, to our knowledge, the most comprehensive analysis to date based on generative topographic mapping (GTM) of fragment-like chemical space (40 million molecules with no more than 17 heavy atoms, both from the theoretically enumerated GDB-17 and real-world PubChem/ChEMBL databases). The challenge was to prove that a robust map of fragment-like chemical space can actually be built, in spite of a limited (≪105 ) maximal number of compounds ("frame set") usable for fitting the GTM manifold. An evolutionary map building strategy has been updated with a "coverage check" step, which discards manifolds failing to accommodate compounds outside the frame set. The evolved map has a good propensity to separate actives from inactives for more than 20 external structure-activity sets. It was proven to properly accommodate the entire collection of 40 m compounds. Next, it served as a library comparison tool to highlight biases of real-world molecules (PubChem and ChEMBL) versus the universe of all possible species represented by FDB-17, a fragment-like subset of GDB-17 containing 10 million molecules. Specific patterns, proper to some libraries and absent from others (diversity holes), were highlighted.

Keywords: computer chemistry; generative topographic mapping; library comparison; molecular diversity; structure analysis.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms*
  • Databases, Chemical*
  • Molecular Structure
  • Small Molecule Libraries / chemistry*


  • Small Molecule Libraries