Electronic van der Waals surface property descriptors and genetic algorithms for developing structure-activity correlations in olfactory databases

J Chem Inf Comput Sci. Nov-Dec 2003;43(6):1890-905. doi: 10.1021/ci030016j.


A methodology to facilitate the intelligent design of new odorants (e.g., musks) with specialized properties has been developed as part of an ongoing research effort in machine learning. In a traditional framework, introducing a new odorant is a lengthy, costly, and laborious process of discovery, development, and testing. We propose to streamline this process by using large existing olfactory databases available in the open scientific literature as input for a new structure-activity correlation methodology. The first step is to characterize each molecule in the database by an appropriate set of descriptors. To accomplish this, an enhanced version of Breneman's Transferable Atom Equivalent (TAE) descriptor methodology is used to create a large set of electron-density-derived shape/property hybrid (PEST), wavelet coefficient (WCD), and TAE histogram descriptors. We have chosen these molecular property descriptors because they have been shown to capture pertinent shape and electronic properties of the molecule and to correlate with key modes of intermolecular interaction. Traditional QSAR methodologies, which employ fragment-based descriptors, have been shown to be effective for QSAR development within homologous sets of molecules but are less effective when applied to data sets containing a great deal of structural variation. In contrast to previous attempts at SAR, our use of shape-aware, electron-density-based molecular property descriptors has removed many of the limitations brought about by the use of descriptors based on substructure fragments, molecular surface properties, or other whole-molecule descriptors. Another reason for the mixed success of past QSAR efforts can be traced to the nature of the underlying modeling problem, which is often quite complex.
To meet these challenges, a genetic algorithm for pattern recognition analysis has been developed that selects descriptors which create class separation in a plot of the two largest principal components of the data while simultaneously searching for features that increase clustering of the data.
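The descriptor-selection idea can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the fitness function, so the sketch assumes a simple stand-in, the fraction of each sample's nearest neighbours in the two-component score plot that share its class. That proxy rewards class separation only and does not model the separate clustering criterion mentioned above. The function names (`pc_scores`, `fitness`, `select_descriptors`) and all parameter values are hypothetical, chosen for illustration.

```python
import numpy as np

def pc_scores(X, n_components=2):
    """Project autoscaled data onto its largest principal components."""
    Xc = X - X.mean(axis=0)
    std = Xc.std(axis=0)
    std[std == 0] = 1.0              # leave constant columns at zero instead of dividing by 0
    Xc = Xc / std
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * S[:n_components]

def fitness(X, y, mask, k=3):
    """Assumed proxy: mean share of k nearest neighbours (in the PC plot) with the same class."""
    if mask.sum() < 2:               # a two-component plot needs at least two descriptors
        return 0.0
    T = pc_scores(X[:, mask])
    d = np.linalg.norm(T[:, None, :] - T[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)      # a sample is not its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]
    return float((y[nn] == y[:, None]).mean())

def select_descriptors(X, y, pop_size=30, generations=40, p_mut=0.05, seed=0):
    """Toy genetic algorithm over boolean descriptor masks, with elitism."""
    rng = np.random.default_rng(seed)
    n_feat = X.shape[1]
    pop = rng.random((pop_size, n_feat)) < 0.3
    best_mask, best_fit = None, -1.0
    for _ in range(generations):
        fits = np.array([fitness(X, y, m) for m in pop])
        order = np.argsort(fits)[::-1]
        if fits[order[0]] > best_fit:
            best_fit, best_mask = float(fits[order[0]]), pop[order[0]].copy()
        parents = pop[order[:pop_size // 2]]       # truncation selection: keep the top half
        children = []
        while len(children) < pop_size - 1:
            a, b = parents[rng.integers(len(parents), size=2)]
            child = np.where(rng.random(n_feat) < 0.5, a, b)   # uniform crossover
            child ^= rng.random(n_feat) < p_mut                # bit-flip mutation
            children.append(child)
        pop = np.vstack([best_mask] + children)    # carry the best mask forward unchanged
    return best_mask, best_fit
```

Calling `select_descriptors(X, y)` with a samples-by-descriptors array `X` and an integer class vector `y` returns a boolean mask over the descriptor columns and its score under the assumed proxy. A fitness function closer to the one described above would also need a term for clustering of the data.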