OptDesign: extending optimizable k-dissimilarity selection to combinatorial library design

J Chem Inf Comput Sci. May-Jun 2003;43(3):829-36. doi: 10.1021/ci025662h.

Abstract

Optimizable k-dissimilarity (OptiSim) selection entails drawing a series of subsamples of size k from a population and choosing the "best" candidate from each such subsample for inclusion in the selection set. By varying the size of the subsample, one can control the balance between representativeness and diversity in the selection set obtained. In the original formulation, a uniform random sampling from among valid candidates was used to draw the subsamples from a single target population. Here we describe in detail two key modifications that serve to extend the OptiSim methodology to vector selection for interdependent variables, specifically as applied to the design of combinatorial sublibraries. The first modification involves pivoting between variables: subsamples are drawn from each reagent pool in turn, with the viability of each candidate being evaluated in isolation as well as in terms of the products it will produce from complementary reagents already selected. The filters applied may be static or dynamic in nature, with molecular weight and hydrophobicity being examples of the former and structural diversity with respect to reagents already selected being an example of the latter. The second key modification is adding the ability to bias the selection of candidate reagents for inclusion in the subsamples. Taken together, these modifications support the efficient generation of multiblock and other sparse matrix designs that are both representative and diverse, and for which "backfilling" of designs edited to remove undesirable reagents or products is straightforward. The method is intrinsically fast and efficient, since enumeration of the full combinatorial is not required; only those candidates actually considered for inclusion need be evaluated. Moreover, because the subsample selection step is separate from the diversity-based selection of the "best" candidate, incorporating such bias in favor of a competing criterion such as low price provides a "natural," nonparametric mechanism for generating designs that are likely to be "good" in a double-objective, Pareto sense.
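
As a rough illustration of the selection loop described above, the following Python sketch draws a biased subsample of size k from a single pool and keeps the candidate most dissimilar to those already selected. It is a minimal sketch under stated assumptions, not the published implementation: the function names (`weighted_subsample`, `optisim_select`), the `weights` and `min_dist` parameters, and the example data are all hypothetical; taking "best" to mean the largest minimum distance to the selected set is an assumption; unselected subsample members are simply returned to the pool; and pivoting between reagent pools and product-level filters are not modeled.

```python
import random


def weighted_subsample(rng, candidates, weights, k):
    """Draw up to k distinct candidates, each draw biased by its weight.

    Assumes every weight is strictly positive.
    """
    pool = list(candidates)
    picked = []
    while pool and len(picked) < k:
        choice = rng.choices(pool, weights=[weights[i] for i in pool], k=1)[0]
        picked.append(choice)
        pool.remove(choice)
    return picked


def optisim_select(points, n_select, k, distance, weights=None,
                   min_dist=0.0, seed=0):
    """Return indices chosen by a simplified k-subsample selection.

    k = 1 reduces to (biased) random selection, which favors
    representativeness; a large k approaches max-min selection, which
    favors diversity.
    """
    rng = random.Random(seed)
    if weights is None:
        weights = [1.0] * len(points)  # uniform sampling, no bias
    remaining = list(range(len(points)))

    # Start from a single (possibly biased) random draw.
    selected = weighted_subsample(rng, remaining, weights, 1)
    remaining.remove(selected[0])

    def gap(i):
        # Distance from candidate i to its nearest already-selected item.
        return min(distance(points[i], points[j]) for j in selected)

    while len(selected) < n_select:
        # Dynamic filter: drop candidates too similar to the selection so far.
        remaining = [i for i in remaining if gap(i) >= min_dist]
        if not remaining:
            break
        subsample = weighted_subsample(rng, remaining, weights, k)
        best = max(subsample, key=gap)  # most dissimilar member of the subsample
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    # Hypothetical one-dimensional descriptors and prices, for illustration only.
    descriptors = [0.0, 0.5, 1.0, 4.0, 4.2, 9.0]
    prices = [10.0, 2.0, 5.0, 1.0, 20.0, 3.0]
    cheap_bias = [1.0 / p for p in prices]  # cheaper reagents are drawn more often
    chosen = optisim_select(descriptors, n_select=3, k=2,
                            distance=lambda a, b: abs(a - b),
                            weights=cheap_bias, min_dist=0.4)
    print(chosen)
```

Because the bias acts only on which candidates enter each subsample, while the choice within a subsample depends only on dissimilarity, the two criteria stay decoupled in this sketch without any explicit weighting between them.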