DNA-encoded libraries (DELs) offer an efficient way to synthesize and screen billions of small molecules against a target of interest. The resulting real-world binding data can improve the training of machine learning models. However, a key challenge with DEL data is severe class imbalance: for any given target, inactive compounds far outnumber active ones, which can heavily skew the training process. In this study, we explore several strategies for undersampling the majority (inactive) class. These strategies are benchmarked against random selection and tested on two DEL datasets with three machine learning models. Overall, the max_sim strategy achieves the best scores, and the general pipeline is implemented in the DELight package.
Keywords: algorithms; cluster chemistry; molecular simulation.