MolData, a molecular benchmark for disease and target based machine learning

Arash Keshavarzi Arshadi; Milad Salem; Arash Firouzbakht; Jiann Shiun Yuan

doi:10.1186/s13321-022-00590-y

J Cheminform. 2022 Mar 7;14(1):10. doi: 10.1186/s13321-022-00590-y.

Authors

Arash Keshavarzi Arshadi^#¹, Milad Salem^#², Arash Firouzbakht³, Jiann Shiun Yuan²

Affiliations

¹ Burnett School of Biomedical Sciences, University of Central Florida, Orlando, FL, USA. arashka@knights.ucf.edu.
² Department of Electrical and Computer Engineering, University of Central Florida, Orlando, FL, USA.
³ Department of Chemistry, University of Illinois at Urbana-Champaign, IL, USA.

^# Contributed equally.

Abstract

Deep learning's automatic feature extraction has been a revolutionary addition to computational drug discovery, combining the ability to learn abstract features with the discovery of complex molecular patterns directly from molecular data. Since biological and chemical knowledge is necessary for overcoming the challenges of data curation, balancing, training, and evaluation, it is important for databases to contain information regarding the exact target and disease of each bioassay. Existing repositories such as PubChem and ChEMBL offer screening data for millions of molecules against a variety of cells and targets; however, their bioassays contain complex biological descriptions, which can hinder their use by the machine learning community. In this work, a comprehensive disease- and target-based dataset is collected from PubChem in order to facilitate and accelerate molecular machine learning for better drug discovery. MolData is one of the largest efforts to date for democratizing molecular machine learning, with roughly 170 million drug screening results from 1.4 million unique molecules assigned to specific diseases and targets. It also provides 30 unique categories of targets and diseases. Correlation analysis of the MolData bioassays unveils valuable information for drug repurposing for multiple diseases, including cancer, metabolic disorders, and infectious diseases. Finally, we provide a benchmark of more than 30 models trained on each category using multitask learning. MolData aims to pave the way for computational drug discovery and accelerate the advancement of molecular artificial intelligence in a practical manner. The MolData benchmark data is available at https://GitHub.com/Transilico/MolData as well as within the additional files.
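As an illustration of what a correlation analysis between two bioassays could look like, the following is a minimal sketch and not the authors' procedure. It assumes each assay is a mapping from a molecule identifier to a binary activity label (1 = active, 0 = inactive) and computes the Pearson correlation over only those molecules screened in both assays. The function name, the molecule identifiers, and the toy labels are hypothetical.

```python
from math import sqrt


def assay_correlation(labels_a, labels_b):
    """Pearson correlation of binary activity labels over shared molecules.

    labels_a, labels_b: dicts mapping a molecule ID to 0 (inactive) or 1 (active).
    Returns None when fewer than two molecules are shared or when either
    assay has no label variation on the shared molecules.
    """
    # Only molecules screened in both assays can be compared.
    shared = sorted(set(labels_a) & set(labels_b))
    n = len(shared)
    if n < 2:
        return None

    xs = [labels_a[m] for m in shared]
    ys = [labels_b[m] for m in shared]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # correlation is undefined without variation
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data: three molecules are shared between the first two assays.
assay_1 = {"mol_a": 1, "mol_b": 0, "mol_c": 1, "mol_d": 0}
assay_2 = {"mol_a": 1, "mol_b": 0, "mol_c": 1, "mol_e": 1}
assay_3 = {"mol_a": 0, "mol_b": 1, "mol_c": 0}

same = assay_correlation(assay_1, assay_2)
opposite = assay_correlation(assay_1, assay_3)
```

Restricting the comparison to shared molecules matters because screening data of this kind is sparse: most molecules are tested in only a subset of assays, so a correlation computed over the union would be dominated by missing entries.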

Keywords: Artificial intelligence; Benchmark; Big data; Biological assays; Database; Drug discovery; Machine learning; PubChem.