Accelerating Big Data Analysis through LASSO-Random Forest Algorithm in QSAR Studies

Bioinformatics. 2022 Jan 3;38(2):469-475. doi: 10.1093/bioinformatics/btab659.

Abstract

Motivation: The aim of quantitative structure-activity prediction (QSAR) studies is to identify novel drug-like molecules that can be suggested as lead compounds by means of two approaches, which are discussed in this article. First, to identify appropriate molecular descriptors by focusing on one feature-selection algorithms; and second to predict the biological activities of designed compounds. Recent studies have shown increased interest in the prediction of a huge number of molecules, known as Big Data, using deep learning models. However, despite all these efforts to solve critical challenges in QSAR models, such as over-fitting, massive processing procedures, is major shortcomings of deep learning models. Hence, finding the most effective molecular descriptors in the shortest possible time is an ongoing task. One of the successful methods to speed up the extraction of the best features from big datasets is the use of least absolute shrinkage and selection operator (LASSO). This algorithm is a regression model that selects a subset of molecular descriptors with the aim of enhancing prediction accuracy and interpretability because of removing inappropriate and irrelevant features.

Results: To implement and test our proposed model, a random forest was built to predict the molecular activities of Kaggle competition compounds. Finally, the prediction results and computation time of the suggested model were compared with the other well-known algorithms, i.e. Boruta-random forest, deep random forest and deep belief network model. The results revealed that improving output correlation through LASSO-random forest leads to appreciably reduced implementation time and model complexity, while maintaining accuracy of the predictions.

Supplementary information: Supplementary data are available at Bioinformatics online.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Big Data*
  • Data Analysis
  • Quantitative Structure-Activity Relationship*