cmFSM: a scalable CPU-MIC coordinated drug-finding tool by frequent subgraph mining

Shunyun Yang; Runxin Guo; Rui Liu; Xiangke Liao; Quan Zou; Benyun Shi; Shaoliang Peng

doi:10.1186/s12859-018-2071-z

BMC Bioinformatics. 2018 May 8;19(Suppl 4):98. doi: 10.1186/s12859-018-2071-z.

Authors

Shunyun Yang¹, Runxin Guo¹, Rui Liu², Xiangke Liao¹, Quan Zou³, Benyun Shi⁴, Shaoliang Peng⁵ ⁶

Affiliations

¹ School of Computer Science, National University of Defense Technology, Changsha, 410073, China.
² Department of Oncology, The Second Xiangya Hospital of Central South University, Changsha, 410011, China.
³ School of Computer Science and Technology, Tianjin University, Tianjin, 300072, China. zouquan@nclab.net.
⁴ School of Cyberspace, Hangzhou Dianzi University, Hangzhou, 310018, China. benyunshi@outlook.com.
⁵ College of Computer Science and Electronic Engineering & National Supercomputer Centre in Changsha, Hunan University, Changsha, 410082, China. pengshaoliang@nudt.edu.cn.
⁶ School of Computer Science, National University of Defense Technology, Changsha, 410073, China. pengshaoliang@nudt.edu.cn.

Abstract

Background: Frequent subgraph mining is an important problem in many practical domains. In particular, it can be applied to large-scale libraries of drug molecules or biological structures to find drugs or core biological structures rapidly and to predict the toxicity of unknown compounds. The main challenge is efficiency, because (i) testing for graph isomorphism is computationally intensive, and (ii) both the graph collection to be mined and the mining results can be very large. Existing solutions often require days to derive mining results from biological networks, even with a relatively low support threshold. Moreover, the complete mining results often cannot be stored in the memory of a single node.
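To make the notion of a support threshold concrete, the following minimal sketch counts the support of the smallest possible subgraphs, single labelled edges, across a small graph collection. It is an illustration only and is not taken from cmFSM: the graph representation and the names `edge_patterns` and `frequent_edges` are assumptions made for this example, and restricting patterns to single edges deliberately sidesteps the isomorphism testing that larger patterns require.

```python
from collections import Counter

# Assumed representation (not from cmFSM): a graph is (node_labels, edges),
# where node_labels maps node id -> label and edges is a list of
# (u, v, edge_label) tuples for an undirected labelled graph.

def edge_patterns(graph):
    """Return the set of distinct labelled-edge patterns in one graph."""
    node_labels, edges = graph
    patterns = set()
    for u, v, edge_label in edges:
        # Sort the endpoint labels so that (C, O) and (O, C) are one pattern.
        a, b = sorted((node_labels[u], node_labels[v]))
        patterns.add((a, edge_label, b))
    return patterns

def frequent_edges(graphs, min_support):
    """Keep edge patterns that occur in at least min_support graphs."""
    support = Counter()
    for g in graphs:
        # A set is used per graph, so each graph counts once per pattern.
        support.update(edge_patterns(g))
    return {p: s for p, s in support.items() if s >= min_support}

# Toy collection of three molecule-like graphs (made-up data).
graphs = [
    ({0: "C", 1: "O", 2: "N"}, [(0, 1, "double"), (0, 2, "single")]),
    ({0: "C", 1: "O"}, [(0, 1, "double")]),
    ({0: "N", 1: "C", 2: "C"}, [(0, 1, "single"), (1, 2, "single")]),
]
result = frequent_edges(graphs, min_support=2)
```

Lowering `min_support` to 1 would also admit the carbon-carbon edge, which hints at why low thresholds enlarge the result set.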

Results: In this paper, we implement cmFSM, a parallel acceleration tool for a classical frequent subgraph mining algorithm. The core idea is to parallelize the extension tasks in order to reduce computation time. In addition, we employ a multi-node strategy to overcome memory constraints. The parallel optimization of cmFSM is carried out at three levels: fine-grained OpenMP parallelization on a single node, multi-node multi-process parallel acceleration, and CPU-MIC collaborative parallel optimization.
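The two-level idea of splitting independent tasks across nodes and then across threads within a node can be sketched roughly as follows. This is a hedged toy, not the cmFSM implementation: a loop over partitions stands in for separate processes on separate nodes, a Python thread pool stands in for OpenMP threads, the per-task work is a simple support count rather than a real pattern extension, and the names `partition`, `support` and `mine_on_node` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(tasks, num_nodes):
    """Round-robin split of the task list, one sub-list per hypothetical node."""
    return [tasks[i::num_nodes] for i in range(num_nodes)]

def support(pattern, graphs):
    """Toy task: count the graphs whose edge set contains the pattern."""
    return sum(1 for g in graphs if pattern in g)

def mine_on_node(tasks, graphs, num_threads=2):
    """Process one node's tasks on a thread pool and return its local results."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # pool.map preserves task order, so zip pairs each task with its count.
        counts = list(pool.map(lambda p: support(p, graphs), tasks))
    return dict(zip(tasks, counts))

# Made-up data: each graph is reduced to a set of (label, label) edges.
graphs = [
    {("C", "O"), ("C", "N")},
    {("C", "O")},
    {("C", "N"), ("C", "C")},
]
tasks = [("C", "O"), ("C", "N"), ("C", "C"), ("N", "O")]

# Each partition would run on its own node; here they run one after another.
merged = {}
for node_tasks in partition(tasks, num_nodes=2):
    merged.update(mine_on_node(node_tasks, graphs))
```

Because each node keeps only the results for its own tasks until the final merge, a scheme of this shape also spreads the result set across nodes, which is the kind of memory relief the multi-node strategy aims for.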

Conclusions: Evaluation results show that cmFSM clearly outperforms existing state-of-the-art miners even when only modest parallel computing resources are available. cmFSM therefore offers a practical solution to frequent subgraph mining problems that produce a very large number of results. Specifically, our solution is up to one order of magnitude faster than the best CPU-based approach on a single node and shows promising scalability for massive mining tasks in the multi-node scenario. The source code is available at https://github.com/ysycloud/cmFSM.

Keywords: Bioinformatics; Frequent subgraph mining; Isomorphism; Many Integrated Core (MIC); Memory constraints.

MeSH terms

Algorithms*
Data Mining*
Databases as Topic
Drug Evaluation, Preclinical*
Software*