ApoPred: Identification of Apolipoproteins and Their Subfamilies With Multifarious Features

Ting Liu; Jia-Mao Chen; Dan Zhang; Qian Zhang; Bowen Peng; Lei Xu; Hua Tang

doi:10.3389/fcell.2020.621144

Front Cell Dev Biol. 2021 Jan 8:8:621144. doi: 10.3389/fcell.2020.621144. eCollection 2020.

Authors

Ting Liu¹, Jia-Mao Chen¹, Dan Zhang², Qian Zhang¹, Bowen Peng³, Lei Xu⁴, Hua Tang¹,⁵

Affiliations

¹ School of Basic Medical Sciences, Southwest Medical University, Luzhou, China.
² Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China.
³ Division of International Cooperation, Health Commission of Sichuan Province, Chengdu, China.
⁴ School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, China.
⁵ Central Nervous System Drug Key Laboratory of Sichuan Province, Luzhou, China.

Abstract

Apolipoproteins are a group of plasma proteins associated with a variety of diseases, such as hyperlipidemia, atherosclerosis, Alzheimer's disease, and diabetes. To investigate the function of apolipoproteins and to develop effective targets for related diseases, it is necessary to identify and classify apolipoproteins accurately. Although apolipoproteins can be identified accurately through biochemical experiments, such experiments are expensive and time-consuming. This work aims to establish an efficient and accurate prediction model for recognizing apolipoproteins and their subfamilies. We first constructed a high-quality benchmark dataset of 270 apolipoproteins and 535 non-apolipoproteins. Sequences in the dataset were encoded with pseudo-amino acid composition (PseAAC) and composition of k-spaced amino acid pairs (CKSAAP) as input feature vectors. To improve prediction accuracy and eliminate redundant information, analysis of variance (ANOVA) was used to rank the features, and incremental feature selection was used to obtain the best feature subset. A support vector machine (SVM) was used to construct the classification model, which achieved an accuracy of 97.27%, a sensitivity of 96.30%, and a specificity of 97.76% in discriminating apolipoproteins from non-apolipoproteins in 10-fold cross-validation. The same process was then repeated to build a second model for predicting apolipoprotein subfamilies, which achieved an overall accuracy of 95.93% in 10-fold cross-validation. Based on the proposed models, a convenient web server called ApoPred was established, which can be freely accessed at http://tang-biolab.com/server/ApoPred/service.html. We expect this work to contribute to research on apolipoprotein function and to drug development for related diseases.
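
To make the feature pipeline described above more concrete, the following is a minimal Python sketch of two of its steps: CKSAAP encoding of a protein sequence and a one-way ANOVA F-score for ranking a single feature. It is an illustration under stated assumptions, not the authors' implementation. The function names (cksaap, anova_f) and the default gap range k_max=2 are hypothetical, since the abstract does not state which k values were used, and PseAAC, incremental feature selection, and the SVM itself are not shown.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs of the 20 standard amino acids, in a fixed order.
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def cksaap(sequence, k_max=2):
    """Return CKSAAP frequencies for gaps k = 0..k_max (400 values per gap).

    k_max=2 is an assumed default; the abstract does not give the range.
    """
    seq = sequence.upper()
    features = []
    for k in range(k_max + 1):
        counts = dict.fromkeys(PAIRS, 0)
        total = 0
        # Residues i and i+k+1 form a pair separated by k other residues.
        for i in range(len(seq) - k - 1):
            pair = seq[i] + seq[i + k + 1]
            if pair in counts:  # skip pairs containing non-standard residues
                counts[pair] += 1
                total += 1
        # Normalize counts to frequencies; all zeros if no valid pair exists.
        features.extend(counts[p] / total if total else 0.0 for p in PAIRS)
    return features


def anova_f(values, labels):
    """One-way ANOVA F-score of one feature across two or more classes."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    n, g = len(values), len(groups)
    grand_mean = sum(values) / n
    # Between-group and within-group sums of squares.
    between = sum(
        len(x) * (sum(x) / len(x) - grand_mean) ** 2 for x in groups.values()
    )
    within = sum(
        (v - sum(x) / len(x)) ** 2 for x in groups.values() for v in x
    )
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return (between / (g - 1)) / (within / (n - g))
```

In an incremental feature selection scheme, features would typically be sorted by descending F-score and added one at a time, with each growing subset evaluated by a cross-validated classifier; the subset giving the best accuracy is kept.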

Keywords: apolipoprotein; identification; machine learning; multiple features; subfamily-classification.