Classifying alkaliphilic proteins using embeddings from protein language model

Meredita Susanty; Muhammad Khaerul Naim Mursalim; Rukman Hertadi; Ayu Purwarianti; Tati LE Rajab

doi:10.1016/j.compbiomed.2024.108385

Comput Biol Med. 2024 May:173:108385. doi: 10.1016/j.compbiomed.2024.108385. Epub 2024 Mar 26.

Authors

Meredita Susanty¹, Muhammad Khaerul Naim Mursalim², Rukman Hertadi³, Ayu Purwarianti⁴, Tati LE Rajab⁵

Affiliations

¹ Institut Teknologi Bandung School of Electrical Engineering and Informatics, Jl. Ganesa 10, Bandung, Jawa Barat, Indonesia; Universitas Pertamina, School of Computer Science, Jl Teuku Nyak Arief Jakarta Selatan DKI Jakarta, Indonesia.
² Institut Teknologi Bandung School of Electrical Engineering and Informatics, Jl. Ganesa 10, Bandung, Jawa Barat, Indonesia; Universitas Universal, Kompleks Maha Vihara Duta Maitreya Bukit Beruntung, Sei Panas Batam, 29456, Kepulauan Riau, Indonesia.
³ Institut Teknologi Bandung Faculty of Math and Natural Sciences, Jl. Ganesa 10, Bandung, Jawa Barat, Indonesia.
⁴ Institut Teknologi Bandung School of Electrical Engineering and Informatics, Jl. Ganesa 10, Bandung, Jawa Barat, Indonesia; Center for Artificial Intelligence (U-CoE AI-VLB), Institut Teknologi Bandung, Bandung, Indonesia.
⁵ Institut Teknologi Bandung School of Electrical Engineering and Informatics, Jl. Ganesa 10, Bandung, Jawa Barat, Indonesia. Electronic address: tati@stei.itb.ac.id.

PMID: 38547659
DOI: 10.1016/j.compbiomed.2024.108385

Abstract

Alkaliphilic proteins have great potential as biocatalysts in biotechnology, especially for enzyme engineering. Extensive research has focused on exploring the enzymatic potential of alkaliphiles and characterizing alkaliphilic proteins. However, the current method for identifying these proteins relies on wet-lab experiments, which are time-consuming, labor-intensive, and expensive. Therefore, the development of a computational method for alkaliphilic protein identification would be invaluable for protein engineering and design. In this study, we present a novel approach that uses embeddings from a protein language model called ESM-2(3B) in a deep learning framework to classify alkaliphilic and non-alkaliphilic proteins. To our knowledge, this is the first attempt to employ embeddings from a pre-trained protein language model to classify alkaliphilic proteins. A reliable dataset comprising 1,002 alkaliphilic and 1,866 non-alkaliphilic proteins was constructed for training and testing the proposed model. The proposed model, dubbed ALPACA, achieves performance scores of 0.88, 0.84, and 0.75 for accuracy, F1-score, and Matthews correlation coefficient, respectively, on an independent dataset. ALPACA is likely to serve as a valuable resource for exploring protein alkalinity and its role in protein design and engineering.
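For readers unfamiliar with the three metrics reported above, the following is a minimal sketch of how accuracy, F1-score, and the Matthews correlation coefficient can be computed for a binary classifier from true and predicted labels. It is not the authors' evaluation code: the function name `binary_metrics` and the toy labels are hypothetical and chosen only for illustration (1 = alkaliphilic, 0 = non-alkaliphilic), and the numbers it produces are unrelated to the results reported for ALPACA.

```python
import math


def binary_metrics(y_true, y_pred):
    """Return accuracy, F1-score, and MCC for binary labels (1 = positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    # MCC uses all four confusion-matrix cells, so it remains informative
    # when the two classes are imbalanced.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {"accuracy": accuracy, "f1": f1, "mcc": mcc}


# Toy labels invented for illustration only; not data from the study.
toy_true = [1, 1, 1, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 0, 0, 0, 0, 1]
toy_scores = binary_metrics(toy_true, toy_pred)
```

Because the dataset described here has roughly twice as many non-alkaliphilic as alkaliphilic proteins, the Matthews correlation coefficient is a useful complement to accuracy, which can look high even for a classifier that favors the majority class.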

Keywords: Alkaliphilic protein; Classification; Embeddings; Protein disorder.

MeSH terms

Animals
Camelids, New World*
Language
Proteins

Substances

Proteins