In silico design of smaller size enzymatic protein by generative artificial intelligence (ProtGPT2)

J Biosci Bioeng. 2025 Sep;140(3):174-179. doi: 10.1016/j.jbiosc.2025.06.009. Epub 2025 Jul 10.

Abstract

Constructing small proteins by removing amino acid subsequences that are not involved in function, activity, or structure is crucial for bioprocessing and drug development. Traditional design methods often focus on reconstructing functional motifs, but they face challenges in stabilizing structure and reproducing function. In this study, we aimed to develop a design method for small proteins using ProtGPT2, a model that generates protein sequences based on function and structure. First, amino acid sequence data for malate dehydrogenase (MDH) were collected and used to fine-tune ProtGPT2 (ProtGPT2 for MDH). Evaluation of the chain length and perplexity (ppl) of the generated sequences showed that the model produced sequences shorter than the natural ones. The validity of the generated sequences was assessed using both population and individual analyses. Population analysis, including multiple sequence alignment (MSA) and t-distributed stochastic neighbor embedding (tSNE), revealed that ProtGPT2 for MDH identified functional motifs of MDH and incorporated them into the generated sequences. Additionally, tSNE showed that the generated sequences were highly similar to natural MDH sequences. In individual analysis, 10 randomly selected sequences were evaluated using BLAST, AlphaFold2, and InterPro. BLAST indicated that 9 of the 10 sequences were novel MDH variants. AlphaFold2 confirmed that their 3D structures were highly similar to known MDH structures. InterPro identified domains and active sites in 2 of the sequences, suggesting that they were novel, small MDH variants. In conclusion, ProtGPT2 for MDH has the potential to design amino acid sequence candidates for small MDHs. The validity and utility of the model will be established through future experimental efforts.
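The abstract evaluates generated sequences by chain length and perplexity without giving the formulas or any cutoffs. As an illustration only, the sketch below uses the standard definition of perplexity (the exponential of the negative mean per-token log-probability) and shows one way candidates could be screened on both criteria. The function names, the `max_length` and `max_ppl` thresholds, and the toy log-probabilities are all hypothetical and are not taken from the study; in practice the log-probabilities would come from the fine-tuned language model.

```python
import math


def perplexity(token_log_probs):
    """Return exp(-mean natural-log probability) over the tokens of one sequence."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def select_candidates(scored, max_length, max_ppl):
    """Keep sequences that are both short and low-perplexity.

    `scored` is an iterable of (sequence, token_log_probs) pairs.
    `max_length` and `max_ppl` are hypothetical cutoffs chosen by the user.
    Returns (sequence, perplexity) pairs sorted from lowest to highest perplexity.
    """
    kept = []
    for seq, log_probs in scored:
        ppl = perplexity(log_probs)
        if len(seq) <= max_length and ppl <= max_ppl:
            kept.append((seq, ppl))
    return sorted(kept, key=lambda item: item[1])


# Toy input: made-up sequences and made-up log-probabilities.
toy = [
    ("MKV", [-1.0, -1.0, -1.0]),   # short, moderate perplexity
    ("MKVAAAA", [0.0] * 7),        # low perplexity but too long
    ("MK", [-5.0, -5.0]),          # short but high perplexity
]
candidates = select_candidates(toy, max_length=5, max_ppl=10.0)
```

Lower perplexity means the model assigns the sequence higher average probability; it is a plausibility score under the model, not evidence of enzymatic activity, which is why the abstract pairs it with alignment, structure prediction, and domain annotation.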

Keywords: Amino acid sequence design; Generative artificial intelligence; Malate dehydrogenase; ProtGPT2; Protein language model; Smaller size protein.

MeSH terms

  • Amino Acid Motifs
  • Amino Acid Sequence
  • Artificial Intelligence*
  • Computer Simulation
  • Generative Artificial Intelligence
  • Malate Dehydrogenase* / chemistry
  • Malate Dehydrogenase* / genetics
  • Models, Molecular
  • Protein Engineering* / methods
  • Sequence Alignment

Substances

  • Malate Dehydrogenase