Deep Learning to Predict the Biosynthetic Gene Clusters in Bacterial Genomes

J Mol Biol. 2022 Aug 15;434(15):167597. doi: 10.1016/j.jmb.2022.167597. Epub 2022 May 6.

Abstract

Biosynthetic gene clusters (BGCs) in bacterial genomes code for important small molecules and secondary metabolites. Using validated BGCs and the corresponding sequences of protein family domains (Pfams), together with Pfam function and clan information, we develop e-DeepBGC, a deep learning method that extends DeepBGC to detect BGCs and their biosynthetic classes in bacterial genomes. We show that e-DeepBGC yields lower false positive rates and higher sensitivity in BGC identification than DeepBGC. We apply e-DeepBGC to 5,666 RefSeq bacterial genomes and detect a total of 170,685 BGCs, an average of 30.1 BGCs per genome. We summarize all the predicted BGCs, their functional classes, and their distribution across bacterial phyla.
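
The per-genome average quoted above follows directly from the two reported totals. A minimal arithmetic check, where the function name `mean_bgcs_per_genome` is a hypothetical helper introduced here for illustration and is not part of e-DeepBGC or DeepBGC:

```python
def mean_bgcs_per_genome(total_bgcs: int, n_genomes: int) -> float:
    """Return the mean number of predicted BGCs per genome."""
    if n_genomes <= 0:
        raise ValueError("n_genomes must be positive")
    return total_bgcs / n_genomes


# Totals as reported in the abstract.
TOTAL_BGCS = 170_685
N_GENOMES = 5_666

if __name__ == "__main__":
    mean = mean_bgcs_per_genome(TOTAL_BGCS, N_GENOMES)
    print(f"{mean:.1f} BGCs per genome")
```

Rounded to one decimal place, the result agrees with the 30.1 figure stated in the abstract.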

Keywords: data augmentation; functional microbiome; long short-term memory RNN; protein family domains.

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Bacteria* / genetics
  • Bacteria* / metabolism
  • Biosynthetic Pathways* / genetics
  • Deep Learning*
  • Genome, Bacterial* / genetics
  • Multigene Family* / genetics