Microorganisms produce secondary metabolites with a remarkable range of bioactive properties. The steadily growing volume of published genomic data makes it possible to identify biosynthetic gene clusters efficiently by genome mining. Conversely, for many natural products with resolved structures, the encoding biosynthetic gene clusters have not yet been identified. Among secondary metabolites, the scaffolds of nonribosomal peptides and type I modular polyketides can be predicted because of their building block-like assembly. SeMPI v2 provides a comprehensive prediction pipeline that includes screening the predicted scaffold against publicly available natural compound databases. The screening algorithm was designed to detect homologous structures even for partial, incomplete clusters. The pipeline allows gene clusters to be linked to known natural products and therefore also provides a metric for estimating the novelty of a cluster when no matching scaffold is found. Whereas currently available tools attempt to provide comprehensive information about a wide range of gene clusters, SeMPI v2 focuses on precise predictions. To that end, the cluster detection algorithm, including building block generation and domain substrate prediction, was thoroughly refined and benchmarked to provide high-quality scaffold predictions. In a benchmark based on 559 gene clusters, SeMPI v2 achieved results comparable to or better than those of antiSMASH v5. Additionally, the SeMPI v2 web server offers features that help to further investigate a submitted gene cluster, such as an integrated genome browser and a workbench in which a predicted scaffold can be modified before the database screening.
Keywords: machine learning; natural compounds; nonribosomal peptides; polyketides; secondary metabolites.