Nucleic Acids Res. 2021 Jan 8;49(D1):D490-D497.
doi: 10.1093/nar/gkaa812.

BiG-FAM: the biosynthetic gene cluster families database


Satria A Kautsar et al. Nucleic Acids Res. 2021.

Abstract

Computational analysis of biosynthetic gene clusters (BGCs) has revolutionized natural product discovery by enabling the rapid investigation of secondary metabolic potential within microbial genome sequences. Grouping homologous BGCs into Gene Cluster Families (GCFs) facilitates mapping their architectural and taxonomic diversity and provides insights into the novelty of putative BGCs, through dereplication with BGCs of known function. While multiple databases exist for exploring BGCs from publicly available data, no public resources exist that focus on GCF relationships. Here, we present BiG-FAM, a database of 29,955 GCFs capturing the global diversity of 1,225,071 BGCs predicted from 209,206 publicly available microbial genomes and metagenome-assembled genomes (MAGs). The database offers rich functionalities, such as multi-criterion GCF searches, direct links to BGC databases such as antiSMASH-DB, and rapid GCF annotation of user-supplied BGCs from antiSMASH results. BiG-FAM can be accessed online at https://bigfam.bioinformatics.nl.


Figures

Graphical Abstract
Constructed based on a large-scale homology analysis of 1.2 million biosynthetic gene clusters, BiG-FAM provides a platform to explore their genomic diversity and to discover their relationships to newly sequenced ones.
Figure 1.
(A) Pie chart depicting the ratio of five generic BGC classes within the full dataset across three different microbial kingdoms (a total of 1,224,563 BGCs, excluding 16 plant BGCs from MIBiG and 492 BGCs with unassigned taxonomy). (B) Taxa covered by BGCs in BiG-FAM (total number of unique taxa represented by at least one BGC-containing genome per taxonomy level), with the total number of BGCs per kingdom provided in the far-right column of the table.
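The panel A total follows directly from the dataset size given in the abstract. A quick check in Python, using only the counts quoted in the abstract and in this caption (the variable names are illustrative):

```python
# Counts quoted in the abstract and in the Figure 1 caption.
total_bgcs = 1_225_071          # all BGCs in BiG-FAM
plant_mibig_bgcs = 16           # plant BGCs from MIBiG, excluded from panel A
unassigned_taxonomy_bgcs = 492  # BGCs with unassigned taxonomy, excluded

panel_a_total = total_bgcs - plant_mibig_bgcs - unassigned_taxonomy_bgcs
print(f"{panel_a_total:,}")  # 1,224,563
```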
Figure 2.
Workflow schema of BiG-FAM’s architecture. Starting from a collection of ∼1.2 million BGCs, BiG-SLiCE was used to perform a clustering analysis (with threshold parameter T = 900), resulting in (A) 29,955 GCFs stored in an SQLite3 database file. This file is used as the ‘Core database’ for BiG-FAM. To support BiG-FAM’s extensive functionalities, three related database files were created, each managed by a specific module in the software package. (B) The ‘Precalculated database’ stores the results of complex SQL operations (i.e., calculation of taxonomy counts per GCF) to speed up page loads (detailed schema and procedures can be accessed from the ‘precalculation’ module in BiG-FAM’s source code). (C) The ‘Queries database’ stores information related to user-submitted BGC queries, such as processed features from antiSMASH BGCs and the corresponding list of best-matched GCFs identified using BiG-SLiCE. (D) Finally, the ‘Linkage database’ keeps track of the cross-links to external databases (i.e., MIBiG and antiSMASH-DB), storing information such as the accession number of each linked BGC, which can be used to generate URLs pointing to the corresponding entry in each database. These modules and databases serve an online database written in Python using the Flask web framework.
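The idea behind the ‘Precalculated database’ in panel B can be illustrated with a minimal sketch using Python’s built-in `sqlite3` module. The table and column names (`bgc`, `gcf_id`, `genus`, `gcf_taxon_count`) and the toy rows are assumptions made for illustration and do not reflect BiG-FAM’s actual schema. For brevity, the summary table also lives in the same in-memory database rather than in a separate file.

```python
import sqlite3

# Hypothetical, simplified stand-in for the core database: one row per BGC
# with its GCF assignment and genus. The real BiG-FAM schema differs.
core = sqlite3.connect(":memory:")
core.executescript("""
    CREATE TABLE bgc (id INTEGER PRIMARY KEY, gcf_id INTEGER, genus TEXT);
    INSERT INTO bgc (gcf_id, genus) VALUES
        (1, 'Streptomyces'), (1, 'Streptomyces'), (1, 'Nocardia'),
        (2, 'Bacillus');
""")

def precalculate_taxon_counts(conn):
    """Materialise per-GCF genus counts into a summary table (run once)."""
    conn.executescript("""
        DROP TABLE IF EXISTS gcf_taxon_count;
        CREATE TABLE gcf_taxon_count AS
            SELECT gcf_id, genus, COUNT(*) AS bgc_count
            FROM bgc
            GROUP BY gcf_id, genus;
        CREATE INDEX idx_gcf_taxon_count ON gcf_taxon_count (gcf_id);
    """)

def taxon_counts(conn, gcf_id):
    """Cheap page-load lookup against the precalculated table."""
    rows = conn.execute(
        "SELECT genus, bgc_count FROM gcf_taxon_count "
        "WHERE gcf_id = ? ORDER BY bgc_count DESC, genus",
        (gcf_id,),
    )
    return rows.fetchall()

precalculate_taxon_counts(core)
print(taxon_counts(core, 1))  # [('Streptomyces', 2), ('Nocardia', 1)]
```

The aggregation over all BGCs runs once at build time, so a page request only reads a small, indexed summary table instead of scanning and grouping the full BGC table.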
Figure 3.
(A) By clicking on the ‘GCF’ page link (box 1) from the main menu, users are provided with an interface to search GCFs based on multiple criteria; in this case, we search for ‘bacterial GCFs harboring AS-TIGR03973 and Radical_SAM biosynthetic domains in at least ∼80% of their BGCs’ (box 2). (B) After applying the filter function (box 3), BiG-FAM returns a list of 79 GCFs satisfying the criteria. (C) Clicking on the ‘view’ button of a GCF (box 4) takes users to a detail page that shows several statistics related to the GCF’s taxonomic distribution, the length of its BGCs, and the distribution of its features (domains). (D) In the GCF detail page, users may also choose to view an ‘arrower’ visualization of the BGCs (box 5), which in this case shows the occurrence of neighboring biosynthetic genes (depicted as colored arrows) flanking the queried cysteine-rich precursor + rSAM gene pairs (blue boxes).
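The domain criterion in box 2 can be expressed as a small filter. The sketch below is not BiG-FAM’s implementation: the function name, the input layout (a dict mapping GCF IDs to lists of per-BGC domain sets) and the toy data are assumptions. It also assumes that each domain is checked independently against the 80% cutoff, and it omits the taxonomy (‘bacterial’) criterion.

```python
def gcfs_with_domains(gcfs, required_domains, min_fraction=0.8):
    """Return IDs of GCFs in which every required domain occurs in at least
    `min_fraction` of the member BGCs.

    `gcfs` maps a GCF ID to a list of BGCs, each given as a set of domain names.
    """
    hits = []
    for gcf_id, bgcs in gcfs.items():
        if not bgcs:
            continue  # an empty GCF cannot satisfy any domain criterion
        if all(
            sum(domain in bgc for bgc in bgcs) / len(bgcs) >= min_fraction
            for domain in required_domains
        ):
            hits.append(gcf_id)
    return hits

# Toy data (hypothetical GCF IDs): GCF_A carries both domains in >= 80% of its
# BGCs, while GCF_B carries AS-TIGR03973 in only half of them.
toy_gcfs = {
    "GCF_A": [{"AS-TIGR03973", "Radical_SAM"}] * 4 + [{"Radical_SAM"}],
    "GCF_B": [{"AS-TIGR03973", "Radical_SAM"}, {"Radical_SAM"}],
}
print(gcfs_with_domains(toy_gcfs, ["AS-TIGR03973", "Radical_SAM"]))  # ['GCF_A']
```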
Figure 4.
(A) When users click on the ‘Query’ section of the main menu (box 1), they are presented with a form to input the job ID of a finished antiSMASH run. After the user presses ‘Submit’, BiG-FAM immediately executes (or queues) the downloading, preprocessing and GCF matching of all BGCs (i.e. regions) included in the submitted run. (B) A list is then shown summarizing the best BGC-to-GCF pairings, with those at a distance lower than 900 (the original threshold value) highlighted in green to indicate a good match to at least one GCF in the database. A particular query BGC, ‘Region 15.1’, was selected for a detailed look (box 3), as mentioned in the main text. (C) A list of the five best-matching GCFs and their model distances to the query BGC, showing an exact match (d = 0) to a singleton GCF from Streptomyces (GCF_24649, box 4), which turned out to be the same BGC from the same genome. Looking at the visualization of the second closest GCF on the list (GCF_06303 with d = 1609, box 5), we can see (D) co-occurrence of protein domains across the distantly related BGCs, where some similar but non-identical PKS genes (the longest multi-domain gene in each GCF) seem to act as an ‘anchor’ that defines the GCF. While this group of anchor genes has a similar domain architecture to the PKS gene of the queried BGC (box 6), a quick BLASTp analysis against one example gene (box 7) shows only 52.63% amino acid identity (Supplementary Text 1). Along with the differences in non-PKS genes between the query BGC and the gene clusters in the GCF, this suggests that, while the BGC is (distantly) related to this GCF, it does not actually belong to it and constitutes a novel gene cluster architecture.
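The ranking and green highlighting described in panels B and C amount to sorting GCFs by distance and comparing each distance with the threshold. The sketch below assumes the distances have already been computed (BiG-SLiCE does this in the real workflow). The function name and the third GCF entry are hypothetical; the distances for GCF_24649 and GCF_06303 are taken from the caption.

```python
def rank_gcf_matches(distances, threshold=900, top_n=5):
    """Return the `top_n` closest GCFs as (gcf_id, distance, is_good_match).

    A match counts as good when its distance is strictly lower than
    `threshold`, following the caption's wording ("lower than 900").
    """
    ranked = sorted(distances.items(), key=lambda item: item[1])[:top_n]
    return [(gcf_id, d, d < threshold) for gcf_id, d in ranked]

example = {
    "GCF_06303": 1609,         # second-closest GCF in the caption
    "GCF_24649": 0,            # exact match in the caption
    "GCF_hypothetical": 2100,  # made-up entry for illustration only
}
for gcf_id, d, good in rank_gcf_matches(example):
    print(gcf_id, d, "good match" if good else "distant")
```

Under this rule, only GCF_24649 would be highlighted; GCF_06303 appears in the list but exceeds the threshold, consistent with the caption’s reading that the query is only distantly related to it.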

