Exploring microbial functional biodiversity at the protein family level: From metagenomic sequence reads to annotated protein clusters

Front Bioinform. 2023 Mar 3;3:1157956. doi: 10.3389/fbinf.2023.1157956. eCollection 2023.


Metagenomics has enabled access to the genetic repertoire of natural microbial communities. Shotgun metagenome sequencing has become the method of choice for studying and classifying microorganisms from various environments. To this end, several methods have been developed to process and analyze sequence data, from raw reads to end products such as predicted protein sequences or protein families. In this article, we provide a thorough review that simplifies these processes and discusses the alternative methodologies that can be followed to explore biodiversity at the protein family level. We describe the available analysis tools and comment on their scalability, advantages, and disadvantages. Finally, we report the available data repositories and recommend approaches for protein family annotation related to phylogenetic distribution, structure prediction, and metadata enrichment.

Keywords: biodiversity; cluster annotation; metagenomes; metatranscriptomes; microbial dark matter; protein clustering; protein families.

Publication types

  • Review

Grants and funding

GP, FB, and EK were supported by HFRI (first call of research projects to support faculty members and researchers, Grant: HFRI-FM17-1855-BOLOGNA) and the Fondation Santé. EP was supported by the Hellenic Foundation for Research and Innovation (H.F.R.I.) under the “second Call for H.F.R.I. Research Projects to support Faculty Members and Researchers” (Project Number: 2772). DP-E and NK were supported by the U.S. Department of Energy Joint Genome Institute (https://ror.org/04xm1d337), a DOE Office of Science User Facility, supported by the Office of Science of the U.S. Department of Energy operated under Contract No. DE-AC02-05CH11231. FB was also supported by Fondation Santé.