MMseqs software suite for fast and deep clustering and searching of large protein sequence sets
- PMID: 26743509
- DOI: 10.1093/bioinformatics/btw006
Abstract
Motivation: Sequence databases are growing fast, challenging existing analysis pipelines. Reducing the redundancy of sequence databases by similarity clustering improves speed and sensitivity of iterative searches. But existing tools cannot efficiently cluster databases of the size of UniProt to 50% maximum pairwise sequence identity or below. Furthermore, in metagenomics experiments typically large fractions of reads cannot be matched to any known sequence anymore because searching with sensitive but relatively slow tools (e.g. BLAST or HMMER3) through comprehensive databases such as UniProt is becoming too costly.
Results: MMseqs (Many-against-Many sequence searching) is a software suite for fast and deep clustering and searching of large datasets, such as UniProt, or 6-frame translated metagenomics sequencing reads. MMseqs contains three core modules: a fast and sensitive prefiltering module that sums up the scores of similar k-mers between query and target sequences, an SSE2- and multi-core-parallelized local alignment module, and a clustering module. In our homology detection benchmarks, MMseqs is much more sensitive and 4-30 times faster than UBLAST and RAPsearch, respectively, although it does not reach BLAST sensitivity yet. Using its cascaded clustering workflow, MMseqs can cluster large databases down to ∼30% sequence identity at hundreds of times the speed of BLASTclust and much deeper than CD-HIT and USEARCH. MMseqs can also update a database clustering in linear instead of quadratic time. Its much improved sensitivity-speed trade-off should make MMseqs attractive for a wide range of large-scale sequence analysis tasks.
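The prefiltering idea described above, summing the scores of similar k-mers shared between a query and each target, can be illustrated with a small sketch. This is not the MMseqs implementation: the function names, the toy match/mismatch scoring (a stand-in for a real substitution matrix), and the threshold parameter `min_kmer_score` are all assumptions made for illustration, and the brute-force comparison against every indexed k-mer ignores the optimizations a real tool would need.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return all overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_score(a, b, match=2, mismatch=-1):
    """Toy k-mer similarity (assumption): +2 per identical position, -1 otherwise.

    A real tool would use an amino acid substitution matrix instead.
    """
    return sum(match if x == y else mismatch for x, y in zip(a, b))


def build_index(targets, k):
    """Map each k-mer to the set of target names that contain it."""
    index = defaultdict(set)
    for name, seq in targets.items():
        for km in kmers(seq, k):
            index[km].add(name)
    return index


def prefilter_scores(query, index, k, min_kmer_score):
    """Sum the scores of similar k-mer pairs for every target.

    A query k-mer and a target k-mer count as similar when their toy
    score reaches min_kmer_score (hypothetical threshold). A k-mer that
    occurs several times in one target is counted once per query k-mer.
    """
    totals = defaultdict(int)
    for q in kmers(query, k):
        for t_kmer, names in index.items():
            s = kmer_score(q, t_kmer)
            if s >= min_kmer_score:
                for name in names:
                    totals[name] += s
    return dict(totals)


if __name__ == "__main__":
    targets = {"t1": "MKVLA", "t2": "GGGGG"}  # made-up sequences
    index = build_index(targets, k=3)
    print(prefilter_scores("MKVLS", index, k=3, min_kmer_score=3))
```

In this toy run only `t1` receives a score: two of the query's k-mers match it exactly and a third differs at one position, while `t2` shares no similar k-mer and is dropped. Targets with a high summed score would then be passed on to the slower alignment step.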
Availability and implementation: MMseqs is open-source software available under GPL at https://github.com/soedinglab/MMseqs
Contact: martin.steinegger@mpibpc.mpg.de, soeding@mpibpc.mpg.de
Supplementary information: Supplementary data are available at Bioinformatics online.
© The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
