Nat Commun. 2018 Jun 29;9(1):2542.
doi: 10.1038/s41467-018-04964-5.

Clustering Huge Protein Sequence Sets in Linear Time

Free PMC article

Martin Steinegger et al. Nat Commun. 2018.

Abstract

Metagenomic datasets contain billions of protein sequences that could greatly enhance large-scale functional annotation and structure prediction. Utilizing this enormous resource would require reducing its redundancy by similarity clustering. However, clustering hundreds of millions of sequences is impractical using current algorithms because their runtimes scale as the input set size N times the number of clusters K, which is typically of similar order as N, resulting in runtimes that increase almost quadratically with N. We developed Linclust, the first clustering algorithm whose runtime scales as N, independent of K. It can also cluster datasets several times larger than the available main memory. We cluster 1.6 billion metagenomic sequence fragments in 10 h on a single server to 50% sequence identity, >1000 times faster than has been possible before. Linclust will help to unlock the great wealth contained in metagenomic and genomic sequence databases.
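
The scaling argument in the abstract can be illustrated with a little arithmetic. The sketch below only counts upper bounds on pairwise comparisons; it is not a runtime model, and existing tools use prefilters that avoid many of the N times K comparisons. The function names and the choice K = N/4 are hypothetical, picked only to represent "K of similar order as N"; m = 20 is the default number of k-mers per sequence given in the Fig. 5 caption.

```python
def comparisons_n_times_k(n: int, k: int) -> int:
    """Upper bound when every sequence is compared with every cluster representative."""
    return n * k


def comparisons_linclust_bound(n: int, m: int = 20) -> int:
    """Upper bound stated for Linclust: at most m comparisons per sequence."""
    return m * n


if __name__ == "__main__":
    n = 1_600_000_000   # number of sequence fragments clustered in the abstract
    k = n // 4          # hypothetical number of clusters, NOT a figure from the paper
    quadratic = comparisons_n_times_k(n, k)
    linear = comparisons_linclust_bound(n)
    # The ratio equals k / m, so it keeps growing with the data set size.
    print(f"N*K bound: {quadratic:.2e}, m*N bound: {linear:.2e}, ratio: {quadratic // linear}")
```

Because the ratio of the two bounds is K/m, the gap widens as the input grows, which is the point of a runtime that is independent of K.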

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
Overview of linear-time clustering algorithm. (1) For each sequence Linclust selects m k-mers (with the lowest hash function values). It sorts the k-mers alphabetically in quasi-linear time to find the groups of sequences sharing a k-mer (colored sets) and (2) it selects the longest sequence per k-mer group as center. (3,4) It compares each sequence (in three successively slower and more sensitive steps) only with the center sequences it shares a k-mer with, not with all sequences it shares a k-mer with. It therefore needs to compute at most m comparisons per sequence and mN in total. Pairs that pass the clustering criteria are linked by an edge. (5) The sequences are clustered in time O(mN) using a greedy incremental algorithm that finds clusters whose members all have an edge with a representative sequence. For more details, see Fig. 5.
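
Steps 1 and 2 of this caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Linclust's implementation: the MD5-based hash, the function names, and the toy parameters are all hypothetical, and the reduced alphabet and k-mer positions described in Fig. 5 are omitted.

```python
import hashlib
from collections import defaultdict


def kmer_hash(kmer: str) -> int:
    # Stable stand-in hash (assumption); Linclust uses its own hash function.
    return int.from_bytes(hashlib.md5(kmer.encode()).digest()[:8], "big")


def select_kmers(seq: str, k: int, m: int) -> list:
    """Step 1: keep the m distinct k-mers with the lowest hash values."""
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return sorted(kmers, key=kmer_hash)[:m]


def kmer_groups(seqs: dict, k: int, m: int) -> dict:
    """Step 2a: sort the (k-mer, sequence id) table and group it by k-mer."""
    table = [(kmer, sid) for sid, seq in seqs.items()
             for kmer in select_kmers(seq, k, m)]
    table.sort()
    groups = defaultdict(list)
    for kmer, sid in table:
        groups[kmer].append(sid)
    return groups


def candidate_pairs(seqs: dict, k: int, m: int) -> set:
    """Step 2b: choose the longest sequence per group as center and pair
    every other group member with that center only."""
    pairs = set()
    for members in kmer_groups(seqs, k, m).values():
        center = max(members, key=lambda sid: (len(seqs[sid]), sid))
        pairs.update((center, sid) for sid in members if sid != center)
    return pairs


if __name__ == "__main__":
    toy = {"a": "MKVLAAGIV", "b": "MKVLAAG", "c": "WWWWWW"}  # made-up sequences
    print(candidate_pairs(toy, k=4, m=20))
```

Each sequence contributes at most m k-mers and so appears in at most m groups, which is why the number of (center, member) pairs cannot exceed mN.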
Fig. 2
Linclust and Linclust/MMseqs2 manifest unique linear scaling of runtime with sequence set size. a Runtime versus input set size on linear scales. The plotting symbols indicate the sequence identity threshold for clustering of 90%, 70%, and 50%. The curves are fits with a power law, bN^a. For comparison, we include runtimes of all-against-all searches using sequence search tools DIAMOND, RAPsearch2, and MASH. Runtimes were measured on a server with two Intel Xeon E5-2640v3 8-core CPUs and 128 GB RAM. b Same as (a) but on log-log scales. c Average number of sequences per cluster at 90%, 70%, and 50% sequence identity. Larger average cluster sizes imply higher sensitivities to detect similar sequences.
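
A power law t = bN^a becomes a straight line on log-log scales, log t = log b + a log N, so the exponent a can be estimated by ordinary least squares on the logarithms. The sketch below shows that approach; it is an assumption about how such a fit could be done, since the caption does not say which fitting procedure the authors used, and the data in the usage example are synthetic.

```python
import math


def fit_power_law(sizes, runtimes):
    """Least-squares fit of log(t) = log(b) + a * log(N); returns (a, b)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in runtimes]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx                         # slope on log-log scales = exponent
    b = math.exp(mean_y - a * mean_x)     # intercept gives the prefactor
    return a, b


if __name__ == "__main__":
    # Synthetic runtimes generated from t = 2 * N, NOT measurements from the paper.
    sizes = [1e3, 1e4, 1e5, 1e6]
    runtimes = [2.0 * n for n in sizes]
    a, b = fit_power_law(sizes, runtimes)
    print(f"exponent a = {a:.3f}, prefactor b = {b:.3f}")
```

An exponent close to 1 corresponds to linear scaling, while an exponent approaching 2 corresponds to the near-quadratic behaviour described in the abstract.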
Fig. 3
Cumulative distance distribution between representative sequences. We clustered the test set of 123 million sequences at three different sequence identity thresholds (a–c at 50%, 70%, and 90%, respectively). For each clustering, we randomly sampled 1000 representative cluster sequences, compared them to all representative sequences of the clustering, and plotted the fraction whose best match (excluding self-matches) with minimum sequence coverage of 90% had a sequence identity above the x-value. The y-value at the clustering threshold (dashed line) is the fraction of false negatives, pairs of sequences whose similarity was overlooked by the clustering method.
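
The quantity plotted in this figure can be written as a one-line computation. The sketch below assumes the best-match identities of the sampled representatives are already available as a list, with None for representatives that have no match at the required coverage; the function name and the toy values are hypothetical.

```python
def fraction_above(best_identities, x):
    """Fraction of sampled representatives whose best non-self match
    has a sequence identity above x. None means no qualifying match."""
    hits = sum(1 for v in best_identities if v is not None and v > x)
    return hits / len(best_identities)


if __name__ == "__main__":
    # Toy identities for five sampled representatives (made-up numbers).
    sample = [0.30, 0.40, 0.55, 0.90, None]
    threshold = 0.50  # clustering threshold, e.g. 50% sequence identity
    # Evaluated at the threshold, this is the false-negative fraction.
    print(fraction_above(sample, threshold))
```

Evaluating the function over a range of x values traces the cumulative curve; its value at the clustering threshold is the fraction of representatives that should have been merged with another representative but were not.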
Fig. 4
Cluster consistency of GO molecular function and Pfam annotations. a Cluster annotation consistency of GO functional annotations inferred from experiments (EXP_F). “Mean” and “worst” refer to the mean and the minimum annotation similarity between each representative sequence and all other cluster members. Plotting symbols indicate the sequence identity threshold for clustering. CD-HIT was only run at 90% sequence identity due to runtime constraints. Linclust-m80 was only run at 50% sequence identity. b Same as (a) but using manually and computationally assigned functional GO annotations. c Consistency of Pfam annotation from the representative sequences to the cluster members.
Fig. 5
Linear-time clustering algorithm. Steps 1 and 2 find exact k-mer matches between the N input sequences that are extended in steps 3 and 4. (1) Linclust selects in each sequence the m (default: 20) k-mers with the lowest hash function values, as this tends to select the same k-mers across homologous sequences. It uses a reduced alphabet of 13 letters for the k-mers and sets k between 10 and 14 depending on the sequence set size and the sequence identity threshold. It generates a table in which each of the mN lines consists of the k-mer, the sequence identifier, and the position of the k-mer in the sequence. (2) Linclust sorts the table by k-mer in quasi-linear time, which identifies groups of sequences sharing the same k-mer (large shaded boxes). For each k-mer group, it selects the longest sequence as center. It thereby tends to select the same sequences as center among groups sharing sequences. (3) It merges k-mer groups with the same center sequence together (1: red + cyan and 5: orange + blue) and compares each group member to the center sequence in two steps: by global Hamming distance and by gapless local alignment extending the k-mer match. (4) Sequences above a score cut-off in step 3 are aligned to their center sequence using gapped local sequence alignment. Sequence pairs that satisfy the clustering criteria (e.g., on the E-value, sequence similarity, and sequence coverage) are linked by an edge. (5) The greedy incremental algorithm finds a clustering such that each input sequence has an edge to its cluster’s representative sequence. Note that the number of sequence pairs compared in steps 3 and 4 is less than mN, resulting in a linear time complexity.
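
Step 5 can be sketched as a greedy pass over the edge list produced in step 4. The version below is a hedged illustration rather than the authors' code: it assumes sequences are visited from longest to shortest, which the caption does not state, and the function name and toy data are hypothetical.

```python
def greedy_incremental(lengths: dict, edges: list) -> dict:
    """Map every sequence id to the id of its cluster representative.

    lengths: sequence id -> sequence length
    edges:   (id, id) pairs that passed the clustering criteria
    """
    neighbours = {sid: set() for sid in lengths}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)

    representative = {}
    # Assumed visiting order: longest first, ties broken by id.
    for sid in sorted(lengths, key=lambda s: (-lengths[s], s)):
        if sid in representative:
            continue                      # already a member of an earlier cluster
        representative[sid] = sid         # sid opens a new cluster
        for nb in neighbours[sid]:
            if nb not in representative:
                representative[nb] = sid  # direct edge to its representative
    return representative


if __name__ == "__main__":
    toy_lengths = {"a": 9, "b": 7, "c": 6, "d": 5}   # made-up lengths
    toy_edges = [("a", "b"), ("b", "c")]             # made-up edges
    print(greedy_incremental(toy_lengths, toy_edges))
```

In the toy example, "c" is linked to "b" but not to the representative "a", so it opens its own cluster; this is what guarantees that every member has an edge to its representative. Apart from the initial sort in this sketch, the pass touches each edge a constant number of times, so its cost is linear in the number of edges, which is bounded by mN.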
