Assigning amplicon sequences to operational taxonomic units (OTUs) is an important step in characterizing microbial communities across large data sets. A notable difference between de novo clustering and database-dependent reference clustering methods is that OTU assignments from de novo methods may change when new sequences are added. However, one may wish to incorporate new samples into previously clustered data sets without clustering all sequences again, such as when comparing across data sets or deploying machine learning models. Existing reference-based methods produce consistent OTUs but only consider the similarity of each query sequence to a single reference sequence in an OTU, resulting in assignments of lower quality than those generated by de novo methods. To provide an efficient method to fit sequences to existing OTUs, we developed the OptiFit algorithm. Inspired by the de novo OptiClust algorithm, OptiFit considers the similarity of all pairs of reference and query sequences to produce OTUs of the best possible quality. We tested OptiFit using four data sets with two strategies: (i) clustering to a reference database and (ii) splitting the data set into a reference and query set, clustering the references using OptiClust, and then clustering the queries to the references. The result is an improved implementation of reference-based clustering. When using the split data set strategy, OptiFit produces OTUs of a quality similar to that of OptiClust at faster speeds. OptiFit provides a suitable option for users requiring consistent OTU assignments at the same quality as afforded by de novo clustering methods.
IMPORTANCE Advancements in DNA sequencing technology have allowed researchers to affordably generate millions of sequence reads from microorganisms in diverse environments. Efficient and robust software tools are needed to assign microbial sequences into taxonomic groups for characterization and comparison of communities. The OptiClust algorithm produces high-quality groups by comparing sequences to each other, but the assignments can change when new sequences are added to a data set, making it difficult to compare different studies. Other approaches assign sequences to groups by comparing them to sequences in a reference database to produce consistent assignments, but the resulting groups are of lower quality than those produced by OptiClust. We developed OptiFit, a new reference-based algorithm that produces consistent yet high-quality assignments like OptiClust. OptiFit allows researchers to compare microbial communities across different studies or add new data to existing studies without sacrificing the quality of the group assignments.
Keywords: 16S rRNA gene; bioinformatics; clustering; metagenomics; microbial ecology; microbiome.