Microbiome. 2018 Aug 9;6(1):140.
doi: 10.1186/s40168-018-0521-5.

IDTAXA: a novel approach for accurate taxonomic classification of microbiome sequences

Adithya Murali et al. Microbiome. 2018.

Abstract

Background: Microbiome studies often involve sequencing a marker gene to identify the microorganisms in samples of interest. Sequence classification is a critical component of this process, whereby sequences are assigned to a reference taxonomy containing known sequence representatives of many microbial groups. Previous studies have shown that existing classification programs often assign sequences to reference groups even if they belong to novel taxonomic groups that are absent from the reference taxonomy. This high rate of "over classification" is particularly detrimental in microbiome studies because reference taxonomies are far from comprehensive.

Results: Here, we introduce IDTAXA, a novel approach to taxonomic classification that employs principles from machine learning to reduce over classification errors. Using multiple reference taxonomies, we demonstrate that IDTAXA has higher accuracy than popular classifiers such as BLAST, MAPSeq, QIIME, SINTAX, SPINGO, and the RDP Classifier. Similarly, IDTAXA yields far fewer over classifications on Illumina mock microbial community data when the expected taxa are absent from the training set. Furthermore, IDTAXA offers many practical advantages over other classifiers, such as maintaining low error rates across varying input sequence lengths and withholding classifications from input sequences composed of random nucleotides or repeats.

Conclusions: IDTAXA's classifications may lead to different conclusions in microbiome studies because of the substantially reduced number of taxa that are incorrectly identified through over classification. Although misclassification error is relatively minor, we believe that many remaining misclassifications are likely caused by errors in the reference taxonomy. We describe how IDTAXA is able to identify many putative mislabeling errors in reference taxonomies, enabling training sets to be automatically corrected by eliminating spurious sequences. IDTAXA is part of the DECIPHER package for the R programming language, available through the Bioconductor repository or accessible online (http://DECIPHER.codes).

Keywords: 16S rRNA gene sequencing; Classification; ITS sequencing; Microbiome; Reference taxonomy; Taxonomic assignment.
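
The distinction the abstract draws between over classification (OC) and misclassification (MC) can be made concrete with a small calculation. The sketch below is illustrative only and is not the authors' evaluation code: the `Prediction` record, the `error_rates` function, and the choice of denominators (OC over sequences whose true group is absent from the training set, MC over sequences whose true group is present) are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Prediction:
    """One classifier output for a held-out sequence (hypothetical record)."""
    true_group: str
    predicted_group: str
    confidence: float   # percent, 0 to 100
    classifiable: bool  # True if the true group is still in the training set


def error_rates(preds: List[Prediction], threshold: float) -> Dict[str, float]:
    """Compute OC and MC error rates at a given confidence threshold.

    Assumed definitions (not taken verbatim from the paper):
      OC: the true group is absent, yet a classification is reported.
      MC: the true group is present, yet a different group is reported.
    """
    classifiable = [p for p in preds if p.classifiable]
    novel = [p for p in preds if not p.classifiable]

    # A sequence counts as "classified" when its confidence meets the threshold.
    oc = sum(p.confidence >= threshold for p in novel)
    mc = sum(
        p.confidence >= threshold and p.predicted_group != p.true_group
        for p in classifiable
    )
    classified = sum(p.confidence >= threshold for p in classifiable)

    return {
        "oc_rate": oc / len(novel) if novel else 0.0,
        "mc_rate": mc / len(classifiable) if classifiable else 0.0,
        "fraction_classified": classified / len(classifiable) if classifiable else 0.0,
    }


# Toy data, invented purely to exercise the function.
example = [
    Prediction("g1", "g1", 90.0, True),
    Prediction("g2", "g1", 70.0, True),
    Prediction("g3", "g1", 65.0, False),
    Prediction("g4", "g2", 30.0, False),
]

for t in (60, 80):
    print(t, error_rates(example, t))
```

Raising the threshold trades a lower error rate for a smaller fraction of classifiable sequences classified, which is the trade-off that error-versus-fraction curves summarize.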

Conflict of interest statement

The authors declare that they have no competing interests.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

Fig. 1
The IDTAXA algorithm exhibits relatively low OC error rates. Plots showing error rates versus the fraction of classifiable sequences classified as confidence is varied from 100% (left) to 0% (right). A better classifier will exhibit lower error rates during leave-one-out cross-validation while classifying the same fraction of classifiable sequences, shifting its curves downward. Misclassification (MC) error rates (dashed lines) are much lower than over classification (OC) error rates (solid lines) on three different training sets: the RDP training set of full-length 16S rRNA gene sequences (a), the Contax training set (b), and the Warcup ITS training set (c). The IDTAXA algorithm consistently displays the lowest OC error rates across different training sets. MC and OC error rates are higher when testing the shorter V4 region (~ 251 nucleotides) of the RDP training set (d). Points indicate error rates at default/recommended confidence thresholds: ≥ 95% sequence identity for BLAST, ≥ 70% confidence for QIIME, ≥ 60% confidence for IDTAXA, ≥ 50% confidence for MAPSeq, and ≥ 80% confidence for all others
Fig. 2
Variability in sequence similarity at the same confidence level. During leave-one-out cross-validation with the RDP training set, for each singleton sequence, we computed the distance to the nearest sequence in the group to which it was assigned. The IDTAXA algorithm only assigned a high confidence to sequences that had a low distance to the query sequence being classified. In contrast, all other k-mer approaches assigned high confidences even when all of the sequences in the group were distant to the query sequence. The curves indicate the cubic spline that best fits the data
Fig. 3
Confidences assigned to random and repeat sequences. Using the RDP training set, the RDP Classifier and SINTAX assigned high confidences at the domain level (i.e., Bacteria or Archaea) to 1000 query sequences composed of 1000 random nucleotides. Similarly, both the RDP Classifier and SINTAX assigned high confidence at the genus level to 1000 sequences composed of repeats with periodicity varying from 1 (e.g., AAA...) to 7. In contrast, the IDTAXA, MAPSeq, and SPINGO algorithms assigned low confidences to random and repeat sequences at all taxonomic levels
Fig. 4
Comparison of classifications using human and environmental microbiome data. The number of sequences assigned to each taxonomic group in the RDP training set is shown for full-length 16S rRNA gene sequences originating from two different environments [2]. The RDP Classifier was far more permissive at its default (≥ 80%) confidence than IDTAXA at its default (≥ 60%) confidence. Even at a 100% confidence threshold, the RDP Classifier assigned sequences to many more groups than the IDTAXA algorithm, possibly because of its substantially higher OC error rate. Note that some points may be overlapping, particularly at low numbers of assigned sequences
Fig. 5
Some misclassifications may be due to labeling errors. Many misclassifications (≥ 0% confidence) on the full-length RDP training set are to groups containing a sequence that has greater sequence identity than any sequence in the correct group. Extreme cases to the left of the vertical line are potentially due to labeling errors in the RDP training set
Fig. 6
Result of classifying sequences with the IdTaxa function. The outputs of the IdTaxa function can be plotted with the DECIPHER package for the R programming language or exported for integration into a separate bioinformatics pipeline. The pie chart shows the distribution of IDTAXA classifications for 268,930 full-length 16S rRNA gene sequences from a human gut sample [2]
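
The negative-control queries described in the Fig. 3 caption (sequences of random nucleotides, and repeats with periodicity from 1 to 7) are straightforward to generate. The following is a minimal sketch, not the authors' procedure: the function names, the uniform base frequencies, the randomly drawn repeat unit, and the length of 1000 for the repeat sequences are all assumptions made for illustration.

```python
import random


def random_sequence(length: int, rng: random.Random) -> str:
    """Return a sequence of uniformly random nucleotides (assumed uniform)."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def repeat_sequence(period: int, length: int, rng: random.Random) -> str:
    """Return a sequence built by tiling a random unit of `period` bases.

    A randomly drawn unit can itself be internally repetitive (e.g. "AAAA"),
    so `period` is an upper bound on the true periodicity.
    """
    unit = random_sequence(period, rng)
    # Tile the unit past the target length, then trim to exactly `length`.
    return (unit * (length // period + 1))[:length]


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# 1000 queries of 1000 random nucleotides, as in the caption.
random_queries = [random_sequence(1000, rng) for _ in range(1000)]

# 1000 repeat queries with periodicity drawn from 1 to 7;
# the length of 1000 here is an assumption.
repeat_queries = [
    repeat_sequence(rng.randint(1, 7), 1000, rng) for _ in range(1000)
]
```

Queries like these should receive low confidence from a well-behaved classifier, so they can serve as a quick sanity check on any training set.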

References

    1. Nussinov R, Papin JA. How can computation advance microbiome research? PLoS Comput Biol. 2017;13:e1005547. doi: 10.1371/journal.pcbi.1005547.
    2. Karst SM, Dueholm MS, McIlroy SJ, Kirkegaard RH, Nielsen PH, Albertsen M. Retrieval of a million high-quality, full-length microbial 16S and 18S rRNA gene sequences without primer bias. Nat Biotech. 2018;36(2):190–195. doi: 10.1038/nbt.4045.
    3. Parks DH, Rinke C, Chuvochina M, Chaumeil P-A, Woodcroft BJ, Evans PN, et al. Recovery of nearly 8,000 metagenome-assembled genomes substantially expands the tree of life. Nat Microbiol. 2017;2:1533–1542.
    4. Rinke C, Schwientek P, Sczyrba A, Ivanova NN, Anderson IJ, Cheng J-F, et al. Insights into the phylogeny and coding potential of microbial dark matter. Nature. 2013;499:431–437. doi: 10.1038/nature12352.
    5. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. doi: 10.1093/nar/25.17.3389.
