PLoS One. 2013 Oct 18;8(10):e76910.
doi: 10.1371/journal.pone.0076910. eCollection 2013.

Two new computational methods for universal DNA barcoding: a benchmark using barcode sequences of bacteria, archaea, animals, fungi, and land plants

Akifumi S Tanabe et al. PLoS One.

Abstract

Taxonomic identification of biological specimens based on DNA sequence information (a.k.a. DNA barcoding) is becoming increasingly common in biodiversity science. Although several methods have been proposed, many of them are not universally applicable due to the need for prerequisite phylogenetic/machine-learning analyses, the need for huge computational resources, or the lack of a firm theoretical background. Here, we propose two new computational methods of DNA barcoding and show a benchmark for bacterial/archaeal 16S, animal COX1, fungal internal transcribed spacer, and three plant chloroplast (rbcL, matK, and trnH-psbA) barcode loci that can be used to compare the performance of existing and new methods. The benchmark was performed under two alternative situations: query sequences were available in the corresponding reference sequence databases in one, but were not available in the other. In the former situation, the commonly used "1-nearest-neighbor" (1-NN) method, which assigns the taxonomic information of the most similar sequence in a reference database (i.e., the BLAST-top-hit reference sequence) to a query, displayed the highest rate and highest precision of successful taxonomic identification. However, in the latter situation, the 1-NN method produced extremely high rates of misidentification for all the barcode loci examined. In contrast, one of our new methods, the query-centric auto-k-nearest-neighbor (QCauto) method, consistently produced low rates of misidentification for all the loci examined in both situations. These results indicate that the 1-NN method is most suitable if the reference sequences of all potentially observable species are available in databases; otherwise, the QCauto method returns the most reliable identification results. The benchmark results also indicated that the taxon coverage of reference sequences is far from complete for genus or species level identification in all the barcode loci examined. Therefore, we need to accelerate the registration of reference barcode sequences to apply high-throughput DNA barcoding to genus or species level identification in biodiversity research.
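The two benchmark situations can be illustrated with a minimal 1-NN sketch. It is not the authors' implementation: BLAST is replaced by a toy per-position identity on pre-aligned, equal-length sequences, and the names `identity`, `one_nn`, `exclude_self`, and the `REFERENCES` data are hypothetical. Setting `exclude_self=True` mimics the situation in which the query's own sequence is absent from the reference database.

```python
from typing import Dict, Optional, Tuple


def identity(a: str, b: str) -> float:
    """Fraction of matching positions (toy stand-in for a BLAST similarity score)."""
    if len(a) != len(b):
        raise ValueError("toy identity expects pre-aligned, equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def one_nn(
    query_id: str,
    query_seq: str,
    references: Dict[str, Tuple[str, str]],
    exclude_self: bool = False,
) -> Optional[str]:
    """Return the taxon of the most similar reference (id -> (sequence, taxon))."""
    best_taxon, best_score = None, -1.0
    for ref_id, (seq, taxon) in references.items():
        if exclude_self and ref_id == query_id:
            continue  # simulate a database that lacks the query's own sequence
        score = identity(query_seq, seq)
        if score > best_score:
            best_taxon, best_score = taxon, score
    return best_taxon


# Hypothetical toy reference database, for illustration only.
REFERENCES = {
    "r1": ("ACGTACGTAC", "Genus_a species_x"),
    "r2": ("ACGTACGTAA", "Genus_a species_y"),
    "r3": ("TTGTACGGAC", "Genus_b species_z"),
}

with_self = one_nn("r1", REFERENCES["r1"][0], REFERENCES)
without_self = one_nn("r1", REFERENCES["r1"][0], REFERENCES, exclude_self=True)
print(with_self)     # Genus_a species_x
print(without_self)  # Genus_a species_y
```

In this toy case the top hit is always reported down to species level, so removing the query's own sequence turns a correct species assignment into an incorrect one, which is the failure mode the abstract describes for 1-NN.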

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Schematic illustration of the relationship between query and reference sequences.
A query sequence (filled circle) and reference sequences similar to the query sequence (open circles) are shown. The range of nucleotide variation of a genus (gray area) is shown with reference sequences of two species in that genus (A and B, respectively). Distance between the sequences represents genetic distance in the schematic two-dimensional space. (a) A case in which our new criterion works well. The query falls within the nucleotide variation range of the genus. (b) A case in which our new criterion might produce misidentification. Because the genetic distance between a query sequence and the sequence similar to it (A) is smaller than the genetic distance between sequence A and sequence B, the query sequence will be assigned to the genus under our new criterion.
Figure 2. Schematic illustration of the NNCauto and QCauto methods.
The processes of the NNCauto method are summarized as follows: (a) By a BLAST-search of a query sequence (Q), a nearest-neighbor sequence (A) is retrieved. (b) By a BLAST-search of A, a borderline sequence (B) is retrieved. (c) By an additional BLAST-search of A, all neighborhood sequences (open circles) are retrieved. Finally, the query is identified at the lowest taxonomic level where the taxonomic information of all the neighborhood sequences including A and B is consistent with each other (i.e., lowest common ancestor algorithm [21]). In the QCauto method, the processes a and b are shared with the NNCauto method, but neighborhood sequences are retrieved by a BLAST-search of Q (d). After the search of neighborhood sequences, the query is identified by the LCA algorithm as in the NNCauto method. A bidirectional arrow indicates genetic distance between two sequences, and a dotted circle represents the range of nucleotide variation that meets the requirement of a BLAST-search.
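The steps in this caption can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: BLAST searches are replaced by a precomputed toy distance table, the nearest-neighbor A and borderline sequence B are passed in rather than searched for, and the neighborhood is assumed to be every reference whose distance from the center is at most the A-to-B distance (the caption does not spell out this rule). All names (`identify`, `query_centric`, `TABLE`, `LINEAGES`) are hypothetical.

```python
from typing import Dict, Sequence, Tuple

Lineage = Tuple[str, ...]  # ordered from phylum/division down to species


def lowest_common_ancestor(lineages: Sequence[Lineage]) -> Lineage:
    """Longest prefix of ranks on which all lineages agree."""
    shared = []
    for names in zip(*lineages):
        if len(set(names)) != 1:
            break
        shared.append(names[0])
    return tuple(shared)


def distance(table: Dict[Tuple[str, str], float], a: str, b: str) -> float:
    """Symmetric lookup in a toy distance table keyed by sorted id pairs."""
    return 0.0 if a == b else table[tuple(sorted((a, b)))]


def identify(query, nearest, borderline, refs, table, lineages, query_centric=True):
    """query_centric=True centers the neighborhood on Q; False centers it on A."""
    radius = distance(table, nearest, borderline)  # ASSUMPTION: radius = d(A, B)
    center = query if query_centric else nearest
    members = {r for r in refs if distance(table, center, r) <= radius}
    members |= {nearest, borderline}  # A and B are always included
    return lowest_common_ancestor([lineages[r] for r in sorted(members)])


# Hypothetical toy data: A and B share a genus, C belongs to another genus.
LINEAGES = {
    "A": ("P", "C", "O", "F", "G1", "s1"),
    "B": ("P", "C", "O", "F", "G1", "s2"),
    "C": ("P", "C", "O", "F", "G2", "s3"),
}
TABLE = {
    ("A", "Q"): 0.02, ("B", "Q"): 0.05, ("C", "Q"): 0.04,
    ("A", "B"): 0.04, ("A", "C"): 0.08, ("B", "C"): 0.09,
}
REFS = ["A", "B", "C"]

print(identify("Q", "A", "B", REFS, TABLE, LINEAGES, query_centric=True))
print(identify("Q", "A", "B", REFS, TABLE, LINEAGES, query_centric=False))
```

With these toy distances the query-centered neighborhood also captures C, so the assignment stops at family level, whereas the neighborhood centered on A contains only A and B and the assignment reaches genus level.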
Figure 3. Frequencies of correctness scores in the no-LOOCV of full-length query sets.
The number of correctly identified taxonomic levels is used as an index representing the degree of correctness of taxonomic assignment. This correctness index has the maximum value 6 when the taxonomic information at all the phylum/division, class, order, family, genus, and species levels is correctly assigned. On the other hand, the index has the minimum value 0 when taxonomic information at all the six taxonomic levels is erroneously assigned to a query or a query remains unidentified even at the phylum/division level. 1NN, 5NN, 97%, 99%, Bar, CNJ, RDP, NNC, and QC represent the 1-NN, 5-NN, 97%-NN, 99%-NN, Barcoder, ConstrainedNJ, RDPClassifier, NNCauto, and QCauto methods, respectively. (a) Animal COX1. (b) Bacterial/Archaeal 16S. (c) Fungal ITS. (d) Plant matK. (e) Plant rbcL. (f) Plant trnH-psbA.
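The correctness index defined in this caption can be computed as in the minimal sketch below. The representation is an assumption: each assignment is a six-element sequence ordered by `RANKS`, with `None` marking a level left unidentified. The function name and example taxa are hypothetical.

```python
from typing import Optional, Sequence

RANKS = ("phylum/division", "class", "order", "family", "genus", "species")


def correctness_index(assigned: Sequence[Optional[str]], truth: Sequence[str]) -> int:
    """Count taxonomic levels that were assigned and match the true taxon (0 to 6)."""
    if len(assigned) != len(RANKS) or len(truth) != len(RANKS):
        raise ValueError("expected one entry per taxonomic level")
    return sum(a is not None and a == t for a, t in zip(assigned, truth))


# Hypothetical example lineage.
TRUTH = ("Arthropoda", "Insecta", "Lepidoptera", "Hesperiidae", "Astraptes", "Astraptes fulgerator")
to_family = TRUTH[:4] + (None, None)  # identified only down to family
unidentified = (None,) * 6            # unidentified even at phylum/division level

print(correctness_index(TRUTH, TRUTH))         # 6
print(correctness_index(to_family, TRUTH))     # 4
print(correctness_index(unidentified, TRUTH))  # 0
```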
Figure 4. Frequencies of incorrectness scores in the no-LOOCV of full-length query sets.
The number of incorrectly identified taxonomic levels is used as an index representing the degree of incorrectness of taxonomic assignment. This incorrectness index has the maximum value 6 when the taxonomic assignment of all the six taxonomic levels is incorrect. On the other hand, the index has the minimum value 0 when the taxonomic assignment does not return incorrect results at any taxonomic level: note that this includes the situation in which a query is unidentified even at the phylum/division level. 1NN, 5NN, 97%, 99%, Bar, CNJ, RDP, NNC, and QC represent the 1-NN, 5-NN, 97%-NN, 99%-NN, Barcoder, ConstrainedNJ, RDPClassifier, NNCauto, and QCauto methods, respectively. (a) Animal COX1. (b) Bacterial/Archaeal 16S. (c) Fungal ITS. (d) Plant matK. (e) Plant rbcL. (f) Plant trnH-psbA.
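A minimal sketch of the incorrectness index and of tallying its frequencies across a query set follows. As an assumption, each assignment is a six-element sequence from phylum/division to species with `None` for an unidentified level; unidentified levels are not counted as incorrect, in line with the caption. The function name and the toy results are hypothetical, not benchmark data.

```python
from collections import Counter
from typing import Optional, Sequence


def incorrectness_index(assigned: Sequence[Optional[str]], truth: Sequence[str]) -> int:
    """Count levels that were assigned a taxon different from the true one (0 to 6)."""
    if len(assigned) != len(truth):
        raise ValueError("assigned and true lineages must have the same length")
    return sum(a is not None and a != t for a, t in zip(assigned, truth))


# Hypothetical toy results: (assigned lineage, true lineage) per query.
true_lineage = ("P", "C", "O", "F", "G1", "s1")
results = [
    (true_lineage, true_lineage),                         # fully correct
    ((None,) * 6, true_lineage),                          # unidentified at every level
    (("P", "C", "O", "F", "G2", "s3"), true_lineage),     # wrong genus and species
]

frequencies = Counter(incorrectness_index(a, t) for a, t in results)
print(sorted(frequencies.items()))  # [(0, 2), (2, 1)]
```

Because an unidentified query scores 0 here, a method can lower its incorrectness score simply by declining to assign, so this index is best read alongside the correctness index.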
Figure 5. Frequencies of correctness scores in the LOOCV of full-length query sets.
1NN, 5NN, 97%, 99%, Bar, CNJ, RDP, NNC, and QC represent the 1-NN, 5-NN, 97%-NN, 99%-NN, Barcoder, ConstrainedNJ, RDPClassifier, NNCauto, and QCauto methods, respectively. (a) Animal COX1. (b) Bacterial/Archaeal 16S. (c) Fungal ITS. (d) Plant matK. (e) Plant rbcL. (f) Plant trnH-psbA. See the caption of Fig. 3 for the explanation of the correctness index.
Figure 6. Frequencies of incorrectness scores in the LOOCV of full-length query sets.
1NN, 5NN, 97%, 99%, Bar, CNJ, RDP, NNC, and QC represent the 1-NN, 5-NN, 97%-NN, 99%-NN, Barcoder, ConstrainedNJ, RDPClassifier, NNCauto, and QCauto methods, respectively. (a) Animal COX1. (b) Bacterial/Archaeal 16S. (c) Fungal ITS. (d) Plant matK. (e) Plant rbcL. (f) Plant trnH-psbA. See the caption of Fig. 4 for the explanation of the incorrectness index.

References

    1. Cardinale BJ, Duffy JE, Gonzalez A, Hooper DU, Perrings C, et al. (2012) Biodiversity loss and its impact on humanity. Nature 486: 59–67.
    2. Primack RB (1993) Essentials of conservation biology. Sunderland, MA: Sinauer Associates.
    3. CBOL Plant Working Group (2009) A DNA barcode for land plants. Proceedings of the National Academy of Sciences of the United States of America 106: 12794–12797.
    4. Hebert PDN, Penton EH, Burns JM, Janzen DH, Hallwachs W (2004) Ten species in one: DNA barcoding reveals cryptic species in the neotropical skipper butterfly Astraptes fulgerator. Proceedings of the National Academy of Sciences of the United States of America 101: 14812–14817.
    5. Human Microbiome Project Consortium (2012) Structure, function and diversity of the healthy human microbiome. Nature 486: 207–214.

Grants and funding

This work was supported by grant-in-aid by Funding Program for Next Generation World-Leading Researchers (GS014) by the Japan Society for the Promotion of Science. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.