RefSeq database growth influences the accuracy of k-mer-based lowest common ancestor species identification

Genome Biol. 2018 Oct 30;19(1):165. doi: 10.1186/s13059-018-1554-6.

Abstract

To determine the role of the database in taxonomic sequence classification, we examine how changes in the database over time affect k-mer-based lowest common ancestor (LCA) taxonomic classification. We present three major findings: (1) the number of new species added to the NCBI RefSeq database greatly outpaces the number of new genera; (2) as a result, more reads are classified with newer database versions, but fewer are classified at the species level; and (3) Bayesian-based re-estimation mitigates this effect but struggles with novel genomes. These results suggest a need for new classification approaches specially adapted for large databases.
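
The second finding can be illustrated with a minimal Python sketch of k-mer-based LCA labeling. This is a toy under stated assumptions, not the method evaluated in the paper: the taxonomy table, the two short sequences, the helper names (`lineage`, `lca`, `build_index`, `classify`), and the read-level rule (take the LCA of all k-mer hits) are hypothetical simplifications, and real classifiers typically use more elaborate scoring. It shows how adding a closely related species to the database can move a read's label from species to genus, while a read that was previously unclassified gains a label.

```python
# Toy taxonomy (hypothetical): child -> parent, with None above the root.
PARENT = {
    "Bacillus cereus": "Bacillus",
    "Bacillus anthracis": "Bacillus",
    "Bacillus": "root",
    "root": None,
}


def lineage(taxon):
    """Return the path from a taxon up to the root, inclusive."""
    path = []
    while taxon is not None:
        path.append(taxon)
        taxon = PARENT[taxon]
    return path


def lca(a, b):
    """Lowest common ancestor of two taxa in the toy taxonomy."""
    ancestors = set(lineage(a))
    for taxon in lineage(b):
        if taxon in ancestors:
            return taxon
    raise ValueError("taxa share no ancestor")


def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(genomes, k):
    """Map each k-mer to the LCA of every taxon whose genome contains it."""
    index = {}
    for taxon, seq in genomes.items():
        for km in kmers(seq, k):
            index[km] = lca(index[km], taxon) if km in index else taxon
    return index


def classify(read, index, k):
    """Simplified rule (assumption): label a read with the LCA of all its k-mer hits."""
    hits = [index[km] for km in kmers(read, k) if km in index]
    if not hits:
        return None  # unclassified
    label = hits[0]
    for hit in hits[1:]:
        label = lca(label, hit)
    return label


K = 4
# Made-up sequences standing in for an older and a newer database version.
old_db = {"Bacillus cereus": "ACGTACGGTA"}
new_db = {**old_db, "Bacillus anthracis": "ACGTACGGTC"}

shared_read = "ACGTACGG"  # every k-mer occurs in both toy genomes
novel_read = "GGTC"       # its only k-mer occurs in the added genome alone

old_index = build_index(old_db, K)
new_index = build_index(new_db, K)

print(classify(shared_read, old_index, K))  # species-level label
print(classify(shared_read, new_index, K))  # genus-level label
print(classify(novel_read, old_index, K))   # unclassified
print(classify(novel_read, new_index, K))   # species-level label
```

In this toy setting the larger database classifies more reads overall, but the read drawn from a region shared by both species can no longer be resolved below the genus.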

Keywords: Comparative analysis; LCA; Metagenomics; Microbiome; Reference database; Taxonomic classification; k-mer.

Publication types

  • Letter
  • Research Support, N.I.H., Intramural
  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Algorithms*
  • Bacillus / genetics
  • Computer Simulation
  • Databases, Genetic*
  • Genetic Variation
  • Metagenome
  • Sequence Analysis, DNA*
  • Species Specificity
  • Time Factors

Associated data

  • figshare/10.6084/m9.figshare.7090697