Int J Mol Sci. 2020 Jan 31;21(3):944.
doi: 10.3390/ijms21030944.

Unique k-mers as Strain-Specific Barcodes for Phylogenetic Analysis and Natural Microbiome Profiling

Valery V Panyukov et al. Int J Mol Sci. 2020.

Abstract

The need for a comparative analysis of natural metagenomes stimulated the development of new methods for their taxonomic profiling. Alignment-free approaches based on the search for marker k-mers turned out to be capable of identifying not only species, but also strains of microorganisms with known genomes. Here, we evaluated the ability of genus-specific k-mers to distinguish eight phylogroups of Escherichia coli (A, B1, C, E, D, F, G, B2) and assessed the presence of their unique 22-mers in clinical samples from microbiomes of four healthy people and four patients with Crohn's disease. We found that a phylogenetic tree inferred from the pairwise distance matrix for unique 18-mers and 22-mers of 124 genomes was fully consistent with the topology of the tree, obtained with concatenated aligned sequences of orthologous genes. Therefore, we propose strain-specific "barcodes" for rapid phylotyping. Using unique 22-mers for taxonomic analysis, we detected microbes of all groups in human microbiomes; however, their presence in the five samples was significantly different. Pointing to the intraspecies heterogeneity of E. coli in the natural microflora, this also indicates the feasibility of further studies of the role of this heterogeneity in maintaining population homeostasis.
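The idea of unique k-mers can be illustrated with a minimal Python sketch. It assumes that a k-mer counts as "unique" to a genome when it does not occur in any sequence of a reference collection. The function names and toy sequences are hypothetical, strand handling (reverse complements) is omitted for brevity, and this is not the authors' pipeline.

```python
def kmers(seq, k):
    """Return the set of all substrings of length k found in seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def unique_kmers(genome, reference_seqs, k):
    """Return k-mers of `genome` that occur in none of the reference sequences."""
    reference = set()
    for ref_seq in reference_seqs:
        reference |= kmers(ref_seq, k)
    return kmers(genome, k) - reference


# Toy data (hypothetical); real analyses use whole genomes and k = 18 or 22.
genome = "ACGTACGGA"
reference_seqs = ["ACGTAC", "TTTT"]
print(sorted(unique_kmers(genome, reference_seqs, 4)))
```

For real genomes the reference set would be far too large for an in-memory Python set, so a dedicated k-mer counting tool or an on-disk index would be a more realistic choice.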

Keywords: alignment-free algorithms; bacterial genomes; genome barcodes; human microbiome; k-mers; metagenomes; phylogenetic trees; phylotyping; taxonomic profiling.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1
The size of the “unique genomes” represented by k-mers of different lengths for eight individual E. coli chromosomes, and the degree of their intersection exemplified by three indicated genomes. (a) The solid lines show the number of k-mers (N), normalized per 1 Mbp of each genome, that were found in the chromosomes of E. coli (strains: K-12 MG1655, ETEC H10407, O26:H11 str. 11368, ABU 83972, APEC O78, str. 042, O157:H7 str. EC4115 and O7:K1 str. CE10) and are absent from the nucleotide sequences of the reference database. Dashed lines show the increment curves plotted for ΔN/Δk. (b) Venn diagram illustrating the intersection between the sets of 18-mers identified in the genomes of two bacteria from group A (E. coli K-12 MG1655 and ETEC H10407) and in E. coli O26:H11 str. 11368, which belongs to group B1. The number of unique 18-mers in each genome, the size of their common set and the intersection between the two sets of group A are indicated without normalization. The diagram was created using a Venn Diagram Maker [54].
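The two quantities plotted in panel (a) can be sketched as follows. This is a minimal illustration that assumes N is the raw count of unique k-mers scaled to 1 Mbp of genome and that ΔN/Δk is a finite difference between neighbouring values of k. The function names and all numbers are hypothetical, not data from the figure.

```python
def normalized_counts(unique_counts, genome_length_bp):
    """Scale raw unique k-mer counts to counts per 1 Mbp of genome."""
    return {k: n * 1_000_000 / genome_length_bp
            for k, n in unique_counts.items()}


def increments(counts_by_k):
    """Finite-difference increment (delta N / delta k) between consecutive k."""
    ks = sorted(counts_by_k)
    return {k2: (counts_by_k[k2] - counts_by_k[k1]) / (k2 - k1)
            for k1, k2 in zip(ks, ks[1:])}


# Hypothetical raw counts for one 5 Mbp chromosome.
raw = {14: 100, 16: 500, 18: 900}
per_mbp = normalized_counts(raw, 5_000_000)
print(per_mbp)
print(increments(per_mbp))
```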
Figure 2
Phylogenetic tree for 124 E. coli strains inferred from concatenated aligned sequences of 27 genes in the IQ-TREE program [70] using the maximum likelihood method. The optimal model for nucleotide substitution was GTR+G+I (the general time-reversible model assuming a fixed proportion of invariant sites and evolutionary rate differences described by the gamma distribution). The branch support level, shown as a percentage, was estimated based on 2000 iterations with ultrafast bootstrap approximation [71]. The scale bar corresponds to the number of nucleotide substitutions per site. The color code corresponds to the eight indicated phylogroups. The names of all strains are indicated near the corresponding branches and are separated by commas for identical sequences in group B1.
Figure 3
Phylogenetic tree constructed by the neighbor-joining method in the MEGA X program [73]. The tree was inferred from the pairwise distance matrix for 124 sets of 18-mers unique to the genera Escherichia/Shigella and was identical to the tree constructed on the basis of 22-mers. The set of marker 18-mers from the genome of Escherichia albertii KF1 was used as the outgroup sample. The scale bar shows the Sorensen distance as a percentage. The same color code as in Figure 2 denotes the clades of eight phylogroups.
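A pairwise distance matrix of this kind can be sketched in Python, assuming the standard Sorensen-Dice definition, d = 1 - 2|A ∩ B| / (|A| + |B|), expressed as a percentage. The caption does not spell out the formula, so this definition is an assumption, and the function names and toy sets below are hypothetical.

```python
def sorensen_distance_percent(a, b):
    """Sorensen-Dice distance between two sets, as a percentage (0 to 100)."""
    if not a and not b:
        return 0.0  # two empty sets are treated as identical
    return 100.0 * (1.0 - 2.0 * len(a & b) / (len(a) + len(b)))


def distance_matrix(kmer_sets):
    """Square matrix of pairwise distances, in the order of the dict keys."""
    names = list(kmer_sets)
    return [[sorensen_distance_percent(kmer_sets[x], kmer_sets[y])
             for y in names] for x in names]


# Toy sets standing in for the unique 18-mer sets of three strains.
sets = {
    "strain_1": {"AAAA", "CCCC", "GGGG", "TTTT"},
    "strain_2": {"GGGG", "TTTT", "ACGT", "TGCA"},
    "strain_3": {"AAAA", "CCCC", "GGGG", "TTTT"},
}
for row in distance_matrix(sets):
    print(row)
```

A matrix like this is the input that a neighbor-joining implementation expects; the tree-building step itself is not shown here.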
Figure 4
Phylogroup-dependent taxonomy of metagenomes from four healthy individuals (numbers 1–4) and four patients with Crohn’s disease (numbers 5–8). Panel (a) shows the size distribution for cumulative sets of unique 22-mers (colored symbols) and selected metagenomes numbered in the same way as in panel “b” (open symbols). Panel (b) demonstrates the number of sequence reads assigned to a particular group, normalized by the size of cumulative sets of 22-mers (Table 1) and the number of reads in metagenomes. Numerical values in both cases are presented as their natural logarithms.
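One way to read the normalization described for panel (b) is sketched below. It assumes the plotted value is the natural logarithm of the assigned read count divided by both the size of the phylogroup's cumulative 22-mer set and the total number of reads in the metagenome. The exact formula is not given in the caption, so this form, the function name, and the numbers are assumptions.

```python
import math


def log_normalized_abundance(assigned_reads, kmer_set_size, total_reads):
    """ln(assigned / (k-mer set size * total reads)); -inf if nothing assigned."""
    if assigned_reads == 0:
        return float("-inf")
    return math.log(assigned_reads / (kmer_set_size * total_reads))


# Hypothetical values: 1,200 reads assigned to one phylogroup whose cumulative
# set holds 50,000 unique 22-mers, in a metagenome of 10 million reads.
print(log_normalized_abundance(1_200, 50_000, 10_000_000))
```

Dividing by the set size keeps phylogroups with many marker k-mers from appearing more abundant simply because they offer more targets for a read to match.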

References

    1. Woese C.R., Fox G.E. Phylogenetic structure of the prokaryotic domain: The primary kingdoms. Proc. Natl. Acad. Sci. USA. 1977;74:5088–5090. doi: 10.1073/pnas.74.11.5088.
    2. Wang Q., Garrity G.M., Tiedje J.M., Cole J.R. Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl. Environ. Microbiol. 2007;73:5261–5267. doi: 10.1128/AEM.00062-07.
    3. DeSantis T.Z., Hugenholtz P., Larsen N., Rojas M., Brodie E.L., Keller K., Huber T., Dalevi D., Hu P., Andersen G.L. Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl. Environ. Microbiol. 2006;72:5069–5072. doi: 10.1128/AEM.03006-05.
    4. Vetrovsky T., Baldrian P. The variability of the 16S rRNA gene in bacterial genomes and its consequences for bacterial community analyses. PLoS ONE. 2013;8:e57923. doi: 10.1371/journal.pone.0057923.
    5. Andersson A.F., Lindberg M., Jakobsson H., Backhed F., Nyren P., Engstrand L. Comparative analysis of human gut microbiota by barcoded pyrosequencing. PLoS ONE. 2008;3:e2836. doi: 10.1371/journal.pone.0002836.