Kullback Leibler divergence in complete bacterial and phage genomes

PeerJ. 2017 Nov 30;5:e4026. doi: 10.7717/peerj.4026. eCollection 2017.

Abstract

The amino acid content of the proteins encoded by a genome may predict the coding potential of that genome and may reflect lifestyle restrictions of the organism. Here, we calculated the Kullback-Leibler divergence from the mean amino acid content as a metric to compare the amino acid composition of a large set of bacterial and phage genome sequences. Using these data, we demonstrate that (i) amino acid utilization differs significantly between phylogenetic groups of bacteria and phages; (ii) many of the bacteria with the most skewed amino acid utilization profiles, or the bacteria that host phages with the most skewed profiles, are endosymbionts or parasites; (iii) the skews in the distribution are not restricted to certain metabolic processes but are common across all bacterial genomic subsystems; (iv) amino acid utilization profiles correlate strongly with GC content in bacterial genomes but only very weakly with GC content in phage genomes. These findings might be exploited to distinguish coding from non-coding sequences in large data sets, such as metagenomic sequence libraries, to help in prioritizing subsequent analyses.
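
To make the metric concrete, the following is a minimal Python sketch of a Kullback-Leibler divergence from a mean amino acid profile. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the logarithm base, the direction of the divergence, or how the mean is computed, so the sketch assumes D(P || Q) = sum_i p_i log2(p_i / q_i), with P the amino acid frequencies of one genome and Q the unweighted mean of the per-genome frequencies. The function names and the toy sequences are hypothetical.

```python
import math
from collections import Counter

# The 20 standard amino acids (one-letter codes); other symbols are ignored.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_frequencies(proteins):
    """Return the relative frequency of each amino acid across a genome's proteins."""
    counts = Counter()
    for seq in proteins:
        counts.update(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard amino acids found")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def mean_frequencies(profiles):
    """Unweighted mean of per-genome frequency profiles (an assumption of this sketch)."""
    n = len(profiles)
    return {aa: sum(p[aa] for p in profiles) / n for aa in AMINO_ACIDS}


def kl_divergence(p, q):
    """D(P || Q) in bits. Terms with p_i == 0 contribute nothing.

    Assumes q_i > 0 wherever p_i > 0, which holds when q is the mean of a
    set of profiles that includes p.
    """
    return sum(p[aa] * math.log2(p[aa] / q[aa]) for aa in AMINO_ACIDS if p[aa] > 0)


if __name__ == "__main__":
    # Hypothetical toy "genomes", each a list of protein sequences.
    genomes = {
        "genome_a": ["MKKLAVNNKK", "MNKIKNYK"],
        "genome_b": ["MAGRPAALGG", "MRGAPGAV"],
    }
    profiles = {name: amino_acid_frequencies(seqs) for name, seqs in genomes.items()}
    mean = mean_frequencies(list(profiles.values()))
    for name, profile in profiles.items():
        print(name, round(kl_divergence(profile, mean), 4))
```

Under these assumptions, a genome whose amino acid usage matches the mean scores 0, and larger values indicate a more skewed profile.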

Keywords: Genometrics; Genomics; Information theory; Metagenomics.

Grants and funding

This work was supported by the PhAnToMe grant from the National Science Foundation (NSF) Division of Biological Infrastructure (DBI-0850356 to Robert A. Edwards), which also partly covered Sajia Akhter and Ramy K. Aziz while at SDSU. Robert A. Edwards is also supported by NSF grant MCB-1330800. Ramy K. Aziz is partly funded by Faculty of Pharmacy, Cairo University, Grant IRG-2015-2. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.