Conservation of gene cassettes among diverse viruses of the human gut

PLoS One. 2012;7(8):e42342. doi: 10.1371/journal.pone.0042342. Epub 2012 Aug 10.


Viruses are a crucial component of the human microbiome, but large population sizes, high sequence diversity, and high frequencies of novel genes have hindered genomic analysis by high-throughput sequencing. Here we investigate approaches to metagenomic assembly to probe genome structure in a sample of 5.6 Gb of gut viral DNA sequence from six individuals. Tests showed that a new pipeline based on de Bruijn graph assembly yielded longer contigs that were able to recruit more reads than the equivalent non-optimized, single-pass approach. To characterize gene content, the database of viral RefSeq proteins was compared to the assembled viral contigs, generating a bipartite graph with functional cassettes linking together viral contigs, which revealed a high degree of connectivity between diverse genomes involving multiple genes of the same functional class. In a second step, open reading frames were grouped by their co-occurrence on contigs in a database-independent manner, revealing conserved cassettes of co-oriented ORFs. These methods reveal that free-living bacteriophages, while usually dissimilar at the nucleotide level, often have significant similarity at the level of encoded amino acid motifs, gene order, and gene orientation. These findings thus connect contemporary metagenomic analysis with classical studies of bacteriophage genomic cassettes. Software is available at
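
The bipartite graph described in the abstract, with contigs on one side and protein functional classes on the other, can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the `min_shared` threshold, and the toy hit list are all hypothetical, and the sketch assumes that protein-to-contig matches have already been computed by some external search step.

```python
from collections import defaultdict
from itertools import combinations


def build_bipartite(hits):
    """Build both sides of a contig <-> functional-class bipartite graph.

    hits: iterable of (contig_id, functional_class) pairs (hypothetical input format).
    """
    contig_to_classes = defaultdict(set)
    class_to_contigs = defaultdict(set)
    for contig, func_class in hits:
        contig_to_classes[contig].add(func_class)
        class_to_contigs[func_class].add(contig)
    return contig_to_classes, class_to_contigs


def linked_contigs(contig_to_classes, min_shared=2):
    """Return contig pairs that share at least `min_shared` functional classes."""
    links = {}
    for a, b in combinations(sorted(contig_to_classes), 2):
        common = contig_to_classes[a] & contig_to_classes[b]
        if len(common) >= min_shared:
            links[(a, b)] = common
    return links


# Toy data invented for illustration only; not taken from the study.
hits = [
    ("contig_1", "terminase"), ("contig_1", "portal"), ("contig_1", "integrase"),
    ("contig_2", "terminase"), ("contig_2", "portal"),
    ("contig_3", "integrase"),
]

contig_to_classes, class_to_contigs = build_bipartite(hits)
links = linked_contigs(contig_to_classes)
for pair, shared in links.items():
    print(pair, sorted(shared))
```

In this toy example only `contig_1` and `contig_2` are linked, because they share two functional classes; `contig_3` shares a single class with `contig_1` and falls below the assumed threshold.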

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Bacteriophages / genetics
  • Bacteriophages / metabolism
  • Computational Biology / methods
  • Conserved Sequence
  • Contig Mapping
  • Genome, Viral
  • Humans
  • Intestines / microbiology
  • Intestines / virology*
  • Metagenome / genetics*
  • Open Reading Frames
  • Viral Proteins / genetics

Substances

  • Viral Proteins