Bioinformatics. 2012 Dec 15;28(24):3225-31.
doi: 10.1093/bioinformatics/bts613. Epub 2012 Oct 16.

Reference-independent Comparative Metagenomics Using Cross-Assembly: crAss

Free PMC article

Bas E Dutilh et al. Bioinformatics. 2012.

Abstract

Motivation: Metagenomes are often characterized by high levels of unknown sequences. Reads derived from known microorganisms can easily be identified and analyzed using fast homology search algorithms and a suitable reference database, but the unknown sequences are often ignored in further analyses, biasing conclusions. Nevertheless, it is possible to use more data in a comparative metagenomic analysis by creating a cross-assembly of all reads, i.e. a single assembly of reads from different samples. Comparative metagenomics studies the interrelationships between metagenomes from different samples. Using an assembly algorithm is a fast and intuitive way to link (partially) homologous reads without requiring a database of reference sequences.

Results: Here, we introduce crAss, a novel bioinformatic tool that enables fast, simple analysis of cross-assembly files. It yields distances between all pairs of metagenomic samples and an insightful image displaying the similarities.
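
To make the idea of a distance derived from a cross-assembly concrete, the following is a minimal sketch, not the crAss implementation. It assumes the cross-assembly has already been summarized as a hypothetical mapping from contig name to per-sample read counts, and it uses a plain Jaccard-style distance (one minus the fraction of contigs shared by two samples) as a stand-in; the four distance formulas used by crAss are not given in this abstract and may differ. The function name and the toy data are illustrative assumptions.

```python
from itertools import combinations


def shared_contig_distance(contig_reads, samples):
    """Return a symmetric dict of pairwise distances between samples.

    contig_reads: dict mapping contig name -> dict of sample -> read count
                  (hypothetical summary of a cross-assembly).
    The distance is 1 - |shared contigs| / |contigs with reads from either
    sample|, a Jaccard-style stand-in and not a crAss formula.
    """
    # For each sample, collect the contigs that contain at least one of its reads.
    contigs_by_sample = {
        s: {c for c, counts in contig_reads.items() if counts.get(s, 0) > 0}
        for s in samples
    }
    distances = {}
    for a, b in combinations(samples, 2):
        union = contigs_by_sample[a] | contigs_by_sample[b]
        shared = contigs_by_sample[a] & contigs_by_sample[b]
        # Two samples with no contigs at all are treated as identical.
        d = 1.0 - len(shared) / len(union) if union else 0.0
        distances[(a, b)] = distances[(b, a)] = d
    return distances


# Toy data (invented for illustration): five contigs, three samples.
toy_contigs = {
    "contig1": {"nasal": 5},
    "contig2": {"fecal1": 3, "fecal2": 4},
    "contig3": {"nasal": 1, "fecal1": 2},
    "contig4": {"fecal2": 7},
    "contig5": {"fecal1": 6, "fecal2": 2},
}
toy_distances = shared_contig_distance(toy_contigs, ["nasal", "fecal1", "fecal2"])
```

In this toy example the two fecal samples share more contigs with each other than either shares with the nasal sample, so their pairwise distance is the smallest. A matrix of such distances is the kind of input from which a cladogram can be built.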

Figures

Fig. 1.
Distance between 31 simulated metagenomic samples with increasing species overlap, and simulated sample ov00 (see Supplementary File 1 for species distributions). Distances were calculated using the four crAss distance formulas; the fifth line shows the distance based on dinucleotide odds ratios (Willner et al., 2009). See Section 2 for details
Fig. 2.
In cladograms of nine simulated metagenomes, those containing Actinobacteria (n = 3) mostly form a separate cluster from those containing Firmicutes (n = 6; Supplementary Fig. S2). The length of the separating internal branch is plotted with increasing Proteobacteria contamination, until the Actinobacteria and Firmicutes metagenomes no longer form separate clusters, and there is no internal branch. See text for details. The cladograms created by crAss are available in Supplementary Figure S2
Fig. 3.
Cross-assembled metagenomic reads from one human nasal sample and two human fecal samples. Each gray diamond represents a contig. The X, Y and Z coordinates indicate the number of incorporated reads from the metagenomes mentioned along the axes. Note that zero values are set to 0.9 so they can be displayed on the logarithmic plot. Small black dots are the projections of the diamonds onto the planes, but superimposed for visibility. A triangle plot of the same data is also available. These graphs can be retrieved at http://edwards.sdsu.edu/crass/ under Job ID 1329506771
Fig. 4.
Cladogram representing the distance between metagenomes based on the fraction of cross-assembled contigs between all sample pairs. crAss creates this cladogram from a distance matrix using BioNJ (Gascuel, 1997) and visualizes it using Drawtree (Felsenstein, 1989). This cladogram was based on Equation (1). The complete output for this dataset, including distance matrices and cladograms based on the other distance formulas, can be retrieved at http://edwards.sdsu.edu/crass/ under Job ID 1329505996
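
The Fig. 3 caption notes that zero read counts are set to 0.9 so that they can be shown on logarithmic axes. A minimal sketch of that substitution follows. The function name is hypothetical and the code is not taken from crAss; only the value 0.9 comes from the caption.

```python
import math


def log_plot_coordinates(read_counts, zero_substitute=0.9):
    """Map per-sample read counts of one contig to log10 plot coordinates.

    Zero counts have no logarithm, so they are replaced by zero_substitute
    (0.9, as stated in the Fig. 3 caption), which places them just below
    the position of a count of 1.
    """
    return tuple(
        math.log10(n if n > 0 else zero_substitute) for n in read_counts
    )


# A contig with 100 nasal reads, none from the first fecal sample, and 10
# from the second (invented numbers).
example_point = log_plot_coordinates((100, 0, 10))
```

Because log10(0.9) is slightly negative, contigs with no reads from a sample sit just below the axis position of one read and stay visible on the plot.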

References

    1. Amari S.-I. Differential geometry of curved exponential families-curvatures and information loss. Ann. Stat. 1982;10:357–385.
    2. Angly F., et al. PHACCS, an online tool for estimating the structure and diversity of uncultured viral communities using metagenomic information. BMC Bioinformatics. 2005;6:41.
    3. Angly F.E., et al. The marine viromes of four oceanic regions. PLoS Biol. 2006;4:e368.
    4. Angly F.E., et al. Grinder: a versatile amplicon and shotgun sequence simulator. Nucleic Acids Res. 2012;40:e94.
    5. Balzer S., et al. Systematic exploration of error sources in pyrosequencing flowgram data. Bioinformatics. 2011;27:i304–i309.
