, 8, 398

Strainer: Software for Analysis of Population Variation in Community Genomic Datasets



John M Eppley et al. BMC Bioinformatics.


Background: Metagenomic analyses of microbial communities that are comprehensive enough to provide multiple samples of most loci in the genomes of the dominant organism types will also reveal patterns of genetic variation within natural populations. New bioinformatic tools will enable visualization and comprehensive analysis of this sequence variation and inference of recent evolutionary and ecological processes.

Results: We have developed a software package for analysis and visualization of genetic variation in populations and reconstruction of strain variants from otherwise co-assembled sequences. Sequencing reads can be clustered by matching patterns of single nucleotide polymorphisms to generate predicted gene and protein variant sequences, identify conserved intergenic regulatory sequences, and determine the quantity and distribution of recombination events.

Conclusion: The Strainer software, a first generation metagenomic bioinformatics tool, facilitates comprehension and analysis of heterogeneity intrinsic in natural communities. The program reveals the degree of clustering among closely related sequence variants and provides a rapid means to generate gene and protein sequences for functional, ecological, and evolutionary analyses.


Figure 1
Strainer display. Screen captures of the Strainer program displaying scaffold 29 and read sequences from Ferroplasma type II community data taken from an acid mine drainage community [1]. The image in (A) shows the entire scaffold. Read alignments were determined using BLAST and are indicated with filled light-grey bars and connected to mate pairs by a thin light-grey line. Dark regions within a read indicate where the local divergence from the reference is more than 8%. The black bar at the top surrounded by a red rectangle represents the entire reference sequence (scaffold 29 in this case). The dark grey arrows immediately below indicate gene locations. Reads outlined in yellow have alignments, along with their mate pairs (not visible in this image), that are inconsistent with the size of clones. Clusters of such reads indicate an inconsistency in gene order, usually associated with transposable elements or the insertion and deletion of genes. The image in (B) shows a zoomed-in view of the same data. The red rectangle over the reference sequence bar has shrunk to indicate the location of the zoomed view. The user has selected, via a mouse click, one gene. This gene is colored white with a grey region below to indicate its extent. Reads are now white with colored tick marks where they differ in nucleotide sequence from the reference. As detailed in (C), blue, red, purple, and green ticks indicate substitutions for the bases A, C, T and G respectively. Ticks of half height indicate extra bases in a read sequence, and missing bases are colored black. Light grey ticks indicate low-quality differences between the read and reference sequences.
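The 8% local-divergence shading described in this caption can be illustrated with a small sketch. It is not Strainer's code: the function name `divergent_windows`, the fixed non-overlapping 50-base window, and the assumption that the read and reference are already aligned to equal length without gaps are all choices made for illustration. Only the 8% threshold comes from the caption.

```python
def divergent_windows(read, ref, window=50, threshold=0.08):
    """Return (start, end) spans where the mismatch fraction exceeds threshold.

    Assumes `read` and `ref` are gap-free aligned strings of equal length.
    The window size is a hypothetical parameter, not taken from the paper.
    """
    if len(read) != len(ref):
        raise ValueError("read and reference must be aligned to equal length")
    flagged = []
    for start in range(0, len(ref), window):
        end = min(start + window, len(ref))
        # Count positions in this window where the read differs from the reference.
        mismatches = sum(1 for a, b in zip(read[start:end], ref[start:end]) if a != b)
        if mismatches / (end - start) > threshold:
            flagged.append((start, end))
    return flagged


reference = "A" * 100
read = "A" * 50 + "C" * 5 + "A" * 45  # 5 mismatches in the second window
print(divergent_windows(read, reference))
```

A viewer could then draw the returned spans in a darker shade on top of the read's light-grey bar.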
Figure 2
Building variant sequences. Illustration of the variant enumeration algorithm. Data taken from Tyson et al. 2004 [1] for Ferroplasma type I. Starting with the read labeled (1), three different paths are shown (A, B, and C) that span the gene by linking reads which overlap with no differences. To determine the sequence variants present in the data, all paths are found using all possible starting reads.
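The path-linking idea in this caption can be sketched as follows. This is a simplified illustration rather than the published implementation: the `Read` type, the function `enumerate_variants`, and the gap-free alignment model are assumptions. The sketch starts only from reads that cover the gene start and extends rightward, which is assumed here to yield the same gene-spanning variants as trying every starting read. It also requires each new read to agree with the whole merged path, a slightly stricter rule than pairwise agreement.

```python
from typing import NamedTuple


class Read(NamedTuple):
    name: str
    start: int  # 0-based reference coordinate of the first base
    seq: str    # aligned bases, assumed gap-free

    @property
    def end(self) -> int:
        return self.start + len(self.seq)


def enumerate_variants(reads, gene_start, gene_end):
    """Collect every distinct sequence spanning [gene_start, gene_end) that can
    be built by chaining reads whose overlaps contain no differences."""
    variants = set()

    def extend(start, seq):
        end = start + len(seq)
        if end >= gene_end:
            # The merged path covers the gene: record the gene-length slice.
            variants.add(seq[gene_start - start:gene_end - start])
            return
        for r in reads:
            # Candidate must overlap the merged path and reach further right.
            if start <= r.start < end < r.end:
                overlap = seq[r.start - start:]
                if r.seq[:len(overlap)] == overlap:
                    extend(start, seq + r.seq[len(overlap):])

    for r in reads:
        if r.start <= gene_start < r.end:
            extend(r.start, r.seq)
    return variants


reads = [
    Read("r1", 0, "ACGTAC"),
    Read("r2", 4, "ACGGTT"),
    Read("r3", 4, "ACGCTT"),
    Read("r4", 8, "TTAA"),
    Read("r5", 7, "GTTAA"),
]
print(sorted(enumerate_variants(reads, 0, 12)))
```

In this toy input, reads r2 and r3 differ at one position, so two variants are recovered. Read r5 is compatible only with the path through r2. The recursion can grow quickly with deep coverage, so a practical version would likely need pruning or memoization.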
Figure 3
Clusters in Leptospirillum group II. Reads in Leptospirillum group II are grouped into two distinct types. Here reads are colored using the two-toned approach. Dark vertical bars within reads indicate regions that contain SNPs. Colored backgrounds (blue and green) indicate strain designations.
Figure 4
Gene order differences. Illustrations of gene order differences appearing in Strainer. Reads from Leptospirillum group II were assembled into contigs by Phred/Phrap. One contig and its component reads were imported from the Phrap generated ACE file and are shown in (A). Reads with gene orders that do not match the reference sequence stand out due to high levels of differences. Panel (B) shows a segment of the same region at higher zoom revealing similar patterns in all the differing reads. In (C), reads from Ferroplasma type I were aligned against the Ferroplasma acidarmanus isolate genome using blastn. BLAST cuts off alignments when the similarity ends, but gene order differences are still apparent due to multiple reads being clipped (in blue) at the same point and a cluster of reads (in yellow) missing mate-pairs. Panel (D) shows the same region at higher zoom revealing the exact point at which the read alignments are trimmed.
Figure 5
Grouping reads and labeling recombinants. A "strained" scaffold shown at different zoom levels: (A) full scaffold, (B) a few thousand bases, (C) viewing individual SNPs. Environmental reads were aligned to scaffold 29 from Ferroplasma type II and manually grouped into strains. Reads placed in different strains from their mate-pairs are marked with red outlines and are linked to their mate-pairs by diagonal red lines. Also marked in red are read pairs which are placed in the same strain, but at least one read had a small region that matched a different strain.
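The first kind of marking described in this caption, mate pairs whose reads fall in different strains, amounts to a simple comparison once strain assignments exist. The sketch below assumes a hypothetical dictionary `strain_of` mapping read names to strain labels and a list of mate-pair name tuples. Neither structure is taken from Strainer itself, and the second case (a small region within a read matching another strain) is not covered.

```python
def find_split_pairs(strain_of, mate_pairs):
    """Return mate pairs whose two reads were assigned to different strains.

    Pairs with an unassigned read are skipped rather than flagged.
    """
    flagged = []
    for left, right in mate_pairs:
        s_left = strain_of.get(left)
        s_right = strain_of.get(right)
        if s_left is not None and s_right is not None and s_left != s_right:
            flagged.append((left, right))
    return flagged


# Hypothetical read names and strain labels, for illustration only.
strain_of = {"read1.f": "A", "read1.r": "A", "read2.f": "A", "read2.r": "B"}
mate_pairs = [("read1.f", "read1.r"), ("read2.f", "read2.r"), ("read3.f", "read3.r")]
print(find_split_pairs(strain_of, mate_pairs))
```

Flagged pairs are candidates for recombination events, though a misassigned read would produce the same signal, so each one warrants manual inspection.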



    1. Tyson GW, Chapman J, Hugenholtz P, Allen EE, Ram RJ, Richardson PM, Solovyev VV, Rubin EM, Rokhsar DS, Banfield JF. Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature. 2004;428:37–43. doi: 10.1038/nature02340.
    2. Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, Eisen JA, Wu D, Paulsen I, Nelson KE, Nelson W, Fouts DE, Levy S, Knap AH, Lomas MW, Nealson K, White O, Peterson J, Hoffman J, Parsons R, Baden-Tillson H, Pfannkoch C, Rogers Y, Smith HO. Environmental genome shotgun sequencing of the Sargasso Sea. Science. 2004;304:66–74. doi: 10.1126/science.1093857.
    3. Tringe SG, von Mering C, Kobayashi A, Salamov AA, Chen K, Chang HW, Podar M, Short JM, Mathur EJ, Detter JC, Bork P, Hugenholtz P, Rubin EM. Comparative metagenomics of microbial communities. Science. 2005;308:554–557. doi: 10.1126/science.1107851.
    4. Hallam SJ, Putnam N, Preston CM, Detter JC, Rokhsar D, Richardson PM, DeLong EF. Reverse methanogenesis: Testing the hypothesis with environmental genomics. Science. 2004;305:1457–1462. doi: 10.1126/science.1100025.
    5. DeLong EF. Microbial population genomics and ecology: The road ahead. Environ Microbiol. 2004;6:875–878. doi: 10.1111/j.1462-2920.2004.00668.x.
