PLoS One. 2012;7(6):e39107.
doi: 10.1371/journal.pone.0039107. Epub 2012 Jun 18.

Gegenees: Fragmented Alignment of Multiple Genomes for Determining Phylogenomic Distances and Genetic Signatures Unique for Specified Target Groups

Free PMC article

Joakim Agren et al. PLoS One. 2012.


The rapid development of Next Generation Sequencing technologies leads to the accumulation of huge amounts of sequencing data. The scientific community faces an enormous challenge in how to deal with this explosion. Here we present a software tool, 'Gegenees', that uses a fragmented alignment approach to facilitate the comparative analysis of hundreds of microbial genomes. The genomes are fragmented and compared, all against all, by a multithreaded BLAST control engine. Ready-made alignments can be complemented with new genomes without recalculating the existing data points. Gegenees gives a phylogenomic overview of the genomes and the alignment can then be mined for genomic regions with conservation patterns matching a defined target group and absent from a background group. The genomic regions are given biomarker scores forming a uniqueness signature that can be viewed and explored, graphically and in tabular form. A primer/probe alignment tool is also included for specificity verification of currently used or new primers. We exemplify the use of Gegenees on the Bacillus cereus group, on Foot and Mouth Disease Viruses, and on strains from the 2011 Escherichia coli O104:H4 outbreak. Gegenees contributes towards an increased capacity of fast and efficient data mining as more and more genomes become sequenced.
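The fragmentation step described in the abstract can be illustrated with a short sketch. This is not the Gegenees implementation. It assumes a plain sliding window, and the names `fragment_genome`, `fragment_size` and `step_size` are hypothetical. It also assumes that a setting written as 200/100 means a fragment length of 200 with a step of 100, which this page does not state explicitly.

```python
from typing import Iterator, Tuple


def fragment_genome(
    sequence: str, fragment_size: int, step_size: int
) -> Iterator[Tuple[int, str]]:
    """Yield (start, fragment) pairs from a sliding window over the sequence.

    Assumption: trailing bases that do not fill a whole window are skipped.
    """
    if fragment_size <= 0 or step_size <= 0:
        raise ValueError("fragment_size and step_size must be positive")
    last_start = max(len(sequence) - fragment_size, 0)
    for start in range(0, last_start + 1, step_size):
        yield start, sequence[start:start + fragment_size]


# Toy example: a 20-base sequence cut into 8-base fragments every 4 bases.
genome = "ACGTACGTACGTACGTACGT"
fragments = list(fragment_genome(genome, fragment_size=8, step_size=4))
for start, fragment in fragments:
    print(start, fragment)
```

Each fragment would then be submitted as a separate BLAST query against every other genome. A step smaller than the fragment size gives overlapping fragments and a finer resolution, at the cost of more queries.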

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.


Figure 1. Overview of Gegenees.
The Gegenees workspace contains one or several local databases. Genomes can be downloaded from the NCBI ftp site or from custom ftp sites through a built-in ftp client. This client compares the content of the local database with the remote one and highlights genomes already present locally. Unpublished genomes or genomes downloaded from other sources can be imported. The Gegenees workspace can also contain comparison projects. Genomes are added to the comparison from the local database. Genomes already in the active comparison are highlighted in the local database to facilitate the update process. Comparisons can also be downloaded from or shared between labs and imported into the workspace. One or several fragmented alignments can be made in the comparison with custom-specified resolution. Large alignments are associated with lengthy calculations and can therefore be paused and later resumed. Genomes can also later be added to a completed alignment that is then updated with the missing data points. When an alignment has been completed, the phylogenomic context can be analyzed in heat-plots. Nexus files can be exported for dendrogram construction and heat-plots can be exported for high-resolution printouts. The alignment can then be analyzed in terms of biomarker scores and uniqueness signatures. A target and a background group are defined on the basis of strain phenotypes and phylogenomic overview. The resulting conservation pattern signature can then be viewed and explored graphically or in tables. The signatures can also be exported to Artemis. Primers and/or probes can be designed from the signatures and candidate primers can be added back to Gegenees in the form of a primer/probe alignment. Primer specificity can then be analyzed in terms of mismatches in the target and background groups.
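The final step of this workflow, checking primer specificity by counting mismatches, can be sketched as follows. This is only an illustration under simplifying assumptions: a gapless comparison on the forward strand, with the hypothetical helper `min_mismatches` and toy sequences standing in for real genomes. It does not reproduce the primer/probe alignment tool in Gegenees.

```python
def min_mismatches(primer: str, genome: str) -> int:
    """Return the fewest mismatches of the primer at any gapless position.

    Assumption: forward strand only, no gaps, no ambiguity codes.
    """
    if not primer or len(primer) > len(genome):
        raise ValueError("primer must be non-empty and no longer than the genome")
    best = len(primer)
    for start in range(len(genome) - len(primer) + 1):
        window = genome[start:start + len(primer)]
        mismatches = sum(1 for a, b in zip(primer, window) if a != b)
        best = min(best, mismatches)
    return best


# Toy target and background sequences (hypothetical data).
primer = "ACGT"
target_hit = min_mismatches(primer, "TTACGTTT")      # perfect match exists
background_hit = min_mismatches(primer, "TTACCTTT")  # best site has one mismatch
print(target_hit, background_hit)
```

A primer that is specific to the target group would show few or no mismatches in every target genome and several mismatches in every background genome.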
Figure 2. Gegenees calculation speed.
Calculation benchmark made on a workstation equipped with a 3.2 GHz Intel i7-970 processor (6 cores with hyper-threading, i.e., 12 simultaneous threads). A. The Gegenees source code was modified so that the number of simultaneous threads was limited to 1, 2, 3…. The time for completing a comparison with 10 Bacillus spp. genomes (∼5 Mb each) with BLAST (blastall) or BLAST+ was measured. When no thread limit was used, Gegenees chose to use 12 threads on this machine. B. Time required for completing an alignment with an increasing number of Bacillus spp. genomes. Progressive Mauve (version 2.3.1) with default settings and Gegenees with different settings (500/500 or 200/100 using blastall or BLAST+) were compared. The asterisk indicates the upper limit of genomes (30) we could align in Progressive Mauve on this machine.
Figure 3. Phylogenomic overview in Gegenees.
Both heat-plots of the similarity matrices and trees created from the same data are shown. A. A Gegenees heat-plot over a set of Bacillus strains that had previously been analyzed by MLST. The heat-plot is based on a fragmented alignment using BLASTN made with settings 200/100. The cutoff threshold for non-conserved material was 30%. A dendrogram was produced in SplitsTree 4 (using neighbor joining method) made from a Nexus file exported from Gegenees. B. cytotoxicus was set as outgroup. The clustering is very similar to previously published trees. The scale bar represents a 1% difference in average BLASTN score similarity. B. A Gegenees heat-plot over a set of yeast genomes that has been analyzed before with different phylogenomic methods. These genomes are more distant from each other and a BLASTN comparison does not resolve them well (data not shown). A fragmented alignment in TBLASTX mode was performed with settings 200/200. The cutoff threshold for non-conserved material was 20%. A dendrogram was produced in SplitsTree 4 (using neighbor joining method) made from a distance matrix Nexus file exported from Gegenees. Y. lipolytica was set as outgroup. The clustering here is also very similar to the previously published trees. The scale bar represents a 10% difference in average TBLASTX score similarity.
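The export of a distance matrix Nexus file mentioned in this caption can be sketched as below. The conversion used here (distance = 100 minus the mean of the two directional similarity percentages) is an assumption made for illustration and may differ from what Gegenees does. The function names, strain labels and similarity values are hypothetical.

```python
from typing import List, Sequence


def similarity_to_distance(sim: Sequence[Sequence[float]]) -> List[List[float]]:
    """Turn a (possibly asymmetric) percent-similarity matrix into distances.

    Assumption: the two directions are averaged and subtracted from 100.
    """
    n = len(sim)
    return [
        [0.0 if i == j else 100.0 - (sim[i][j] + sim[j][i]) / 2.0 for j in range(n)]
        for i in range(n)
    ]


def to_nexus(labels: Sequence[str], dist: Sequence[Sequence[float]]) -> str:
    """Format a square distance matrix as a Nexus TAXA + DISTANCES file."""
    n = len(labels)
    lines = [
        "#NEXUS",
        "BEGIN TAXA;",
        f"  DIMENSIONS NTAX={n};",
        "  TAXLABELS " + " ".join(labels) + ";",
        "END;",
        "BEGIN DISTANCES;",
        f"  DIMENSIONS NTAX={n};",
        "  FORMAT TRIANGLE=BOTH DIAGONAL LABELS;",
        "  MATRIX",
    ]
    for label, row in zip(labels, dist):
        lines.append("  " + label + " " + " ".join(f"{d:.2f}" for d in row))
    lines += ["  ;", "END;"]
    return "\n".join(lines)


# Hypothetical three-strain similarity matrix (percent, row vs. column).
labels = ["strain_A", "strain_B", "strain_C"]
similarity = [[100, 90, 60], [88, 100, 62], [58, 64, 100]]
distances = similarity_to_distance(similarity)
nexus_text = to_nexus(labels, distances)
print(nexus_text)
```

The resulting text can be saved to a file and opened in a tree-building program that reads Nexus distance blocks. Labels containing spaces would need quoting, which this sketch avoids by using underscores.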
Figure 4. Comparative analysis of the Bacillus cereus group.
A heat-plot based on a 200/100 BLASTN fragmented alignment without threshold is shown. The figure is cropped to show only the Bacillus cereus group. Target groups used for PCR design are indicated (T1–T5). All remaining Bacillus genomes were used as a background group. This analysis was made without a threshold to filter non-conserved genetic material. Viewing the heat-plot without a threshold means that the values are based on both the core genome size and the core conservation. This often gives a better view during target group formulation because signatures are by definition outside the core when comparing a target genome with a background genome. Inset A shows the uniqueness signature for B. anthracis (T1). Signatures for all groups are present in Figure S3. Inset B shows a dendrogram based on the heat-plot. The dendrogram was produced in SplitsTree 4 (using neighbor joining method) made from a distance matrix Nexus file exported from Gegenees. B. cytotoxicus was set as outgroup.
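The effect of the threshold on a heat-plot value can be made concrete with a small sketch. It assumes that each fragment carries a similarity score expressed as a percentage and that a heat-plot cell is the mean of those scores, with fragments below the threshold excluded when a threshold is set. The function `average_similarity` and the scores are hypothetical, and whether the real cutoff is inclusive is not stated on this page.

```python
from typing import Optional, Sequence


def average_similarity(
    fragment_scores: Sequence[float], threshold: Optional[float] = None
) -> float:
    """Mean fragment score, optionally ignoring fragments below the threshold.

    Assumption: scores are percentages and the cutoff is inclusive (>=).
    """
    if threshold is None:
        kept = list(fragment_scores)
    else:
        kept = [score for score in fragment_scores if score >= threshold]
    if not kept:
        return 0.0
    return sum(kept) / len(kept)


# Hypothetical scores: three conserved fragments, two largely absent ones.
scores = [100, 95, 90, 10, 0]
without_threshold = average_similarity(scores)             # size and conservation
with_threshold = average_similarity(scores, threshold=30)  # conservation of core only
print(without_threshold, with_threshold)
```

With these toy numbers the value without a threshold is 59.0, because the two missing fragments pull the mean down, while the 30% threshold gives 95.0, reflecting only how well the shared material is conserved.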
Figure 5. Signature analysis of Foot-and-Mouth Disease Virus (FMDV) serotypes.
A fragmented alignment was performed with 50/25 settings using BLASTN (BLAST+). Target groups were formulated according to the serotype definitions. All other serotypes were used as background. The ‘maximum background/average target’ setting was used for biomarker score calculation. The annotations shown come from the type Asia 1 isolate IND 13–91 (DQ989312). VP1–VP4 constitute the capsid proteins that are exposed on the virus particle and are therefore important determinants for serotype classification.
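A per-fragment score in the spirit of the ‘maximum background/average target’ setting might look like the sketch below. The exact formula used by Gegenees is not given on this page, so the combination chosen here (average target conservation multiplied by one minus the highest background conservation) is purely an assumption for illustration. The function name and the conservation values, scaled from 0 to 1, are hypothetical.

```python
from typing import Sequence


def biomarker_score(
    target_conservation: Sequence[float],
    background_conservation: Sequence[float],
) -> float:
    """Score one fragment from its conservation in target and background genomes.

    Assumed formula (hypothetical): mean(target) * (1 - max(background)).
    Inputs are conservation values between 0 and 1.
    """
    if not target_conservation:
        raise ValueError("at least one target genome is required")
    average_target = sum(target_conservation) / len(target_conservation)
    max_background = max(background_conservation, default=0.0)
    return average_target * (1.0 - max_background)


# Hypothetical fragment: well conserved in the target serotype,
# weakly matched in the background serotypes.
score_unique = biomarker_score([0.9, 1.0, 0.8], [0.1, 0.3])
# A fragment that one background genome matches perfectly scores zero.
score_shared = biomarker_score([0.9, 1.0, 0.8], [0.1, 1.0])
print(round(score_unique, 2), score_shared)
```

Using the maximum over the background group is the strict choice: a single background genome with a good match is enough to disqualify a fragment as a signature.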
Figure 6. Signature analysis of the E. coli O104:H4 strain from the food poisoning outbreak in 2011.
A. A signature representing the genetic material that the outbreak strain LB226692 (accession AFOB02000000) has in common with previous severe food-poisoning outbreak strains (Sakai Japan 1996 (accession NC_002695), Michigan and Oregon 1982 (accession NC_002655), the spinach outbreak in western USA 2006 (accession NC_013008) and the lettuce outbreak in eastern USA 2006 (accession NZ_ABKY00000000)) but not in common with a background strain representing another E. coli O104 strain 55989 (accession NC_011748). B. Plasmid profiling using Gegenees. Two O104 isolates, one from the 2011 outbreak (LB226692) and the other a HUS-associated O104 strain from 2001 (accession AFPS01000000), were compared to a set of plasmids with a fragmented alignment 200/100 using BLASTN.


