Nucleic Acids Res. 2016 Nov 16;44(20):9600-9610.
doi: 10.1093/nar/gkw843. Epub 2016 Sep 26.

Finding Approximate Gene Clusters With Gecko 3

Free PMC article

Sascha Winter et al. Nucleic Acids Res.

Abstract

Gene-order-based comparison of multiple genomes provides signals for functional analysis of genes and the evolutionary process of genome organization. Gene clusters are regions of co-localized genes on genomes of different species. The rapid increase in sequenced genomes necessitates bioinformatics tools for finding gene clusters in hundreds of genomes. Existing tools are often restricted to a few (in many cases, only two) genomes, and often make restrictive assumptions such as short perfect conservation, conserved gene order or monophyletic gene clusters. We present Gecko 3, open-source software for finding gene clusters in hundreds of bacterial genomes that comes with an easy-to-use graphical user interface. The underlying gene cluster model is intuitive, can cope with low degrees of conservation as well as misannotations, and is complemented by a sound statistical evaluation. To evaluate the biological benefit of Gecko 3 and to exemplify our method, we search for gene clusters in a dataset of 678 bacterial genomes using Synechocystis sp. PCC 6803 as a reference. We confirm detected gene clusters by reviewing the literature and comparing them to a database of operons; we detect two novel clusters, which were confirmed by publicly available experimental RNA-Seq data. The computational analysis is carried out on a laptop computer in <40 min.

Figures

Figure 1.
Workflow proposed for the analysis of gene order data using Gecko 3. (1) Gene order information is imported from GenBank files. Homologous gene families are (2a) imported from a database such as STRING (36) or (2b) computed using all-against-all BLAST of the gene sequences, then applying a tool for finding gene families such as TransClust (30). We supply Python scripts for this step of the analysis pipeline. (3) The combination of gene order information and homology classification is imported into Gecko 3. (4) Gecko 3 finds all (hypothetical) gene clusters that are within the parameters given by the user. (5) Each gene cluster is evaluated by its P-value (significance), which estimates the probability of encountering a gene cluster of this quality in randomized genomes. (6) Gene clusters are sorted by P-value, and (7) those showing a large overlap with a better gene cluster can be hidden in the user interface.
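Steps (6) and (7) of the workflow can be illustrated with a minimal Python sketch. This is not Gecko 3's implementation: the `Cluster` class, the overlap measure (fraction of a cluster's genes shared with a better-ranked cluster) and the `max_overlap` threshold are assumptions made only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    cluster_id: int
    p_value: float
    genes: frozenset  # genes covered by the cluster (hypothetical representation)


def rank_and_filter(clusters, max_overlap=0.5):
    """Sort clusters by ascending P-value and separate out those that
    largely overlap a better-ranked cluster.

    The overlap measure and threshold are illustrative assumptions.
    """
    ranked = sorted(clusters, key=lambda c: c.p_value)
    visible, hidden = [], []
    for position, cluster in enumerate(ranked):
        better = ranked[:position]  # every cluster with a smaller P-value
        overlapped = bool(cluster.genes) and any(
            len(cluster.genes & other.genes) / len(cluster.genes) > max_overlap
            for other in better
        )
        (hidden if overlapped else visible).append(cluster)
    return visible, hidden


# Synthetic example data, not taken from the paper.
example = [
    Cluster(2, 1e-10, frozenset({3, 4, 5})),
    Cluster(1, 1e-30, frozenset({1, 2, 3, 4})),
    Cluster(3, 1e-5, frozenset({8, 9})),
]
visible, hidden = rank_and_filter(example)
```

In this toy input, cluster 2 shares two of its three genes with the better-scoring cluster 1 and is therefore hidden, while clusters 1 and 3 remain visible.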
Figure 2.
The Gecko 3 user interface after a gene cluster search has finished and one of the clusters has been selected for closer observation. ‘Score’ and ‘C-Score’ are negative logarithms (base 10) of the estimated P-value (uncorrected and FDR-corrected, respectively); for example, C-Score 395.284 corresponds to corrected P-value 10^(−395.284) = 5.20 × 10^(−396).
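The conversion from a score back to a P-value can be checked with a short Python sketch. Because 10^(−395.284) is far below the smallest positive double-precision number, the value is returned as a mantissa and an exponent rather than as a single float. The function name is hypothetical and not part of Gecko 3.

```python
import math


def score_to_p_value_parts(score):
    """Split 10**(-score) into (mantissa, exponent) with 1 <= mantissa < 10.

    Works on the logarithm directly, so very large scores do not underflow.
    """
    exponent = math.floor(-score)
    mantissa = 10 ** (-score - exponent)
    return mantissa, exponent


mantissa, exponent = score_to_p_value_parts(395.284)
print(f"{mantissa:.2f} x 10^{exponent}")
```

For the C-Score in the caption this reproduces the stated mantissa of about 5.20 and exponent of −396.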
Figure 3.
Selected occurrences of the gene cluster with ID 520 in Table 1 found in the STRING dataset. All genes of identical color belong to the same gene family; gene annotations are taken from RefSeq notes if available. In total, the cluster is found in 129 genomes. In the reference genome Synechocystis sp. PCC 6803, the cluster has five protein coding sequences, but one gene with locus_tag slr0902, named moaC, has two different functional units (annotated COG0315, MoaC and COG0746, MobA in the RefSeq notes), here depicted in red and brown. For Gluconobacter oxydans, two other functional units (COG0746, MobA and COG1977, MoaD) are located on the same coding sequence, illustrated in brown and dark blue. Apart from that, the cluster is perfectly conserved between these two genomes. The gene order in Escherichia coli is well conserved, but two genes are missing (moeA and mobA) and moaB is inserted. In Phenylobacterium zucineum we find all but one (mobA) of the gene families of the reference genes, and again an inserted moaB gene, but with deviating gene order. Lysinibacillus sphaericus contains all genes from Synechocystis sp. PCC 6803, but we find three additional genes in that region and moeA is at a different position. The orientation of genes varies. Entries ‘ins.’, ‘del.’ and ‘sum’ give, for each occurrence relative to the reference gene cluster, the number of additional genes, the number of missing genes and the sum of the two.
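The ‘ins.’, ‘del.’ and ‘sum’ entries can be illustrated with a small Python sketch that compares the gene families of an occurrence with those of the reference cluster. The family labels are placeholders, and treating both sides as multisets of family labels is an assumption; the exact counting rules used by Gecko 3 are not given in the caption.

```python
from collections import Counter


def compare_to_reference(reference, occurrence):
    """Count gene families added to or missing from an occurrence,
    treating both sides as multisets of family labels."""
    ref, occ = Counter(reference), Counter(occurrence)
    inserted = sum((occ - ref).values())  # in the occurrence, not in the reference
    deleted = sum((ref - occ).values())   # in the reference, not in the occurrence
    return {"ins.": inserted, "del.": deleted, "sum": inserted + deleted}


# Placeholder family labels, not the actual families of cluster 520.
reference = ["A", "B", "C", "D", "E"]
occurrence = ["A", "B", "X", "C"]
counts = compare_to_reference(reference, occurrence)
```

Here the occurrence lacks families D and E and carries the extra family X, giving one insertion, two deletions and a sum of three. Gene order and orientation are deliberately ignored.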
Figure 4.
Venn diagrams comparing results of Gecko 3, DOOR 2.0 (45) and operons reported by Kopf et al. (39). Gecko 3 is run with default parameters (left) and with parameters using a reduced minimum size and an increased maximum distance, which increases the sensitivity but decreases the specificity of the method (right, see Supplementary Table S3). We combine clusters and operons in the three datasets based on connected components.
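Combining clusters and operons by connected components can be sketched in Python with a union-find structure. The caption does not say when two entries count as connected; this sketch assumes they are connected when they share at least one gene, and the gene identifiers are placeholders.

```python
def merge_by_connected_components(gene_sets):
    """Merge gene sets that are transitively linked by shared genes.

    Assumption: two sets are connected if they share at least one gene.
    """
    parent = list(range(len(gene_sets)))

    def find(i):
        # Find the representative of i, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_owner = {}  # gene -> index of the first set that contained it
    for index, genes in enumerate(gene_sets):
        for gene in genes:
            if gene in first_owner:
                parent[find(index)] = find(first_owner[gene])
            else:
                first_owner[gene] = index

    components = {}
    for index, genes in enumerate(gene_sets):
        components.setdefault(find(index), set()).update(genes)
    return sorted(sorted(component) for component in components.values())


# Placeholder entries pooled from several hypothetical sources.
pooled = [{"g1", "g2"}, {"g2", "g3"}, {"g7", "g8"}, {"g3", "g4"}]
merged = merge_by_connected_components(pooled)
```

The first, second and fourth sets are chained through g2 and g3 and collapse into one component, while the third set stays separate.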
Figure 5.
Confirmation of the four gene clusters 9, 57, 485 and 584 as operons using six RNA-Seq libraries from (39). Gene clusters ID 9 (pchR operon, top left), ID 57 (pilJ operon, top right), ID 485 (rcs operon, bottom left) and ID 584 (cysW, cysT, cysA operon, bottom right). Notably, the rcs operon can be extended by another gene on the antisense strand that is part of the conserved cluster. In each subfigure, the upper half refers to the plus strand and the lower half to the minus strand. The y-axis is adjusted to 100 reads per RNA-Seq library. Orange and green: untreated; black: dark, no light for 12 h; yellow: high light, 470 μmol quanta m⁻² s⁻¹ for 30 min; red: heat stress, 42 °C for 30 min; light blue: cold stress, 15 °C for 30 min; blue: annotation. Continuous read coverage across several genes of a cluster is commonly interpreted as evidence of an operon. Genes boxed in red were detected as part of the cluster.
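The idea of continuous read coverage can be made concrete with a minimal Python sketch. It checks whether every position between the first and last gene of a cluster reaches a minimum read depth on one strand. The coverage values, coordinates and the `min_reads` threshold are synthetic assumptions; the figure itself relies on visual inspection of the libraries from (39).

```python
def continuously_covered(coverage, start, end, min_reads=1):
    """Return True if every position in [start, end) has at least
    min_reads reads. `coverage` is a per-position depth list for one strand."""
    if end <= start:
        raise ValueError("end must be greater than start")
    return all(depth >= min_reads for depth in coverage[start:end])


# Synthetic per-position read depth for one strand.
depth = [0, 0, 5, 7, 6, 4, 3, 8, 0, 0]
# Two hypothetical adjacent genes at positions [2, 5) and [5, 8).
spans_both_genes = continuously_covered(depth, 2, 8)
includes_gap = continuously_covered(depth, 1, 8)
```

In this toy input the region spanning both genes is covered without interruption, whereas extending it one position upstream includes an uncovered base.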

References

    1. Tang H., Bowers J.E., Wang X., Ming R., Alam M., Paterson A.H. Synteny and collinearity in plant genomes. Science. 2008;320:486–488.
    2. Overbeek R., Fonstein M., D'Souza M., Pusch G.D., Maltsev N. The use of gene clusters to infer functional coupling. Proc. Natl. Acad. Sci. U.S.A. 1999;96:2896–2901.
    3. Wolf Y.I., Rogozin I.B., Kondrashov A.S., Koonin E.V. Genome alignment, evolution of procaryotic genome organization, and prediction of gene function using genomic context. Genome Res. 2001;11:356–372.
    4. Korbel J.O., Jensen L.J., von Mering C., Bork P. Analysis of genomic context: prediction of functional associations from conserved bidirectionally transcribed gene pairs. Nat. Biotechnol. 2004;22:911–917.
    5. Schwartze V.U., Winter S., Shelest E., Marcet-Houben M., Horn F., Wehner S., Linde J., Valiante V., Sammeth M., Riege K., et al. Gene expansion shapes genome architecture in the human pathogen Lichtheimia corymbifera: an evolutionary genomics analysis in the ancient terrestrial Mucorales (Mucoromycotina). PLOS Genet. 2014;10:e1004496.