Nucleic Acids Res. 2011 Jul;39(13):e88.
doi: 10.1093/nar/gkr308. Epub 2011 May 13.

QuartetS: A Fast and Accurate Algorithm for Large-Scale Orthology Detection

Free PMC article

Chenggang Yu et al. Nucleic Acids Res.

Abstract

The unparalleled growth in the availability of genomic data offers both a challenge to develop orthology detection methods that are simultaneously accurate and high throughput and an opportunity to improve orthology detection by leveraging evolutionary evidence in the accumulated sequenced genomes. Here, we report a novel orthology detection method, termed QuartetS, that exploits evolutionary evidence in a computationally efficient manner. Based on the well-established evolutionary concept that gene duplication events can be used to discriminate homologous genes, QuartetS uses an approximate phylogenetic analysis of quartet gene trees to infer the occurrence of duplication events and discriminate paralogous from orthologous genes. We used function- and phylogeny-based metrics to perform a large-scale, systematic comparison of the orthology predictions of QuartetS with those of four other methods [bi-directional best hit (BBH), outgroup, OMA and QuartetS-C (QuartetS followed by clustering)], involving 624 bacterial genomes and >2 million genes. We found that QuartetS slightly, but consistently, outperformed the highly specific OMA method and that, while consuming only 0.5% additional computational time, QuartetS predicted 50% more orthologs with a 50% lower false positive rate than the widely used BBH method. We conclude that, for large-scale phylogenetic and functional analysis, QuartetS and QuartetS-C should be preferred, respectively, in applications where high accuracy and high throughput are required.
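For readers unfamiliar with the BBH baseline named in the abstract, the following is a minimal sketch of the general bi-directional best hit idea: two genes are paired when each is the other's highest-scoring match. The function names, the score dictionaries and the toy scores are assumptions made for illustration; they are not taken from the paper or from any particular tool.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring hit in the other genome.

    `scores` maps (query_gene, subject_gene) -> similarity score
    (for example, an alignment bit score); higher means more similar.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(x_vs_y, y_vs_x):
    """Return the pairs (x, y) in which x and y are each other's best hit."""
    best_xy = best_hits(x_vs_y)
    best_yx = best_hits(y_vs_x)
    return sorted((x, y) for x, y in best_xy.items() if best_yx.get(y) == x)


# Toy scores, invented for illustration only.
x_vs_y = {("x1", "y1"): 250.0, ("x1", "y2"): 90.0,
          ("x2", "y1"): 120.0, ("x2", "y2"): 80.0}
y_vs_x = {("y1", "x1"): 245.0, ("y1", "x2"): 118.0,
          ("y2", "x1"): 95.0, ("y2", "x2"): 70.0}

print(bidirectional_best_hits(x_vs_y, y_vs_x))
```

In this toy input only x1 and y1 choose each other, so x2 and y2 stay unpaired. BBH alone cannot tell whether such a reciprocal pair arose from speciation or from a duplication, which is the gap the quartet analysis is meant to address.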

Figures

Figure 1.
The QuartetS method establishes the homology relationship between two genes x and y from two species X and Y, respectively, by exploiting phylogeny information present in a quartet gene tree formed by these two genes and two paralogous genes z1 and z2 from a third species Z. (a and b) When we do not distinguish z1 from z2, the quartet tree can have two possible topologies, where the thickened branches that highlight the two possible paths between z1 and z2 indicate the possible locations of a duplication event implied by these genes. (a) Genes x and y are paralogs if the duplication event occurs in the inner branch where their path overlaps the path between z1 and z2. (b) Because the path between x and y does not overlap with the path between z1 and z2, any duplication event inferred along the z1–z2 path is inconsequential to the relationship between genes x and y. (c) Rooting the tree can identify the last common ancestor (or root r) in an outer branch, which is not informative. (d) Alternatively, it could identify the root in the inner branch, inferring that x and y are paralogs, where the distance α between r and its nearest inner node provides a measure of the reliability of the estimate for r. We infer that genes x and y are paralogs when α is greater than a specified cutoff value (Ω), with a larger Ω leading to fewer paralogs and more orthologs, and vice versa.
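The decision logic in this caption can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: it uses the standard four-point condition on pairwise distances to pick the quartet topology, and it takes the root placement and α as inputs because this record does not describe how QuartetS estimates them. The names quartet_topology, classify_pair, root_in_inner_branch and the toy distances are all hypothetical.

```python
def quartet_topology(d, x, y, z1, z2):
    """Pick the quartet topology with the four-point condition.

    `d` maps frozenset({a, b}) -> evolutionary distance between genes a and b.
    Returns (label, inner), where label is the split with the smallest
    distance sum and inner is the estimated inner-branch length.
    "xy|z1z2" corresponds to panel (b); the other two labels to panel (a).
    """
    def dist(a, b):
        return d[frozenset((a, b))]

    sums = [
        ("xy|z1z2", dist(x, y) + dist(z1, z2)),
        ("xz1|yz2", dist(x, z1) + dist(y, z2)),
        ("xz2|yz1", dist(x, z2) + dist(y, z1)),
    ]
    (label, s_min), (_, s_mid), (_, s_max) = sorted(sums, key=lambda kv: kv[1])
    # On an additive tree the two larger sums are equal; average them for noisy data.
    inner = ((s_mid + s_max) / 2 - s_min) / 2
    return label, inner


def classify_pair(label, root_in_inner_branch, alpha, omega):
    """Apply the caption's rule: paralogs only if the inner branch separates
    x from y, the root falls in that inner branch, and alpha exceeds omega."""
    separates_x_and_y = label != "xy|z1z2"
    if separates_x_and_y and root_in_inner_branch and alpha > omega:
        return "paralogs"
    return "uninformative"


# Toy additive tree: split {x, z1} | {y, z2}, leaf branches 1.0, inner branch 2.0.
d = {
    frozenset(("x", "z1")): 2.0, frozenset(("y", "z2")): 2.0,
    frozenset(("x", "y")): 4.0, frozenset(("x", "z2")): 4.0,
    frozenset(("y", "z1")): 4.0, frozenset(("z1", "z2")): 4.0,
}
label, inner = quartet_topology(d, "x", "y", "z1", "z2")
print(label, inner)
print(classify_pair(label, root_in_inner_branch=True, alpha=0.5, omega=0.1))
```

Raising omega in classify_pair makes the "paralogs" call harder to reach, which matches the caption's statement that a larger Ω yields fewer paralogs and more orthologs.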
Figure 2.
Function-based evaluations of the different orthology detection methods, involving 624 bacterial genomes and >2 million genes. Each entry represents the results corresponding to a given cutoff value, except for OMA, where there is only one entry corresponding to its pre-computed results. (a) Evaluations using KEGG protein function annotations. (b) Evaluations using HAMAP protein function annotations. The preferred method should yield predictions with a high fraction of predicted orthologs (FPO) and a low false positive rate (FPR), i.e. predictions close to the upper left corners of the plots. Entries close to the horizontal dashed line correspond to cutoff values at which the different methods predict a number of orthologs similar to that of OMA.
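As a rough illustration of a function-based false positive rate, the sketch below counts a predicted ortholog pair as a false positive when both genes are annotated and their function labels differ. This definition is an assumption made for the example; the exact FPO and FPR definitions used in the paper are not given in this record. The function name, the annotation dictionary and the labels F1 to F3 are hypothetical.

```python
def false_positive_rate(predicted_pairs, annotation):
    """Fraction of evaluable predicted pairs whose function labels differ.

    Assumption: a pair is evaluable only when both genes have an annotation;
    pairs with a missing annotation are ignored rather than counted as errors.
    Returns None when no pair can be evaluated.
    """
    evaluable = [(a, b) for a, b in predicted_pairs
                 if a in annotation and b in annotation]
    if not evaluable:
        return None
    mismatched = sum(1 for a, b in evaluable if annotation[a] != annotation[b])
    return mismatched / len(evaluable)


# Toy annotations and predictions, invented for illustration only.
annotation = {"x1": "F1", "y1": "F1", "x2": "F2", "y2": "F3"}
predicted_pairs = [("x1", "y1"), ("x2", "y2"), ("x3", "y3")]

print(false_positive_rate(predicted_pairs, annotation))
```

Here the pair (x3, y3) is skipped because neither gene is annotated, and one of the two remaining pairs has mismatched labels.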
Figure 3.
Function-based pair-wise comparisons between QuartetS and each of the three methods, outgroup, OMA and QuartetS with clustering (QuartetS-C), using the cutoff values associated with the horizontal dashed lines in Figure 2. (a–c) Comparisons for the 624 genomes when all genomes were compared as one group (rightmost bar) as well as when the comparisons were performed within different granularity levels, each representing distinct evolutionary relationships based on seven bacterial taxonomy ranks, ranging from the least remote relationship (i.e. strain) to the most remote (i.e. phylum). (d–f) False positive rates (FPRs) for the unique and overlapping predictions using KEGG- and HAMAP-based function annotations.
Figure 4.
Phylogeny-based pair-wise comparisons involving 120 000 genes from a subset of the 624 bacterial genomes. We used box plots to compare the congruence of a pre-specified species tree (12) with gene trees constructed from orthologous genes predicted by QuartetS and (a) outgroup, (b) OMA and (c) QuartetS-C. Higher congruence implies better orthology predictions.

References

    1. Liolios K, Chen IM, Mavromatis K, Tavernarakis N, Hugenholtz P, Markowitz VM, Kyrpides NC. The Genomes On Line Database (GOLD) in 2009: status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Res. 2010;38:D346–D354.
    2. Koonin EV. Orthologs, paralogs, and evolutionary genomics. Annu. Rev. Genet. 2005;39:309–338.
    3. Ohta T. Evolution by gene duplication revisited: differentiation of regulatory elements versus proteins. Genetica. 2003;118:209–216.
    4. Serres MH, Kerr AR, McCormack TJ, Riley M. Evolution by leaps: gene duplication in bacteria. Biol. Direct. 2009;4:46.
    5. Dufayard JF, Duret L, Penel S, Gouy M, Rechenmann F, Perriere G. Tree pattern matching in phylogenetic trees: automatic search for orthologs or paralogs in homologous gene sequence databases. Bioinformatics. 2005;21:2596–2603.
