Comparative Study
Genome Res. 2003 Sep;13(9):2178-89.
doi: 10.1101/gr.1224503.

OrthoMCL: Identification of Ortholog Groups for Eukaryotic Genomes

Free PMC article

Li Li et al. Genome Res.

Abstract

The identification of orthologous groups is useful for genome annotation, studies on gene/protein evolution, comparative genomics, and the identification of taxonomically restricted sequences. Methods successfully exploited for prokaryotic genome analysis have proved difficult to apply to eukaryotes, however, as larger genomes may contain multiple paralogous genes, and sequence information is often incomplete. OrthoMCL provides a scalable method for constructing orthologous groups across multiple eukaryotic taxa, using a Markov Cluster algorithm to group (putative) orthologs and paralogs. This method performs similarly to the INPARANOID algorithm when applied to two genomes, but can be extended to cluster orthologs from multiple species. OrthoMCL clusters are coherent with groups identified by EGO, but improved recognition of "recent" paralogs permits overlapping EGO groups representing the same gene to be merged. Comparison with previously assigned EC annotations suggests a high degree of reliability, implying utility for automated eukaryotic genome annotation. OrthoMCL has been applied to the proteome data set from seven publicly available genomes (human, fly, worm, yeast, Arabidopsis, the malaria parasite Plasmodium falciparum, and Escherichia coli). A Web interface allows queries based on individual genes or user-defined phylogenetic patterns (http://www.cbil.upenn.edu/gene-family). Analysis of clusters incorporating P. falciparum genes identifies numerous enzymes that were incompletely annotated in first-pass annotation of the parasite genome.

Figures

Figure 1
Flow chart of the OrthoMCL algorithm for clustering orthologous proteins.
Figure 2
Illustration of sequence relationships and similarity matrix construction. Dotted arrows represent “recent” paralogy (duplication subsequent to speciation); solid arrows represent orthology. The upper right half of the matrix contains initial weights calculated as average –log10 (P-value) from pairwise WU-BLASTP similarities. The lower left half contains corrected weights supplied to the MCL algorithm; the edge weight connecting each pair of sequences wij is divided by Wij/W, where W represents the average weight among all ortholog (underlined) and “recent” paralog (italicized) pairs, and Wij represents the average edge weight among all ortholog pairs from species i and j. The net result of this normalization is to correct for systematic differences in comparisons between two species (e.g., differences attributable to nucleotide composition bias), and when i = j, to minimize the impact of “recent” paralogs (duplication within a given species) on the clustering of cross-species orthologs.
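
The normalization described in this caption, dividing each edge weight wij by Wij/W, can be illustrated with a minimal Python sketch. It is not the OrthoMCL implementation: the function name `normalize_weights`, the dictionary-based inputs, and the toy weights are assumptions made for illustration, and weights are assumed to be positive.

```python
from collections import defaultdict


def normalize_weights(edges, species_of):
    """Divide each edge weight by (W_ij / W).

    edges:      dict mapping (seq_a, seq_b) -> initial weight (hypothetical input)
    species_of: dict mapping sequence id -> species name (hypothetical input)
    """
    if not edges:
        return {}

    # W: average weight over all ortholog and "recent" paralog pairs.
    overall = sum(edges.values()) / len(edges)

    def species_pair(a, b):
        # Order-independent key for the two species of an edge.
        return tuple(sorted((species_of[a], species_of[b])))

    # W_ij: average weight over the pairs linking species i and j
    # (when i == j these are the within-species "recent" paralog pairs).
    grouped = defaultdict(list)
    for (a, b), w in edges.items():
        grouped[species_pair(a, b)].append(w)
    pair_avg = {key: sum(ws) / len(ws) for key, ws in grouped.items()}

    return {
        (a, b): w / (pair_avg[species_pair(a, b)] / overall)
        for (a, b), w in edges.items()
    }


# Toy data, invented for illustration only.
species_of = {"h1": "human", "h2": "human", "f1": "fly", "f2": "fly"}
edges = {("h1", "f1"): 100.0, ("h2", "f2"): 200.0, ("h1", "h2"): 50.0}
corrected = normalize_weights(edges, species_of)
```

After this correction, the average weight within every species pair equals the overall average W, which is why systematic between-species differences no longer dominate the clustering input.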
Figure 3
Example of a group from the EGO subset that is extended by OrthoMCL. Five synaptobrevin genes were clustered together by OrthoMCL (GroupID #379767), including yeast SNC1 and SNC2, fly Syb and n-syb, and worm snb-1. Thick solid arrows represent orthology identified by reciprocal best matches, dotted arrows represent “recent” paralogs, and thin solid arrows represent one-way best matches indicating the direction from query to subject (based on BLASTP comparisons). Only snb-1, n-syb, and Syb (dark gray) were identified by the EGO subset (groups TOG257010, TOG272289, TOG273790), and these genes were only grouped because their gene index sequences (TC72314, TC140251, TC134828) formed “triangles” of reciprocal best matches based on BLASTN comparisons with other species not shown in this analysis.
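
The distinction this caption draws between reciprocal best matches and one-way best matches can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `reciprocal_best_matches`, the single `best_hit` dictionary (query mapped to its best-scoring subject in the other genome), and the sequence identifiers are hypothetical, and no real BLAST results are used.

```python
def reciprocal_best_matches(best_hit):
    """Return the pairs whose best hits point at each other.

    best_hit: dict mapping a query sequence id to the id of its
              best-scoring subject in the other genome (hypothetical input).
    """
    pairs = set()
    for query, subject in best_hit.items():
        # A pair is reciprocal only if the subject's best hit is the query.
        if query != subject and best_hit.get(subject) == query:
            pairs.add(frozenset((query, subject)))
    return pairs


# Invented identifiers: a1 and b1 are each other's best hit,
# whereas a2 -> b1 and b2 -> a2 are one-way best matches.
best_hit = {"a1": "b1", "b1": "a1", "a2": "b1", "b2": "a2"}
reciprocal = reciprocal_best_matches(best_hit)
```

One-way best matches such as `a2 -> b1` are left out of the result, which corresponds to the thin arrows in the figure rather than the thick ones.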
Figure 4
Screenshots of the Web interface. A keyword search (top left) identifies 11 ortholog groups containing sequences with the word “tubulin” in sequence name or description (top right). Clicking the group ID pulls up a page describing sequences in the group (bottom left), a graphical display of relationships among these sequences (bottom right), and a CLUSTALW multiple sequence alignment (bottom center).
