PLoS Comput Biol. 2011 Dec;7(12):e1002269.
doi: 10.1371/journal.pcbi.1002269. Epub 2011 Dec 1.

Identifying Single Copy Orthologs in Metazoa

Free PMC article

Christopher J Creevey et al. PLoS Comput Biol. 2011.


The identification of single copy (1-to-1) orthologs in any group of organisms is important for functional classification and phylogenetic studies. The Metazoa are no exception, but only recently has there been a wide enough distribution of taxa with sufficiently high-quality sequenced genomes to gain confidence in the widespread single copy status of a gene.

Here, we present a phylogenetic approach for identifying overlooked single copy orthologs from multigene families and apply it to the Metazoa. Using 18 sequenced metazoan genomes of high quality, we identified a robust set of 1,126 orthologous groups that have been retained in single copy since the last common ancestor of Metazoa. We found that the phylogenetic procedure increased the number of single copy orthologs found by over a third compared with standard taxon-count approaches. The orthologs represented a wide range of functional categories, expression profiles and levels of divergence.

To demonstrate the value of our set of single copy orthologs, we used them to assess the completeness of 24 currently published metazoan genomes and 62 EST datasets. We found that the annotated genes in published genomes vary in coverage from 79% (Ciona intestinalis) to 99.8% (human) with an average of 92%, suggesting a value for the underlying error rate in genome annotation, and a strategy for identifying single copy orthologs in larger datasets. In contrast, the vast majority of EST datasets with no corresponding genome sequence available are largely under-sampled and probably do not accurately represent the actual genomic complement of the organisms from which they are derived.
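As a rough illustration of the taxon-count baseline mentioned in the abstract, the sketch below accepts an orthologous group only when every species contributes exactly one gene. The function name, the `(species, gene_id)` data layout and the toy species list are assumptions made for this example; the abstract does not specify how the taxon-count approaches it compares against were implemented.

```python
from collections import Counter

def single_copy_by_taxon_count(group, species):
    """Return True if every listed species has exactly one gene in the group.

    group   -- list of (species, gene_id) pairs for one orthologous group
    species -- species that must each be represented exactly once
    """
    copies = Counter(sp for sp, _gene in group)
    return all(copies[sp] == 1 for sp in species)

# Toy data (hypothetical identifiers, not taken from the paper).
species = ["human", "fly", "worm"]
clean_group = [("human", "h1"), ("fly", "f1"), ("worm", "w1")]
# A multigene family: two copies per species, e.g. after an old duplication.
multigene_family = clean_group + [("human", "h2"), ("fly", "f2"), ("worm", "w2")]

print(single_copy_by_taxon_count(clean_group, species))       # True
print(single_copy_by_taxon_count(multigene_family, species))  # False
```

The second call shows why a pure count can overlook candidates: a family may contain a subtree that has stayed in single copy, yet the family as a whole fails the test. Recovering such subtrees is what the phylogenetic approach is described as addressing.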

Conflict of interest statement

The authors have declared that no competing interests exist.


Figure 1. Project workflow.
The analysis workflow is divided into 3 major steps. The first step (Eukaryotic guide tree construction) constructs the guide tree used to infer duplication and loss events. The second step (Identification of core metazoan gene families) is the core of our method: the identification of the single copy genes within the eggNOG database. The last step extracts the single copy genes from the EST datasets.
Figure 2. Gene tree reconciliation process.
Reconciling a gene tree with a (guide) species tree. A) Given the species tree on the left, we need to estimate the most parsimonious number of duplications and losses that explain the topology and distribution of the gene tree (on the right). To assess the number of duplications and losses correctly, we need to find the best rooting of the gene tree. To this end, the gene tree is rooted at every possible position, and for each rooting the most parsimonious number of duplications and losses is calculated. The rooting that requires the fewest steps (duplications and losses) is considered the most parsimonious rooting of the gene tree. For example, the reconciliations for two possible rootings, at positions X and Y, are shown in panels B) and C). The positions of duplication events are indicated with a diamond; losses are indicated with a dashed line. B) Rooting the gene tree at position X requires duplication and two losses, while C) rooting at position Y requires 1 duplication and 1 loss. Of the two rootings, position Y is the more parsimonious. The numbers on the internal branches indicate the internal branch of the species tree in A) that they are mapped to. If we were trying to identify single copy genes at the hierarchical level of internal branch 2 on the species tree, then the sub-tree marked with a * in C) would represent a gene family that has been in single copy since this hierarchical level.
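The rooting search described in this caption can be sketched with the textbook LCA-mapping form of duplication/loss parsimony. This is a minimal illustration under stated assumptions, not the authors' software: trees are written as nested tuples, both trees are assumed to be strictly binary, gene tree leaves are labelled only by species name, and the function names (`species_index`, `reconcile`, `rootings`, `best_rooting`) are invented for the example.

```python
from itertools import count

def species_index(tree):
    """Return (parent, depth, leaf) lookup tables for a nested-tuple species tree."""
    parent, depth, leaf = {}, {}, {}
    ids = count()
    def walk(node, par, d):
        nid = next(ids)
        parent[nid], depth[nid] = par, d
        if isinstance(node, str):
            leaf[node] = nid
        else:
            for child in node:
                walk(child, nid, d + 1)
    walk(tree, None, 0)
    return parent, depth, leaf

def lca(a, b, parent, depth):
    """Lowest common ancestor of two species tree nodes."""
    while a != b:
        if depth[a] >= depth[b]:
            a = parent[a]
        else:
            b = parent[b]
    return a

def reconcile(gene_tree, species_tree):
    """Count (duplications, losses) for one rooted binary gene tree."""
    parent, depth, leaf = species_index(species_tree)
    dups = losses = 0
    def visit(node):
        nonlocal dups, losses
        if isinstance(node, str):
            return leaf[node]
        left, right = (visit(child) for child in node)
        here = lca(left, right, parent, depth)
        is_dup = here in (left, right)  # a child maps to the same species node
        dups += int(is_dup)
        for child in (left, right):
            gap = depth[child] - depth[here]  # species tree branches skipped
            losses += gap if is_dup else gap - 1
        return here
    visit(gene_tree)
    return dups, losses

def rootings(gene_tree):
    """Yield one rooted version of the gene tree per branch of the unrooted tree."""
    adj, label = {}, {}
    ids = count()
    def build(node):
        nid = next(ids)
        adj[nid] = []
        if isinstance(node, str):
            label[nid] = node
        else:
            for child in node:
                cid = build(child)
                adj[nid].append(cid)
                adj[cid].append(nid)
        return nid
    root = build(gene_tree)
    a, b = adj.pop(root)  # remove the input root to obtain an unrooted tree
    adj[a][adj[a].index(root)] = b
    adj[b][adj[b].index(root)] = a
    def subtree(node, came_from):
        if node in label:
            return label[node]
        return tuple(subtree(n, node) for n in adj[node] if n != came_from)
    seen = set()
    for u in adj:
        for v in adj[u]:
            if (v, u) not in seen:
                seen.add((u, v))
                yield (subtree(u, v), subtree(v, u))

def best_rooting(gene_tree, species_tree):
    """Return (steps, rooted_tree) for the rooting with the fewest dups + losses."""
    scored = ((sum(reconcile(r, species_tree)), r) for r in rootings(gene_tree))
    return min(scored, key=lambda pair: pair[0])

species_tree = (("A", "B"), "C")
misrooted = ("A", ("B", "C"))
print(reconcile(misrooted, species_tree))     # (1, 3)
print(best_rooting(misrooted, species_tree))  # (0, (('A', 'B'), 'C'))
```

In this toy case the input rooting implies one duplication and three losses, while rooting on the branch leading to C needs no events at all, so that rooting is selected. Ties between equally parsimonious rootings are broken arbitrarily here (the first one found is kept).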
Figure 3. Eukaryotic guide trees used in the analysis.
The eukaryotic guide trees were constructed from a concatenated alignment of the 40 universally distributed genes. A) The phylogeny supporting the Coelomata hypothesis for the evolution of animals. B) The phylogeny supporting the Ecdysozoa hypothesis for the evolution of animals, created by hand from A). Branch lengths represent the evolutionary distances between the taxa based on their amino acid sequences and were estimated using the same alignments of universal genes. Both trees were used in the gene-tree reconciliation step, so as not to bias subsequent analyses towards either hypothesis. Filled circles represent internal branches that received greater than 95% bootstrap proportion (BP) support. Open circles represent internal branches with greater than 60% BP support.
Figure 4. Distribution of single copy genes in the analyzed species.
The tree contains the species analyzed in this study and their relationships as defined by the NCBI taxonomy. The number of single copy genes found in each species is shown, along with that value expressed as a percentage of all 1,126 single copy genes and as a percentage of the total number of genes in the genome or EST dataset used. The black bars represent counts from genomes, grey bars from published EST datasets. Species names in bold indicate the species that were used to define the set of single copy orthologs.
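The two percentages described in this caption amount to simple ratios. The sketch below shows the arithmetic; only the reference set size of 1,126 comes from the text, while the function name and the example counts (563 genes found, 11,260 genes in the dataset) are made-up values chosen for illustration.

```python
REFERENCE_SET_SIZE = 1126  # single copy orthologous groups reported in the abstract

def coverage_percentages(found, total_genes, reference=REFERENCE_SET_SIZE):
    """Return (percent of the reference set found, percent of the dataset's genes)."""
    return 100.0 * found / reference, 100.0 * found / total_genes

# Hypothetical dataset: 563 single copy genes detected among 11,260 genes.
of_reference, of_dataset = coverage_percentages(found=563, total_genes=11260)
print(f"{of_reference:.1f}% of reference set, {of_dataset:.1f}% of dataset genes")
# 50.0% of reference set, 5.0% of dataset genes
```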
Figure 5. Multigene family reconstruction.
An example of the reconciliation of a proteasome 26S subunit multigene family is shown on the left. Duplications are hypothesized to have occurred on the branches colored in red, while branches hypothesized to have been lost are in grey. The subtree in the dashed box has been identified as being in single copy. The tree on the right is a more detailed view of the same clade. The leaves on the tree are labeled with their species names followed by the protein ID of the specific sequence that was mapped to that position.


