Nucleic Acids Res. 2008 Jan;36(Database issue):D255-62.
doi: 10.1093/nar/gkm924. Epub 2007 Nov 5.

EPGD: A Comprehensive Web Resource for Integrating and Displaying Eukaryotic Paralog/Paralogon Information

Free PMC article

Guohui Ding et al. Nucleic Acids Res.

Abstract

Gene duplication is common in all three domains of life, especially in eukaryotic genomes. The duplicates provide new material for the action of evolutionary forces such as selection or genetic drift. Here we describe a sophisticated procedure to extract duplicated genes (paralogs) from 26 available eukaryotic genomes, to pre-calculate several evolutionary indexes (evolutionary rate, synonymous distance/clock, transition redundant exchange clock, etc.) based on the paralog family, and to identify block or segmental duplications (paralogons). We also constructed an internet-accessible Eukaryotic Paralog Group Database (EPGD; http://epgd.biosino.org/EPGD/). The database is gene-centered and organized by paralog family. It focuses on paralogs and evolutionary duplication events. The paralog families and paralogons can be searched by text or sequence, and are downloadable from the website as plain text files. The database will be very useful for both experimentalists and bioinformaticians interested in the study of duplication events or paralog families.

Figures

Figure 1.
Web pages for gene record (A), paralog family (B) and paralogon region (C). (A) Example of a gene record for H. sapiens. The gene record web page consists of three segments: basic information, paralogon links and coding sequences. Through paralogon links, paralogons ‘including’ or ‘covering’ this gene can be accessed. (B) Example of a paralog family. The gene list, multi-alignment and pre-calculated evolutionary indexes can be obtained from this page. The user can visualize the multi-alignment via JalView (28). In addition, a UPGMA tree is built and rendered with a Java applet. (C) Paralogon region with a highlighted gene (colored red). Several basic properties (average block length, average block density, number of links) are displayed on the page. In the paralogon figures, the paralogs in these regions are connected with lines. Each gene in these figures is linked to its gene record in the database.
Figure 2.
Database searching. (A) Quick search for ‘any text’ or sequences. (B) Advanced text search. NCBI Gene ID, member ID, paralog family ID, paralogon ID, gene symbol and any word in the gene description can be applied as search fields. (C) Advanced sequence search by NCBI BLAST (20). (D) Query result with a navigation bar.
Figure 3.
Number of families (A), average size of families (B), ratio of paralogs (C) and number of paralogons (D) in different genomes. The number of genes denotes the size of a genome, r is the correlation coefficient and P is the P-value.
Figure 4.
Statistics of the paralog families in H. sapiens. (A) Frequency distribution of the sizes of the paralog families and the corresponding log–log diagram. Note that families with more than 17 gene members were omitted from this plot and that the largest family is the olfactory receptor family, which has 377 genes. In the log–log diagram, the logarithms of these two variables fit a linear model (r = −0.8191, P = 1.013 × 10^−9). (B) Negative correlation between TREx distance and dS. Only points with dS < 1.0 were used in this panel. The correlation coefficient of these two variables is −0.8916967 (P < 2.2 × 10^−16). The least-squares fit line has a slope of −0.3051738. (C) dN as a function of dS. Data points are divided into two groups: black points denote gene pairs for which the ratio dN/dS is not significantly different from the neutral expectation of 1 (−1.96 < z score < 1.96), and green points denote gene pairs whose dN/dS differs from the neutral expectation of 1 (z score > 1.96 or z score < −1.96). The dashed line denotes dN = dS. (D) Frequency distribution of the sizes of paralogons, where size is defined as the number of linked families in the region.
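
The two-group split described for Figure 4(C) can be sketched in a few lines of Python. This is only an illustration, not the authors' pipeline: the function name `classify_pair` and the sample z scores are hypothetical, and the sketch assumes a z score has already been computed for each gene pair. The only value taken from the caption is the ±1.96 cut-off. The caption does not say how a z score of exactly ±1.96 is handled; the sketch treats it as non-neutral, which is an assumption.

```python
# Two-sided cut-off taken from the Figure 4(C) caption.
Z_CUTOFF = 1.96


def classify_pair(z_score, cutoff=Z_CUTOFF):
    """Label a gene pair by whether its dN/dS departs from the neutral value of 1.

    'neutral'     : -cutoff < z_score < cutoff (black points in the caption)
    'non-neutral' : everything else (green points in the caption).
    Assumption: a z score exactly equal to +/-cutoff is labelled 'non-neutral'.
    """
    if -cutoff < z_score < cutoff:
        return "neutral"
    return "non-neutral"


# Hypothetical z scores, for illustration only (not data from the paper).
example_z = [0.3, -2.5, 1.95, 2.1]
labels = [classify_pair(z) for z in example_z]
print(labels)
```

With per-pair z scores in hand, the same function could be mapped over a whole paralog family to separate the two groups before plotting dN against dS.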

References

    1. Taylor JS, Raes J. Duplication and divergence: the evolution of new genes and old ideas. Annu. Rev. Genet. 2004;38:615–643.
    2. Zhang J. Evolution by gene duplication: an update. Trends Ecol. Evol. 2003;18:292–298.
    3. He X, Zhang J. Gene complexity and gene duplicability. Curr. Biol. 2005;15:1016–1021.
    4. Van de Peer Y. Computational approaches to unveiling ancient genome duplications. Nat. Rev. Genet. 2004;5:752–763.
    5. Teichmann SA, Babu MM. Gene regulatory network growth by duplication. Nat. Genet. 2004;36:492–496.
