BMC Bioinformatics. 2011 Sep 19;12:373.
doi: 10.1186/1471-2105-12-373.

Quantification of Protein Group Coherence and Pathway Assignment Using Functional Association

Free PMC article

Meghana Chitale et al. BMC Bioinformatics.

Abstract

Background: Genomics and proteomics experiments produce large amounts of data that await functional elucidation. An important step in analyzing such data is to identify functional units, which consist of proteins that play coherent roles in carrying out a function. Importantly, functional coherence is not identical to functional similarity. For example, proteins in the same pathway may not share the same Gene Ontology (GO) terms, yet they work in a coordinated fashion to perform the intended function. Thus, simply applying existing functional similarity measures might not be the best way to identify functional units in omics data.

Results: We have designed two scores for quantifying functional coherence by considering the association of GO terms observed in two biological contexts: co-occurrences in protein annotations and co-mentions in the literature in the PubMed database. The counted co-occurrences of GO terms were normalized in a manner similar to the computation of statistical amino acid contact potentials in the protein structure prediction field. We demonstrate that the developed scores can identify functionally coherent protein sets, i.e. proteins in the same pathways, co-localized proteins, and protein complexes, with statistically significant score values and with better accuracy than existing functional similarity scores. The scores are also capable of detecting protein pairs that interact with each other. We further show that the functional coherence scores can accurately assign proteins to their respective pathways.

Conclusion: We have developed two scores that quantify the functional coherence of sets of proteins. The scores reflect the actual associations of GO terms observed either in protein annotations or in the literature. They accurately distinguish biologically relevant groups of proteins from random ones and show good discriminative power for detecting interacting pairs of proteins. The scores were further applied successfully to assigning proteins to pathways.
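
The abstract describes normalizing GO term co-occurrence counts in the style of a statistical contact potential, but it does not state the formula. The following is a minimal sketch of one normalization of that kind, an observed-to-expected co-occurrence ratio, followed by a simple set-level average. It is an illustrative assumption, not the authors' definition of their scores. The function names (`association_scores`, `set_coherence`), the protein labels, and the placeholder term IDs such as `GO:A` are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def association_scores(annotations):
    """Observed/expected co-occurrence ratio for each GO term pair.

    annotations: dict mapping a protein ID to its set of GO term IDs.
    Assumption: expected frequency is the product of the two terms'
    individual frequencies, as in a contact-potential-style normalization.
    """
    term_counts = Counter()
    pair_counts = Counter()
    for terms in annotations.values():
        term_counts.update(terms)
        pair_counts.update(frozenset(p) for p in combinations(sorted(terms), 2))

    total_terms = sum(term_counts.values())
    total_pairs = sum(pair_counts.values())
    scores = {}
    for pair, n_ab in pair_counts.items():
        a, b = tuple(pair)
        observed = n_ab / total_pairs
        expected = (term_counts[a] / total_terms) * (term_counts[b] / total_terms)
        scores[pair] = observed / expected
    return scores


def set_coherence(protein_set, annotations, scores):
    """Mean association over GO term pairs taken from different proteins.

    Pairs never observed together contribute 0.0 (an assumption).
    """
    values = []
    for p, q in combinations(sorted(protein_set), 2):
        for a in annotations[p]:
            for b in annotations[q]:
                if a != b:
                    values.append(scores.get(frozenset((a, b)), 0.0))
    return sum(values) / len(values) if values else 0.0


# Toy data with placeholder protein and term IDs (hypothetical).
toy_annotations = {
    "P1": {"GO:A", "GO:B"},
    "P2": {"GO:A", "GO:B"},
    "P3": {"GO:A", "GO:C"},
    "P4": {"GO:C", "GO:D"},
}
toy_scores = association_scores(toy_annotations)
coherent = set_coherence({"P1", "P2"}, toy_annotations, toy_scores)
unrelated = set_coherence({"P1", "P4"}, toy_annotations, toy_scores)
```

In this toy data, the set {P1, P2} shares frequently co-occurring terms and scores higher than {P1, P4}, whose terms rarely appear together. A real analysis would also need a significance estimate against random protein sets, which this sketch omits.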

Figures

Figure 1
CAS and PAS distributions for same-domain GO term pairs. A, CAS distribution in MF; B, CAS distribution in BP; C, CAS distribution in CC; D, PAS distribution in MF; E, PAS distribution in BP; F, PAS distribution in CC. Out of 5,610,201 non-zero CAS values for GO term pairs, 107,673 (1.91%) were MF pairs, 3,430,135 (61.14%) were BP pairs, and 73,909 (1.31%) were CC pairs. Out of 3,320,265 non-zero PAS values for GO term pairs, 73,556 (2.21%) were MF pairs, 1,999,993 (60.23%) were BP pairs, and 51,816 (1.56%) were CC pairs.
Figure 2
CAS and PAS distributions for cross-domain GO term pairs. A, MF-BP CAS distribution; B, BP-CC CAS distribution; C, CC-MF CAS distribution; D, MF-BP PAS distribution; E, BP-CC PAS distribution; F, CC-MF PAS distribution. Out of 5,610,201 non-zero CAS values for GO term pairs, 1,026,484 (18.3%) were MF-BP pairs, 787,000 (14.0%) were BP-CC pairs, and 183,001 (3.3%) were CC-MF pairs. Out of 3,320,265 non-zero PAS values for GO term pairs, 614,509 (18.5%) were MF-BP pairs, 471,879 (14.2%) were BP-CC pairs, and 108,512 (3.3%) were CC-MF pairs.
Figure 3
Relationship between CAS and PAS for a sample set of GO term pairs. Out of the 5,610,201 GO term pairs for which CAS has been computed and the 5,255,249 pairs for which PAS has been computed, we randomly sampled 50,000 pairs. Of these, the 29,474 pairs for which both CAS and PAS are non-zero were selected. The correlation coefficient of the two scores is 0.3084.
Figure 4
Relationship of CAS and PAS with the semantic similarity scores. A, CAS vs. semantic similarity for MF pairs (correlation coefficient: r = 0.5037); B, CAS vs. semantic similarity for BP pairs (r = 0.3450); C, CAS vs. semantic similarity for CC pairs (r = 0.4202); D, PAS vs. semantic similarity for MF pairs (r = 0.3023); E, PAS vs. semantic similarity for BP pairs (r = 0.1711); F, PAS vs. semantic similarity for CC pairs (r = 0.2613). The GO term pairs used in these plots are the same as those used in Figure 3 (29,474 pairs). The semantic similarity has been computed using Eqn. 10. Since semantic similarity describes the relationship between terms of the same GO domain, the plots only include GO term pairs from the same domain (646 MF pairs, 17,731 BP pairs, and 492 CC pairs).
Figure 5
The size of protein sets in the three datasets. A, the KEGG pathway dataset; B, the protein complex dataset; C, the GOcc dataset.
Figure 6
Percentage of protein sets identified at different p-value cutoffs. Each protein set is evaluated by the p-values of the five coherence scores (CAS, PAS, funsim, Chagoyen, and Pandey), and sets whose p-value is more significant than the cutoff are counted. A, the KEGG pathway dataset; B, the protein complex dataset; C, the GOcc set; D, the random set.
Figure 7
Comparison of the p-values of the CAS_coherence and PAS_coherence scores. A, the pathway set; B, the protein complex dataset; C, the GOcc set.
Figure 8
Coherence score comparisons with and without the obviously related GO domain. For the pathway sets, the coherence scores were compared with and without the BP domain annotations. A, CAS coherence computed with all three domains (x-axis) versus CAS_Coherence(BP-), the CAS coherence computed without BP terms (y-axis); B, PAS coherence scores with and without BP terms. For the GOcc sets, the coherence scores were compared with and without (CC-) the CC domain annotations. C, CAS coherence; D, PAS coherence.
Figure 9
Percentage of protein sets identified at different p-value cutoffs using partial annotation information. The p-values of the CAS and PAS coherence scores were computed with and without (BP-) BP domain annotations for the pathway dataset. A, CAS coherence; B, PAS coherence. The GOcc sets were evaluated with and without (CC-) CC domain annotations. C, CAS coherence; D, PAS coherence.
Figure 10
ROC curves for detection of interacting protein pairs by functional similarity/association scores. Protein pairs with a significant p-value for the functional similarity/association scores (CAS, PAS, funsim, Chagoyen, and Pandey) were predicted to interact with each other. A, yeast PPI data; B, human PPI data.
Figure 11
Pathway assignment for yeast proteins. The coherence score of the query protein with each KEGG pathway is used to rank the KEGG pathways, indicating the pathways to which the query protein is most likely to belong. The cumulative percentages of query proteins assigned to their correct pathway within the top X ranks are plotted. CAS, CAS(BP-) (without BP annotations), PAS, PAS(BP-) (without BP annotations), funsim, GOscore_coherenceBP (using only BP annotations), Chagoyen, and Pandey were used.
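
The evaluation described in the Figure 11 caption, ranking pathways by score and reporting the cumulative percentage of queries whose correct pathway falls within the top X ranks, can be sketched as follows. This is a hedged illustration of the evaluation only: how the coherence score between a query protein and a pathway is computed is not given here, so the scores below are invented toy numbers, and the function names, query labels, and pathway IDs (`pw1` etc.) are hypothetical.

```python
def rank_pathways(scores_by_pathway):
    """Return pathway IDs ordered from highest to lowest score."""
    return sorted(scores_by_pathway, key=scores_by_pathway.get, reverse=True)


def cumulative_top_x(rankings, true_pathways, max_rank):
    """Percentage of queries with a correct pathway within the top X ranks.

    rankings: dict mapping a query to its ranked list of pathway IDs.
    true_pathways: dict mapping a query to the set of its correct pathways.
    Returns a list whose element X-1 is the percentage for rank cutoff X.
    """
    hits = [0] * max_rank
    for query, ranked in rankings.items():
        correct = true_pathways[query]
        # 0-based position of the best-ranked correct pathway, if any.
        best = next((i for i, p in enumerate(ranked) if p in correct), None)
        if best is not None and best < max_rank:
            for x in range(best, max_rank):
                hits[x] += 1
    n_queries = len(rankings)
    return [100.0 * h / n_queries for h in hits]


# Toy scores (invented) for two hypothetical query proteins.
query_scores = {
    "q1": {"pw1": 0.9, "pw2": 0.2, "pw3": 0.1},
    "q2": {"pw1": 0.5, "pw2": 0.4, "pw3": 0.7},
}
true_pathways = {"q1": {"pw1"}, "q2": {"pw2"}}

rankings = {q: rank_pathways(s) for q, s in query_scores.items()}
curve = cumulative_top_x(rankings, true_pathways, max_rank=3)
```

Here `q1` is assigned correctly at rank 1 and `q2` only at rank 3, so the curve stays at 50% for the first two cutoffs and reaches 100% at the third.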


