Gene function prediction using labeled and unlabeled data
- PMID: 18221567
- PMCID: PMC2275242
- DOI: 10.1186/1471-2105-9-57
Abstract
Background: Gene function prediction is commonly formalized as a classification problem and solved with machine learning techniques. Training a classifier usually requires both labeled positive and labeled negative samples. For gene function prediction, however, the available information covers only positive samples: we know which genes have the function of interest, but it is generally unclear which genes do not have it, i.e. which are the negative samples. If all genes outside the target functional family are treated as negative samples, a class imbalance problem arises, because only a relatively small number of genes are annotated in each family. Furthermore, false negatives among the heuristically generated negative samples may degrade the classifier.
Results: In this paper, we present a new technique, Annotating Genes with Positive Samples (AGPS), for defining negative samples in gene function prediction. With the defined negative samples, it is straightforward to predict the functions of unknown genes. In addition, the AGPS algorithm can integrate various kinds of data sources to predict gene functions reliably and accurately. With one-class and two-class Support Vector Machines as the core learning algorithms, the AGPS algorithm shows good performance for function prediction on yeast genes.
Conclusion: We propose a new method for defining negative samples in gene function prediction. Experimental results on yeast genes show that AGPS performs well on both training and test sets. In addition, the overlap between the prediction results and GO annotations on unknown genes also demonstrates the effectiveness of the proposed method.
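The two-stage idea summarized in the abstract (first define negative samples from unlabeled genes, then train an ordinary two-class classifier) can be illustrated with a minimal sketch. This is not the AGPS algorithm itself: the abstract does not give its details, and the sketch replaces both Support Vector Machines with simple centroid-distance rules so that it runs with the Python standard library alone. The function names, the `fraction` parameter, and the toy feature vectors are all hypothetical.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length feature vectors."""
    dims = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dims)]


def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_negatives(positives, unlabeled, fraction=0.5):
    """Stage 1 (stand-in for a one-class model, an assumption of this sketch):
    rank unlabeled genes by distance from the positive centroid and keep
    the farthest `fraction` of them as putative negative samples."""
    center = centroid(positives)
    ranked = sorted(unlabeled, key=lambda g: distance(g, center), reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]


def train_two_class(positives, negatives):
    """Stage 2 (stand-in for a two-class model): a nearest-centroid rule.
    Returns a function that maps a feature vector to True (has the function)
    or False (does not)."""
    pos_center = centroid(positives)
    neg_center = centroid(negatives)

    def predict(gene):
        return distance(gene, pos_center) < distance(gene, neg_center)

    return predict


# Toy data (hypothetical): each gene is a 2-D feature vector.
annotated = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1]]
unknown = [[1.1, 1.0], [5.0, 5.0], [4.8, 5.2],
           [1.0, 0.8], [5.1, 4.9], [0.95, 1.05]]

putative_negatives = select_negatives(annotated, unknown, fraction=0.5)
classifier = train_two_class(annotated, putative_negatives)
predictions = [classifier(g) for g in unknown]
```

Selecting only the unlabeled genes that look least like the annotated ones keeps the negative set small and reduces the chance that it contains genes that actually have the function, which are the two problems the Background paragraph raises.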
