Predicting the sub-cellular location of proteins from text using support vector machines

Pac Symp Biocomput. 2002:374-85. doi: 10.1142/9789812799623_0035.

Abstract

We present an automatic method for classifying the sub-cellular location of proteins based on the text of relevant MEDLINE abstracts. For each protein, a vector of terms is generated from the MEDLINE abstracts in which the name of the protein or gene, or one of its synonyms, occurs. A Support Vector Machine (SVM) is used to automatically partition the term space and thereby to discriminate the textual features that define sub-cellular location. The method is benchmarked on a set of proteins of known sub-cellular location from S. cerevisiae. Neither prior knowledge of the problem domain nor any natural language processing is used at any stage. The method outperforms support vector machines trained on amino acid composition and performs comparably to rule-based text classifiers. Combining text with protein amino acid composition improves recall for some sub-cellular locations. We discuss the generality of the method and its potential application to a variety of biological classification problems.
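
The pipeline described in the abstract (pool the abstracts that mention a protein, turn them into a term vector, train an SVM on those vectors) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not state the term weighting, kernel, or multi-class scheme, so the sketch assumes L2-normalised raw term counts, a linear SVM without a bias term trained by Pegasos-style subgradient descent, and a single binary decision (nucleus versus membrane). The protein identifiers and abstract texts are invented toy data, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(abstracts):
    """Pool the abstracts mentioning one protein into an L2-normalised term-count vector."""
    counts = Counter()
    for text in abstracts:
        # Plain lowercase tokenisation; no stemming, stop lists, or other NLP.
        counts.update(re.findall(r"[a-z]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {term: c / norm for term, c in counts.items()}


def score(weights, x):
    """Signed distance-like score w . x for a sparse term vector x."""
    return sum(weights.get(term, 0.0) * value for term, value in x.items())


def train_linear_svm(samples, labels, lam=0.01, epochs=200):
    """Linear SVM (hinge loss, L2 penalty) via deterministic Pegasos-style updates.

    Assumption: labels are +1 / -1 and no bias term is fitted.
    """
    weights, step = {}, 0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            step += 1
            eta = 1.0 / (lam * step)           # decaying learning rate
            violates_margin = y * score(weights, x) < 1.0
            shrink = 1.0 - 1.0 / step          # equals 1 - eta * lam (regularisation step)
            for term in weights:
                weights[term] *= shrink
            if violates_margin:                # hinge-loss subgradient step
                for term, value in x.items():
                    weights[term] = weights.get(term, 0.0) + eta * y * value
    return weights


# Invented toy data: hypothetical protein IDs, made-up abstract text, +1 = nucleus, -1 = membrane.
toy_training = {
    "PROT_A": (["localizes to the nucleus and binds chromatin"], +1),
    "PROT_B": (["nuclear transcription factor binds DNA"], +1),
    "PROT_C": (["integral plasma membrane transporter"], -1),
    "PROT_D": (["membrane permease transports glucose across plasma membrane"], -1),
}

samples = [term_vector(texts) for texts, _ in toy_training.values()]
labels = [label for _, label in toy_training.values()]
weights = train_linear_svm(samples, labels)


def classify(abstracts):
    """Predict a location for a protein from the abstracts that mention it."""
    return "nucleus" if score(weights, term_vector(abstracts)) > 0 else "membrane"


print(classify(["chromatin remodeling factor in the nucleus"]))
print(classify(["plasma membrane glucose transporter"]))
```

Under these assumptions, the learned weights are the only "knowledge" of the domain: terms that co-occur with one location acquire weights of that sign, which is the sense in which the classifier discriminates textual features without hand-written rules. Extending the sketch to several locations would need one such classifier per location or another multi-class scheme, which the abstract does not specify.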

MeSH terms

  • Automation
  • Cell Membrane / chemistry
  • Chromosomes, Fungal
  • Cytoplasm / chemistry
  • Cytoskeleton / chemistry
  • Fungal Proteins / analysis*
  • Fungal Proteins / genetics
  • Genetic Vectors
  • Lysosomes / chemistry
  • MEDLINE
  • Saccharomyces cerevisiae / chemistry*
  • Saccharomyces cerevisiae / genetics
  • Subcellular Fractions / chemistry*
  • Vacuoles / chemistry

Substances

  • Fungal Proteins