A hybrid named entity tagger for tagging human proteins/genes

Int J Data Min Bioinform. 2014;10(3):315-28. doi: 10.1504/ijdmb.2014.064545.

Abstract

Extracting gene/protein names from biomedical texts is a key step in, and a prerequisite for, the analysis of scientific literature. Though many taggers are available for this Named Entity Recognition (NER) task, we found that none of them achieves good, state-of-the-art tagging of human genes/proteins. Because most current text mining research concerns human-related literature, a tagger that precisely tags human genes and proteins is highly desirable. In this paper, we propose a new hybrid approach based on (a) a machine learning algorithm (conditional random fields), (b) a set of manually constructed rules, and (c) a novel abbreviation identification algorithm, designed to overcome the common errors that available taggers make when tagging human genes/proteins. Experimental results on the JNLPBA2004 corpus show that our domain-specific approach achieves a high precision of 80.47 and an F-score of 75.77, and outperforms most state-of-the-art systems. However, the recall of 71.60 remains low and leaves much room for future improvement.
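
As a quick consistency check on the figures reported above, the following minimal sketch recomputes the F-score from the stated precision and recall. It assumes the F-score is the balanced F1 measure (the harmonic mean of precision and recall), which the abstract does not state explicitly; the function name `f_score` and the `beta` parameter are illustrative and not taken from the paper.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """Return the F-beta score; beta=1 gives the harmonic mean (F1).

    Assumption: the abstract's "F-score" is the balanced F1 measure.
    """
    if precision <= 0 or recall <= 0:
        return 0.0  # F is defined as 0 when either component is 0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Values as reported in the abstract (on a 0-100 scale).
precision = 80.47
recall = 71.60

f1 = f_score(precision, recall)
print(f"F1 = {f1:.2f}")
```

Under that assumption the result is about 75.78, which agrees with the reported 75.77 up to rounding (the published value may have been truncated or computed from raw counts).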

MeSH terms

  • Algorithms
  • Artificial Intelligence
  • Computational Biology / methods*
  • Data Mining
  • Databases, Factual
  • Genes
  • Humans
  • Models, Statistical
  • Proteins / chemistry*
  • Reproducibility of Results
  • Vocabulary, Controlled*

Substances

  • Proteins