Bioinformatics. 2020 Jan 1;36(1):280-286. doi: 10.1093/bioinformatics/btz504.

Towards reliable named entity recognition in the biomedical domain


John M Giorgi et al. Bioinformatics.

Abstract

Motivation: Automatic biomedical named entity recognition (BioNER) is a key task in biomedical information extraction. For some time, state-of-the-art BioNER has been dominated by machine learning methods, particularly conditional random fields (CRFs), with a recent focus on deep learning. However, recent work has suggested that the high performance of CRFs for BioNER may not generalize to corpora other than those they were trained on. In our analysis, we find that a popular deep learning-based approach to BioNER, the bidirectional long short-term memory network-conditional random field (BiLSTM-CRF), is correspondingly poor at generalizing. To address this, we evaluate three modifications of BiLSTM-CRF for BioNER intended to improve generalization: improved regularization via variational dropout, transfer learning and multi-task learning.
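The variational dropout modification can be illustrated with a minimal pure-Python sketch (an illustration only, not the Saber implementation; function names here are hypothetical). The key difference from standard dropout is that one mask is sampled per sequence and reused at every timestep, rather than resampled per timestep:

```python
import random

def variational_dropout_mask(hidden, p, rng):
    """Sample ONE inverted-dropout mask for a sequence.

    Kept units are scaled by 1/(1-p) at train time so activations
    keep the same expected value."""
    keep = 1.0 - p
    return [(1.0 / keep) if rng.random() < keep else 0.0
            for _ in range(hidden)]

def apply_variational_dropout(seq, p=0.5, rng=None):
    """seq: list of timestep vectors (list of list of float).

    Variational dropout applies the SAME mask at every timestep of a
    sequence; standard dropout would resample the mask per timestep,
    which regularizes recurrent connections less consistently."""
    rng = rng or random.Random(0)
    hidden = len(seq[0])
    mask = variational_dropout_mask(hidden, p, rng)
    return [[v * m for v, m in zip(step, mask)] for step in seq]

# Example: a 5-timestep sequence of 4-dimensional hidden states.
seq = [[1.0] * 4 for _ in range(5)]
out = apply_variational_dropout(seq, p=0.5, rng=random.Random(1))
# Every timestep shares the same dropout pattern (all rows identical).
```

In a BiLSTM this per-sequence mask would be applied to the recurrent hidden states and/or inputs; deep learning frameworks sometimes expose this under names like "locked" or "variational" dropout.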

Results: We measure the effect of each strategy when training and testing on the same corpus ('in-corpus' performance) and when training on one corpus and evaluating on another ('out-of-corpus' performance), our measure of the model's ability to generalize. We found that variational dropout improves out-of-corpus performance by an average of 4.62%, transfer learning by 6.48% and multi-task learning by 8.42%. The maximal increase we identified combines multi-task learning and variational dropout, which boosts out-of-corpus performance by 10.75%. Furthermore, we make available a new open-source tool, called Saber, that implements our best BioNER models.

Availability and implementation: Source code for our biomedical IE tool is available at https://github.com/BaderLab/saber. Corpora and other resources used in this study are available at https://github.com/BaderLab/Towards-reliable-BioNER.

Supplementary information: Supplementary data are available at Bioinformatics online.


Figures

Fig. 1.
Violin plot of the average in-corpus (IC) and out-of-corpus (OOC) performance, measured by F1 score, of the BiLSTM-CRF model. IC performance is derived from 5-fold cross-validation, using exact matching criteria. OOC performance is derived by training on one corpus (train) and testing on another corpus annotated for the same entity type (test), using a relaxed, right-boundary matching criterion. Also shown is the average performance of models employing each of the proposed modifications independently, namely variational dropout (VD), transfer learning (TL) and multi-task learning (MTL), as well as models employing all combinations of these methods.
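The two matching criteria in the caption can be sketched as follows (a simplified illustration, not the paper's evaluation code; the span representation and function names are assumptions). Entities are (start, end, type) tuples; exact matching requires both boundaries and the type to agree, while relaxed right-boundary matching requires only the right boundary and type:

```python
def f1(tp, n_pred, n_gold):
    """Micro F1 from true-positive, predicted and gold counts."""
    p = tp / n_pred if n_pred else 0.0
    r = tp / n_gold if n_gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def score(pred, gold, criterion="exact"):
    """pred/gold: sets of (start, end, type) entity spans.

    'exact'  -> both boundaries and the entity type must match.
    'right'  -> only the right boundary and the entity type must
                match (relaxed right-boundary matching)."""
    if criterion == "exact":
        tp = len(pred & gold)
    else:
        gold_right = {(end, typ) for _, end, typ in gold}
        tp = sum((end, typ) in gold_right for _, end, typ in pred)
    return f1(tp, len(pred), len(gold))

gold = {(0, 2, "CHED"), (5, 7, "CHED")}
pred = {(1, 2, "CHED"), (5, 7, "CHED")}  # first span's left boundary is off
exact_f1 = score(pred, gold, "exact")    # 0.5: only one exact match
relaxed_f1 = score(pred, gold, "right")  # 1.0: both right boundaries match
```

The relaxed criterion credits predictions whose left boundary is off, which is why OOC scores under it are more forgiving than exact-match IC scores.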
