Named Entity Recognition for Chinese Cancer Electronic Health Records-Development and Evaluation of a Domain-Specific BERT Model: Quantitative Study

JMIR Med Inform. 2025 Nov 14;13:e76912. doi: 10.2196/76912.

Abstract

Background: The unstructured data in Chinese cancer electronic health records (EHRs) contain valuable medical expertise. Accurate medical entity recognition is crucial for building medical decision support systems. Named entity recognition (NER) in cancer EHRs typically relies on general-purpose models designed for English medical records, with little specialized handling of cancer-specific records and limited application to Chinese medical records.

Objective: This study aimed to develop a domain-specific NER model that improves the recognition of medical entities in Chinese cancer EHRs.

Methods: Deidentified inpatient EHRs related to breast cancer were collected from a leading hospital in Beijing. Starting from the MC-BERT (Bidirectional Encoder Representations from Transformers) model, the study continued pretraining on a Chinese cancer corpus to construct the ChCancerBERT pretrained model. Combined with dilated-gated convolutional neural networks, bidirectional long short-term memory, a multihead attention mechanism, and a conditional random field, this model forms a multimodel, multilevel integrated NER approach.
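The last component named above, the conditional random field, selects the highest-scoring tag sequence for a sentence rather than tagging each token independently. The following is a minimal sketch of that decoding step (Viterbi search), not the authors' implementation. The BIO tagging scheme, the `SYM` label for symptoms, and all scores are hypothetical values chosen for illustration.

```python
from typing import Dict, List


def viterbi_decode(emissions: List[Dict[str, float]],
                   transitions: Dict[str, Dict[str, float]],
                   tags: List[str]) -> List[str]:
    """Return the highest-scoring tag sequence for one sentence."""
    if not emissions:
        return []
    # best[t][tag]: best score of any path that ends in `tag` at position t
    best = [{tag: emissions[0][tag] for tag in tags}]
    back = []  # back[t - 1][tag]: previous tag on that best path
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions[p][cur])
            scores[cur] = best[-1][prev] + transitions[prev][cur] + emissions[t][cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag to the start.
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical tag set and scores (assumptions, not values from the study).
TAGS = ["O", "B-SYM", "I-SYM"]
# All transitions score 0 except O -> I-SYM, which is invalid under BIO.
TRANSITIONS = {p: {c: 0.0 for c in TAGS} for p in TAGS}
TRANSITIONS["O"]["I-SYM"] = -10.0
EMISSIONS = [
    {"O": 2.0, "B-SYM": 0.5, "I-SYM": 0.0},
    {"O": 0.5, "B-SYM": 1.0, "I-SYM": 1.2},
    {"O": 0.4, "B-SYM": 0.1, "I-SYM": 1.5},
]

decoded = viterbi_decode(EMISSIONS, TRANSITIONS, TAGS)
print(decoded)
```

Picking the top-scoring tag per token would give O, I-SYM, I-SYM in this toy example, which is not a valid BIO sequence. The transition penalty steers the decoder to O, B-SYM, I-SYM instead, which is the kind of sequence-level consistency a conditional random field layer is typically added for.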

Results: This approach extracts medical entity features related to symptoms, signs, tests, treatments, and time in Chinese breast cancer EHRs. The proposed model outperformed the baseline model and the other models compared in the experiment, with an F1-score of 86.93%, precision of 87.24%, and recall of 86.61%. On the CCKS2019 dataset, the model attained a precision of 87.26%, a recall of 87.27%, and an F1-score of 87.26%, surpassing existing models.
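The F1-score is conventionally the harmonic mean of precision and recall, so the reported figures can be sanity-checked against each other. The sketch below assumes that standard definition (the abstract does not state the formula) and uses illustrative function and variable names. Because it works from the rounded percentages, the recomputed values can differ from the reported ones in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, in the same units as the inputs."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Percentages as reported in the Results paragraph.
first_f1 = f1_score(87.24, 86.61)   # reported F1-score: 86.93
ccks_f1 = f1_score(87.26, 87.27)    # reported F1-score: 87.26 (CCKS2019)

print(f"{first_f1:.2f} {ccks_f1:.2f}")
```

Both recomputed values fall within 0.01 percentage points of the reported F1-scores, which is consistent with rounding of the underlying precision and recall.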

Conclusions: The experiments demonstrate that the proposed approach performs well on NER in breast cancer EHRs. This advance can contribute to clinical decision support for cancer treatment and research. In addition, the study shows that incorporating domain-specific corpora in clinical NER tasks can further improve the performance of BERT models in specialized domains.

Keywords: BERT; cancer; deep learning; electronic health records; named entity recognition.

MeSH terms

  • Breast Neoplasms*
  • China
  • East Asian People
  • Electronic Health Records*
  • Female
  • Humans
  • Natural Language Processing*
  • Neoplasms*
  • Neural Networks, Computer

Supplementary concepts

  • Chinese people