Unsupervised multi-granular Chinese word segmentation and term discovery via graph partition

J Biomed Inform. 2020 Oct:110:103542. doi: 10.1016/j.jbi.2020.103542. Epub 2020 Aug 24.

Abstract

Objective: This study aims to achieve unsupervised term discovery in Chinese electronic health records (EHRs) by means of word segmentation. Existing supervised algorithms perform poorly on EHRs because annotated medical data are scarce. We propose an unsupervised segmentation method (GTS) based on the graph partition principle, whose multi-granular segmentation capability supports efficient term discovery.

Methods: A sentence is converted to an undirected graph whose edge weights are based on n-gram statistics, and ratio cut is used to split the sentence into words. The graph partition is solved efficiently via dynamic programming, and multi-granularity is achieved by varying the number of partitions. A BERT-based discriminator is trained on generated samples to verify the correctness of the word boundaries. Words that are not recorded in existing dictionaries are retained as potential medical terms.
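The partition step can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the sentence graph is a simple chain in which only adjacent characters are connected, and it takes the edge weights as given, since the abstract does not specify how they are derived from n-gram statistics. The function name `ratio_cut_segment` and the example weights are hypothetical. Under these assumptions the ratio cut objective sums, over all segments, the weight of the edges cut at the segment's two boundaries divided by the segment length, and dynamic programming finds the best split into exactly k segments.

```python
from typing import List, Sequence


def ratio_cut_segment(sentence: str, edge_weights: Sequence[float], k: int) -> List[str]:
    """Split `sentence` into k contiguous segments minimizing a ratio cut objective.

    Assumption: edge_weights[t] is the association strength between
    characters t and t+1 (a chain graph); higher means "keep together".
    """
    n = len(sentence)
    if len(edge_weights) != n - 1:
        raise ValueError("need exactly len(sentence) - 1 edge weights")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(sentence)")

    def cut(pos: int) -> float:
        # Weight of the edge severed by a boundary placed before character `pos`.
        return edge_weights[pos - 1] if 0 < pos < n else 0.0

    def segment_cost(i: int, j: int) -> float:
        # Ratio cut term for the segment sentence[i:j].
        return (cut(i) + cut(j)) / (j - i)

    inf = float("inf")
    # best[m][j]: minimum cost of splitting sentence[:j] into m segments.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                cost = best[m - 1][i] + segment_cost(i, j)
                if cost < best[m][j]:
                    best[m][j] = cost
                    back[m][j] = i

    # Follow the back-pointers from the end of the sentence.
    words: List[str] = []
    j = n
    for m in range(k, 0, -1):
        i = back[m][j]
        words.append(sentence[i:j])
        j = i
    return words[::-1]


# Hypothetical weights: strong links inside "ab" and "cde", weak links between them.
weights = [9.0, 1.0, 9.0, 9.0, 1.0]
coarse = ratio_cut_segment("abcdef", weights, 2)
fine = ratio_cut_segment("abcdef", weights, 3)
```

Calling the function with different values of k on the same sentence yields coarser or finer segmentations, which is the sense in which the partition number controls granularity.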

Results: We compared the GTS approach with mature segmentation systems on both word segmentation and term discovery. MD students manually segmented Chinese EHRs at fine and coarse granularity levels and reviewed the term discovery results. The proposed unsupervised method outperformed all the competing algorithms in the word segmentation task. In term discovery, GTS outperformed the best baseline by 17 percentage points in F1-score (a 47% relative increase).
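The absolute and relative figures together imply approximate F1-scores, which the abstract does not report directly. The values below are back-calculated from the two rounded numbers and should be read as rough estimates only; the symbols F1_base and F1_GTS are introduced here for illustration.

```latex
\[
\frac{F1_{\mathrm{GTS}} - F1_{\mathrm{base}}}{F1_{\mathrm{base}}} \approx 0.47,
\qquad
F1_{\mathrm{GTS}} - F1_{\mathrm{base}} \approx 0.17
\]
\[
\Rightarrow\quad
F1_{\mathrm{base}} \approx \frac{0.17}{0.47} \approx 0.36,
\qquad
F1_{\mathrm{GTS}} \approx 0.36 + 0.17 = 0.53
\]
```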

Conclusion: In the absence of annotated training data, the graph partition technique can effectively use the corpus statistics and even expert knowledge to realize unsupervised word segmentation of EHRs. Multi-granular segmentation can be used to provide potential medical terms of various lengths with high accuracy.

Keywords: Electronic health records; Graph partition; Multi-granularity; Term discovery; Word segmentation.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms*
  • China
  • Electronic Health Records*
  • Humans
  • Language