Entity recognition is an important but challenging research problem. In practice, many text collections come from specific, dynamic, or emerging domains, which poses significant new challenges for entity recognition: name ambiguity and context sparsity increase, and entities must be detected without domain restriction. In this paper, we investigate entity recognition (ER) with distant supervision and propose a novel relation phrase-based ER framework, called ClusType, that runs data-driven phrase mining to generate entity mention candidates and relation phrases, and enforces the principle that relation phrases should be softly clustered when propagating type information between their argument entities. We then predict the type of each entity mention based on the type signatures of its co-occurring relation phrases and the type indicators of its surface name, as computed over the corpus. Specifically, we formulate a joint optimization problem for two tasks: type propagation with relation phrases and multi-view relation phrase clustering. Our experiments on multiple genres (news, Yelp reviews, and tweets) demonstrate the effectiveness and robustness of ClusType, with an average improvement of 37% in F1 score over the best compared method.
Keywords: Entity Recognition and Typing; Relation Phrase Clustering.
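The idea of propagating type information between entity mentions through the relation phrases they share can be illustrated with a small sketch. This is not the ClusType algorithm: it omits the soft clustering of relation phrases and the joint optimization, and replaces them with plain iterative averaging over a mention/relation-phrase graph. The triples, the seed labels, the two-type inventory, and the choice to treat the left and right argument of a relation phrase as separate slots are all assumptions made for illustration.

```python
from collections import defaultdict

TYPES = ["PERSON", "LOCATION"]  # hypothetical type inventory

# Toy (left mention, relation phrase, right mention) triples; illustrative only.
triples = [
    ("Obama", "was born in", "Honolulu"),
    ("Merkel", "was born in", "Hamburg"),
    ("Merkel", "met with", "Obama"),
]

# Seed mentions, assumed to be typed by linking to a knowledge base
# (the distant-supervision step); hypothetical labels.
seeds = {"Obama": "PERSON", "Honolulu": "LOCATION"}


def propagate(triples, seeds, types, iterations=10):
    """Average type scores back and forth between mentions and phrase slots."""
    slot_members = defaultdict(list)   # (phrase, side) -> mentions in that slot
    mention_slots = defaultdict(list)  # mention -> slots it occupies
    for left, phrase, right in triples:
        for slot, mention in (((phrase, "L"), left), ((phrase, "R"), right)):
            slot_members[slot].append(mention)
            mention_slots[mention].append(slot)

    def mean(vectors):
        return {t: sum(v[t] for v in vectors) / len(vectors) for t in types}

    # Seeds start one-hot; every other mention starts with all-zero scores.
    scores = {
        m: {t: 1.0 if seeds.get(m) == t else 0.0 for t in types}
        for m in mention_slots
    }
    for _ in range(iterations):
        # Type signature of a slot: mean score of the mentions filling it.
        signatures = {
            slot: mean([scores[m] for m in members])
            for slot, members in slot_members.items()
        }
        for m in mention_slots:
            if m in seeds:
                continue  # seed labels stay clamped
            # Mention score: mean signature of the slots it occurs in.
            scores[m] = mean([signatures[s] for s in mention_slots[m]])
    return scores


scores = propagate(triples, seeds, TYPES)
predicted = {m: max(TYPES, key=s.get) for m, s in scores.items()}
print(predicted)
```

In this toy run, "Merkel" inherits PERSON because it shares the left slot of "was born in" with the seed "Obama", and "Hamburg" inherits LOCATION from the right slot it shares with "Honolulu". A faithful implementation would additionally group similar relation phrases so that sparse phrases can borrow signatures from their cluster, which is the part of the framework this sketch leaves out.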