Leveraging Wikipedia knowledge to classify multilingual biomedical documents

Marcos Antonio Mouriño García; Roberto Pérez Rodríguez; Luis Anido Rifón

doi:10.1016/j.artmed.2018.04.007

Artif Intell Med. 2018 Jun:88:37-57. doi: 10.1016/j.artmed.2018.04.007. Epub 2018 May 3.

Authors

Marcos Antonio Mouriño García¹, Roberto Pérez Rodríguez², Luis Anido Rifón³

Affiliations

¹ Department of Telematics Engineering, University of Vigo, Campus Lagoas-Marcosende, 36310 Vigo, Spain. Electronic address: marcos@gist.uvigo.es.
² Department of Telematics Engineering, University of Vigo, Campus Lagoas-Marcosende, 36310 Vigo, Spain. Electronic address: roberto.perez@gist.uvigo.es.
³ Department of Telematics Engineering, University of Vigo, Campus Lagoas-Marcosende, 36310 Vigo, Spain. Electronic address: lanido@gist.uvigo.es.

PMID: 29730047

Abstract

This article presents a classifier that leverages Wikipedia knowledge to represent documents as vectors of concept weights, and analyses its suitability for classifying biomedical documents written in any language when it is trained only with English documents. We propose the cross-language concept matching technique, which relies on Wikipedia interlanguage links to convert concept vectors between languages. The performance of the classifier is compared with that of a classifier based on machine translation and of two classifiers based on MetaMap. To perform the experiments, we created two multilingual corpora. The first one, Multi-Lingual UVigoMED (ML-UVigoMED), is composed of 23,647 Wikipedia documents about biomedical topics written in English, German, French, Spanish, Italian, Galician, Romanian, and Icelandic. The second one, English-French-Spanish-German UVigoMED (EFSG-UVigoMED), is composed of 19,210 biomedical abstracts extracted from MEDLINE written in English, French, Spanish, and German. The proposed approach outperforms every state-of-the-art classifier in the benchmark. We conclude that leveraging Wikipedia knowledge is of great advantage in tasks of multilingual classification of biomedical documents.
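The idea of cross-language concept matching can be illustrated with a minimal sketch. It is not the authors' implementation: the link table `ES_TO_EN`, the function name `match_concepts`, and the toy weights are all hypothetical, and the choice to drop concepts that have no interlanguage link is an assumption, since the abstract does not say how such concepts are handled.

```python
from typing import Dict

# A document is represented as a mapping from Wikipedia concept title to weight.
ConceptVector = Dict[str, float]

# Hypothetical interlanguage-link table (Spanish title -> English title).
# In practice these pairs would come from Wikipedia interlanguage links.
ES_TO_EN: Dict[str, str] = {
    "Insulina": "Insulin",
    "Páncreas": "Pancreas",
    "Diabetes mellitus": "Diabetes mellitus",
}


def match_concepts(vector: ConceptVector, links: Dict[str, str]) -> ConceptVector:
    """Convert a concept vector into the target language via a link table.

    Assumption: concepts without a link are dropped, and weights of source
    concepts that map to the same target concept are summed.
    """
    converted: ConceptVector = {}
    for concept, weight in vector.items():
        target = links.get(concept)
        if target is None:
            continue  # no interlanguage link available for this concept
        converted[target] = converted.get(target, 0.0) + weight
    return converted


# Toy Spanish document; "Glucemia" has no entry in the hypothetical table.
spanish_doc: ConceptVector = {"Insulina": 0.8, "Páncreas": 0.5, "Glucemia": 0.3}
english_doc = match_concepts(spanish_doc, ES_TO_EN)
print(english_doc)  # {'Insulin': 0.8, 'Pancreas': 0.5}
```

Once converted, the vector lives in the same concept space as the English training documents, so a classifier trained only on English data could in principle be applied to it without translating the text itself.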

Keywords: Biomedical document classification; Hybrid word-concept document representation; Multilingual text classification; Wikipedia Miner semantic annotator; Wikipedia-based bag of concepts document representation.

Publication types

Comparative Study

MeSH terms

Biomedical Research / classification*
Data Mining / methods*
Documentation / classification*
Encyclopedias as Topic*
Humans
Knowledge Bases*
Multilingualism*
Natural Language Processing*
Semantics*