, 2016

Discovering Biomedical Semantic Relations in PubMed Queries for Information Retrieval and Database Curation

Chung-Chi Huang et al. Database (Oxford).

Abstract

Identifying relevant papers from the literature is a common task in biocuration. Most current biomedical literature search systems rely primarily on matching user keywords. Semantic search, on the other hand, seeks to improve search accuracy by understanding the entities and contextual relations in user keywords. However, past research has mostly focused on semantically identifying biological entities (e.g. chemicals, diseases and genes), with little effort devoted to discovering semantic relations. In this work, we aim to discover biomedical semantic relations in PubMed queries in an automated and unsupervised fashion. Specifically, we focus on extracting and understanding the contextual information (or context patterns) that PubMed users employ to represent semantic relations between entities, such as 'CHEMICAL-1 compared to CHEMICAL-2'. With the advances in automatic named entity recognition, we first tag entities in PubMed queries and then use the tagged entities as knowledge to recognize pattern semantics. More specifically, we transform PubMed queries into context patterns involving participating entities, which are subsequently projected to latent topics via latent semantic analysis (LSA) to avoid data sparseness and specificity issues. Finally, we mine semantically similar contextual patterns, or semantic relations, based on LSA topic distributions. Our two separate evaluation experiments on chemical-chemical (CC) and chemical-disease (CD) relations show that the proposed approach significantly outperforms a baseline method, which simply measures pattern semantics by similarity in participating entities. The highest performance achieved by our approach is nearly 0.9 and 0.85 for the CC and CD tasks, respectively, when compared against the ground truth in terms of normalized discounted cumulative gain (nDCG), a standard measure of ranking quality.
These results suggest that our approach can effectively identify and return related semantic patterns in a ranked order covering diverse bio-entity relations. To assess the potential utility of our automated top-ranked patterns of a given relation in semantic search, we performed a pilot study on frequently sought semantic relations in PubMed and observed improved literature retrieval effectiveness based on post-hoc human relevance evaluation. Further investigation in larger tests and in real-world scenarios is warranted.
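
The pattern-ranking and evaluation steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy count matrix, the pattern strings other than 'CHEMICAL-1 compared to CHEMICAL-2', the function names, the use of cosine similarity, and the linear-gain form of nDCG are all assumptions made for illustration. LSA is approximated here by a truncated singular value decomposition of a pattern-by-entity-pair count matrix.

```python
import math

import numpy as np

# Hypothetical toy data: rows are context patterns, columns are entity pairs,
# and each cell counts how often the pattern was seen with that entity pair.
PATTERNS = [
    "CHEMICAL-1 compared to CHEMICAL-2",
    "CHEMICAL-1 versus CHEMICAL-2",
    "CHEMICAL-1 and CHEMICAL-2 interaction",
]
COUNTS = np.array(
    [
        [4.0, 3.0, 0.0, 0.0],
        [3.0, 4.0, 1.0, 0.0],
        [0.0, 0.0, 5.0, 2.0],
    ]
)


def lsa_topic_vectors(counts, num_topics):
    """Project each pattern (row) onto the top `num_topics` latent topics."""
    u, s, _ = np.linalg.svd(counts, full_matrices=False)
    return u[:, :num_topics] * s[:num_topics]


def rank_similar(topic_vectors, query_index):
    """Return the other pattern indices, most similar first (cosine similarity)."""
    norms = np.linalg.norm(topic_vectors, axis=1)
    norms[norms == 0.0] = 1.0  # avoid division by zero for all-zero rows
    unit = topic_vectors / norms[:, None]
    sims = unit @ unit[query_index]
    order = np.argsort(-sims)
    return [int(i) for i in order if i != query_index]


def ndcg(relevances):
    """nDCG of a ranked list of graded relevance scores (linear gain, log2 discount)."""

    def dcg(rels):
        return sum(r / math.log2(pos + 2) for pos, r in enumerate(rels))

    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    vectors = lsa_topic_vectors(COUNTS, num_topics=2)
    for index in rank_similar(vectors, query_index=0):
        print(PATTERNS[index])
    # Relevance grades here are made up; a perfectly ordered list scores 1.0.
    print(ndcg([3, 2, 1]))
```

In this sketch, a ranking returned by `rank_similar` would be scored by mapping each returned pattern to a human-assigned relevance grade and passing the resulting list to `ndcg`. The baseline mentioned in the abstract would correspond roughly to comparing rows of the raw count matrix directly, without the projection step.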

Figures

Figure 1. Workflow and application of SIP.
Figure 2. Semantically similar pattern finding.
Figure 3. System performance on the CC task with different LSA topic numbers (10–150) and different numbers of the most frequent entity pairs (500–3000). Strict match is required. The solid line represents best-performing SIP while the dotted line represents the baseline.
Figure 4. System performance on the CD task with different LSA topic numbers (10–150) and different numbers of the most frequent entity pairs (500–3000). Strict match is required. The solid line represents best-performing SIP while the dotted line represents the baseline.
Figure 5. nDCG results on our (a) CC task and (b) CD task when both strict-match and relaxed-match are allowed. The solid line represents best-performing SIP while the dotted line represents the baseline.
