Mining the literature for genes associated with placenta-mediated maternal diseases

AMIA Annu Symp Proc. 2018 Apr 16;2017:1498-1506. eCollection 2017.


Automated literature analysis could significantly speed up understanding of the role of the placenta and the impact of its development and functions on the health of the mother and the child. To facilitate automatic extraction of information about placenta-mediated disorders from the literature, we manually annotated genes and proteins, the associated diseases, and the functions and processes involved in the development and function of the placenta in a collection of PubMed/MEDLINE abstracts. We developed three baseline approaches to finding sentences containing this information: one based on supervised machine learning (ML) and two based on distant supervision, using (1) automated detection of named entities and (2) MeSH. We compare the performance of several well-known supervised ML algorithms and identify two approaches, Support Vector Machines (SVM) and Generalized Linear Models (GLM), that yield up to 98% recall, precision, and F1 score. We demonstrate that distant supervision approaches could be used instead, at the expense of missing up to 15% of relevant documents.
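As a rough illustration of the evaluation described above, the sketch below labels sentences with a simple dictionary lookup (a stand-in for distant supervision through named-entity detection) and scores the labels against manual annotations using precision, recall, and F1. The lexicons, example sentences, gold labels, and function names are all hypothetical and chosen only for illustration; they are not the study's data, annotation scheme, or implementation.

```python
import re

# Hypothetical toy lexicons (assumption): a real system would rely on a
# named-entity recognizer or curated vocabularies, not hand-picked terms.
GENE_TERMS = {"flt1", "eng", "vegfa"}
DISEASE_TERMS = {"preeclampsia", "eclampsia"}


def distant_label(sentence):
    """Return 1 if the sentence mentions both a gene term and a disease term."""
    tokens = set(re.findall(r"[a-z0-9]+", sentence.lower()))
    has_gene = bool(tokens & GENE_TERMS)
    has_disease = bool(tokens & DISEASE_TERMS)
    return int(has_gene and has_disease)


def precision_recall_f1(gold, predicted):
    """Compute precision, recall, and F1 for binary labels (1 = relevant)."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Invented example sentences with invented gold labels (assumption).
sentences = [
    "Elevated FLT1 expression was observed in preeclampsia.",
    "ENG levels were measured in healthy controls.",
    "Placental growth factor is reduced in women with preeclampsia.",
    "The cohort included 120 pregnant women.",
]
gold = [1, 0, 1, 0]

predicted = [distant_label(s) for s in sentences]
precision, recall, f1 = precision_recall_f1(gold, predicted)
print(predicted, precision, recall, round(f1, 3))
```

In this toy run the third sentence is relevant but contains no term from the gene lexicon, so the lookup misses it and recall drops while precision stays high. That is the kind of trade-off the abstract reports for distant supervision, where some relevant material is missed in exchange for avoiding manual annotation.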

MeSH terms

  • Data Mining / methods*
  • Disease / genetics*
  • Female
  • Genotype*
  • Humans
  • Linear Models
  • Medical Subject Headings
  • Placenta* / physiology
  • Pregnancy
  • Pregnancy Complications / genetics*
  • Supervised Machine Learning*
  • Support Vector Machine*