Integrative literature and data mining to rank disease candidate genes

Methods Mol Biol. 2014:1159:207-26. doi: 10.1007/978-1-4939-0709-0_12.


Although genomics-derived discoveries promise benefits to basic research and health care, the speed and affordability of sequencing that followed recent technological advances have further aggravated the data deluge. Seamless integration of ever-increasing clinical, genomic, and experimental data, together with efficient mining to extract knowledge, deliver actionable insight, and generate testable hypotheses, is therefore critical to biomedical research. For instance, high-throughput techniques are frequently applied to detect disease candidate genes. Experimental validation of these candidates, however, is both time-consuming and expensive. Hence, several computational approaches based on literature and data mining have been developed to identify the most promising candidates for follow-up studies. Based on the "guilt by association" principle, most of these methods use prior knowledge about a disease of interest to discover and rank novel candidate genes. In this chapter, we provide a brief overview of recent advances in literature- and data-mining-based approaches for candidate gene prioritization. As a case study, we focus on a Web-based computational approach that uses integrated heterogeneous data sources, including gene-literature associations, to rank disease candidate genes, and we explain how to run typical queries using this system.
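
The "guilt by association" idea can be illustrated with a minimal sketch: candidates whose annotations resemble those of known disease genes rank higher. This is not the system described in the chapter. It assumes each gene is represented by a set of annotation terms (for example ontology terms or literature-derived concepts) and uses mean Jaccard similarity to the known genes as the score. The function names, gene names, and terms below are hypothetical placeholders.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two annotation sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_candidates(
    seeds: Dict[str, Set[str]],
    candidates: Dict[str, Set[str]],
) -> List[Tuple[str, float]]:
    """Rank candidates by mean annotation similarity to known disease genes."""
    if not seeds:
        raise ValueError("at least one seed gene is required")
    scored = []
    for gene, terms in candidates.items():
        mean_sim = sum(jaccard(terms, s) for s in seeds.values()) / len(seeds)
        scored.append((gene, mean_sim))
    # Highest score first; ties are broken alphabetically for a stable order.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Hypothetical toy data: gene names and annotation terms are placeholders.
SEEDS = {
    "SEED1": {"term_a", "term_b", "term_c"},
    "SEED2": {"term_b", "term_c", "term_d"},
}
CANDIDATES = {
    "CAND1": {"term_b", "term_c"},
    "CAND2": {"term_x", "term_y"},
    "CAND3": {"term_a", "term_x"},
}

RANKING = rank_candidates(SEEDS, CANDIDATES)
for gene, score in RANKING:
    print(f"{gene}\t{score:.3f}")
```

In this toy example the candidate that shares the most terms with the known genes is listed first and the one sharing none is listed last. Real prioritization tools typically combine several data sources and use more robust statistics than a single set-overlap score.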

Publication types

  • Review

MeSH terms

  • Animals
  • Data Mining / methods*
  • Gene Ontology*
  • Genetic Association Studies / methods*
  • Genomics*
  • Humans