BMC Bioinformatics. 2013 Feb 13;14:51.
doi: 10.1186/1471-2105-14-51.

Discovery of novel biomarkers and phenotypes by semantic technologies

Carlo A Trugenberger et al. BMC Bioinformatics. 2013.

Abstract

Background: Biomarkers and target-specific phenotypes are important to targeted drug design and individualized medicine, and thus constitute an important aspect of modern pharmaceutical research and development. Increasingly, the discovery of relevant biomarkers is aided by in silico techniques that apply data mining and computational chemistry to large molecular databases. However, an even larger source of valuable information can potentially be tapped for such discoveries: repositories of research documents.

Results: This paper reports on a pilot experiment to discover potential novel biomarkers and phenotypes for diabetes and obesity by self-organized text mining of about 120,000 PubMed abstracts, public clinical trial summaries, and internal Merck research documents. These documents were analyzed directly by the InfoCodex semantic engine, without prior human manipulation such as parsing. Recall and precision, measured against established but different benchmarks, reach up to 30% and 50%, respectively. Retrieval of known entities missed by other, traditional approaches could be demonstrated. Finally, the InfoCodex semantic engine was shown to discover new diabetes and obesity biomarkers and phenotypes. Among these were many interesting candidates with high potential, although noticeable noise (uninteresting or obvious terms) was also generated.
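
As a reminder of what the recall and precision figures above measure, the following is a minimal sketch of scoring a list of mined candidate terms against a benchmark list. It is not the evaluation procedure used in the study; the function name, the case-insensitive exact-match rule, and the example terms are all assumptions made for illustration.

```python
def precision_recall(candidates, benchmark):
    """Score mined candidate terms against a benchmark term list.

    Assumption: a candidate counts as a hit only on a case-insensitive
    exact match; partial matches are ignored in this sketch.
    """
    cand = {c.strip().lower() for c in candidates}
    bench = {b.strip().lower() for b in benchmark}
    hits = cand & bench
    # Precision: share of candidates that appear in the benchmark.
    precision = len(hits) / len(cand) if cand else 0.0
    # Recall: share of benchmark terms that were retrieved.
    recall = len(hits) / len(bench) if bench else 0.0
    return precision, recall


# Hypothetical terms, not data from the study.
mined = ["leptin", "adiponectin", "HbA1c", "body weight"]
reference = ["Leptin", "HbA1c", "ghrelin", "insulin", "resistin"]
precision, recall = precision_recall(mined, reference)
```

With these made-up lists, two of the four candidates are in the benchmark (precision 0.5) and two of the five benchmark terms are retrieved (recall 0.4).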

Conclusions: The reported approach of employing autonomous self-organising semantic engines to aid biomarker discovery, supplemented by appropriate manual curation processes, shows promise. Conservatively, it offers a faster alternative to vocabulary-based processes that depend on humans reading and analyzing all the texts. More optimistically, it could impact pharmaceutical research, for example by shortening the time-to-market of novel drugs or by speeding up early recognition of dead ends and adverse reactions.


Figures

Figure 1
InfoCodex information map. InfoCodex information map obtained for the approximately 115,000 documents of the PubMed repository used for the present experiment. The size of the dot in the center of each class indicates the number of documents assigned to it.
Figure 2
Thomson Reuters obesity algorithm. Obesity example of Thomson Reuters algorithm for scoring matches between InfoCodex output (“All obesity records”) and Thomson Reuters knowledge bases.
Figure 3
PubMed results confidence level distribution. Confidence level distribution of candidates discovered by InfoCodex text mining of the experimental PubMed collection.
Figure 4
PubMed results confidence levels × I2E-manual precision. Correlation between InfoCodex confidence levels (Conf%; purple bars) and precision (light blue bars) against I2E-manual diabetes PubMed benchmark. Pink shading: exact match; yellow shading: partial match. Row 15 (100 Conf%) represents a member of the manually compiled reference set.
Figure 5
PubMed results confidence levels × UMLS match type. Confidence levels of novel InfoCodex biomarker/phenotype candidates from PubMed broken down by match type to UMLS terms (100% refers to the manually discovered reference/training set).
Figure 6
ClinicalTrials.gov results confidence levels × UMLS match type. Confidence levels of novel InfoCodex biomarker/phenotype candidates from ClinicalTrials.gov broken down by match type to UMLS terms (100% refers to the reference/training set).
Figure 7
Merck P3 results confidence levels × UMLS match type. Confidence levels of novel InfoCodex biomarker/phenotype candidates from Merck internal research documents broken down by match type to UMLS terms (100% indicates the reference/training set).
Figure 8
Novel candidates repository overlap. Overlap between novel InfoCodex biomarker/phenotype candidates from PubMed (PM), ClinicalTrials.gov (CT), and Merck internal research documents (P3). Lavender shading: found in one repository only; dark violet shading: found in all three; others: found in two.
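
The overlap breakdown described for Figure 8 (candidates found in one, two, or all three repositories) can be sketched with plain set operations. This is only an illustrative sketch: the function name and the example terms are hypothetical, and the study's actual term-matching rules are not reproduced here.

```python
def overlap_breakdown(pm, ct, p3):
    """Group candidate terms by the number of repositories containing them.

    pm, ct, p3: iterables of candidate terms from PubMed, ClinicalTrials.gov,
    and the Merck internal documents. Terms are compared by exact equality,
    which is an assumption of this sketch.
    """
    sources = {"PM": set(pm), "CT": set(ct), "P3": set(p3)}
    breakdown = {1: set(), 2: set(), 3: set()}
    for term in set().union(*sources.values()):
        # Count in how many repositories this term was found.
        count = sum(term in terms for terms in sources.values())
        breakdown[count].add(term)
    return breakdown


# Hypothetical candidate terms, not data from the study.
groups = overlap_breakdown(
    pm={"term_a", "term_b", "term_c"},
    ct={"term_b", "term_c"},
    p3={"term_c", "term_d"},
)
```

Here `groups[1]` holds terms found in a single repository, `groups[2]` those found in exactly two, and `groups[3]` those found in all three.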
