BMC Bioinformatics. 2009 Jun 10;10:177.
doi: 10.1186/1471-2105-10-177.

Text-mining of PubMed Abstracts by Natural Language Processing to Create a Public Knowledge Base on Molecular Mechanisms of Bacterial Enteropathogens

Free PMC article

Sam Zaremba et al. BMC Bioinformatics.

Abstract

Background: The Enteropathogen Resource Integration Center (ERIC; http://www.ericbrc.org) aims to provide bioinformatics support for the scientific community researching enteropathogenic bacteria such as Escherichia coli and Salmonella spp. Rapid and accurate identification of experimental conclusions from the scientific literature is critical to support research in this field. Natural Language Processing (NLP), and in particular Information Extraction (IE) technology, can be a significant aid to this process.

Description: We have trained a powerful, state-of-the-art IE technology on a corpus of abstracts from the microbial literature in PubMed to automatically identify and categorize biologically relevant entities and predicative relations. These relations include: Genes/Gene Products and their Roles; Gene Mutations and the resulting Phenotypes; and Organisms and their associated Pathogenicity. Evaluations on blind datasets show an average F-measure above 90% for entities (genes, operons, etc.) and above 70% for relations (gene/gene product to role, etc.). This IE capability, combined with text indexing and relational database technologies, constitutes the core of our recently deployed text mining application.
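
For readers unfamiliar with the metric, the following is a minimal sketch of how an F-measure is computed from match counts. It assumes the balanced F1 form (beta = 1), which the abstract does not specify, and the counts in the usage lines are hypothetical, not figures from the evaluation.

```python
def f_measure(true_positives, false_positives, false_negatives, beta=1.0):
    """Return (precision, recall, F_beta) from raw match counts.

    beta=1.0 gives the balanced F1 score; this default is an assumption,
    since the abstract only says "F-measure".
    """
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    f_score = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_score


# Hypothetical counts for one entity type (not taken from the paper).
precision, recall, f_score = f_measure(
    true_positives=90, false_positives=10, false_negatives=10
)
print(f"precision={precision:.2f} recall={recall:.2f} F={f_score:.2f}")
```

Because the F-measure is a harmonic mean, a system scores well only when precision and recall are both high.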

Conclusion: Our Text Mining application is available online on the ERIC website (http://www.ericbrc.org/portal/eric/articles). The information retrieval interface displays a list of recently published enteropathogen literature abstracts, and also provides a search interface to execute custom queries by keyword, date range, etc. Upon selection, a processed abstract and the entities and relations extracted from it are retrieved from a relational database, and the abstract is marked up to highlight them. The displayed abstract also links extracted genes and gene products to the ERIC Annotations database, thus providing access to comprehensive genomic annotations and adding value to both the text-mining and annotations systems.
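
The retrieve-and-mark-up step described above can be illustrated with a short sketch. Everything here is hypothetical: the `entity` table, its columns, the placeholder identifier, the sample sentence, and the `highlight` helper are assumptions made for illustration and do not reflect the actual ERIC schema or implementation.

```python
import html
import sqlite3


def highlight(text, spans):
    """Wrap each (start, end, label) character span in a <span> tag.

    Overlapping spans are skipped in this simple sketch.
    """
    out, cursor = [], 0
    for start, end, label in sorted(spans):
        if start < cursor:
            continue
        out.append(html.escape(text[cursor:start]))
        out.append(
            '<span class="%s">%s</span>'
            % (html.escape(label), html.escape(text[start:end]))
        )
        cursor = end
    out.append(html.escape(text[cursor:]))
    return "".join(out)


# Illustrative sentence and placeholder identifier, not real ERIC data.
text = "The rpoS gene regulates virulence in Salmonella."
pmid = 1

# Hypothetical schema: one row per extracted entity, with character offsets.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entity (pmid INTEGER, start_pos INTEGER, end_pos INTEGER, label TEXT)"
)
for surface, label in [("rpoS", "gene"), ("Salmonella", "organism")]:
    start = text.index(surface)
    conn.execute(
        "INSERT INTO entity VALUES (?, ?, ?, ?)",
        (pmid, start, start + len(surface), label),
    )

rows = conn.execute(
    "SELECT start_pos, end_pos, label FROM entity WHERE pmid = ? ORDER BY start_pos",
    (pmid,),
).fetchall()
marked = highlight(text, rows)
print(marked)
```

Storing character offsets rather than pre-rendered markup keeps the original abstract text intact and lets the display layer decide how each entity type is styled.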

Figures

Figure 1
An overview of the ERIC Literature Text Mining population process.
Figure 2
(Left) The Latest Articles tab lists PubMed abstracts involving enteropathogens published over the previous 7 days. (Right) The Search tab supports query by keyword(s) and phrases, PMID, date range, and/or journal. The PMID link of a title retrieves the abstract in the ERIC text mining interface.
Figure 3
ERIC text mining interface of a PubMed abstract processed by NetOwl®.
Figure 4
Detail of the Relationships Extracted panel on the ERIC text mining interface.
Figure 5
Montage showing the workflow from an extracted gene/gene product in the text-mining interface to the ASAP annotations database.
Figure 6
Detailed Feature page in ERIC-ASAP. Community users viewing newly extracted information may alert ERIC via the "Add a note to the curator" button (inset).
