PLoS Comput Biol. 2013;9(2):e1002854.
doi: 10.1371/journal.pcbi.1002854. Epub 2013 Feb 7.

Getting More Out of Biomedical Documents With GATE's Full Lifecycle Open Source Text Analytics

Free PMC article

Hamish Cunningham et al. PLoS Comput Biol. 2013.

Abstract

This software article describes the GATE family of open source text analysis tools and processes. GATE is one of the most widely used systems of its type, with yearly download rates of tens of thousands and many active users in both academic and industrial contexts. In this paper we report three examples of GATE-based systems operating in the life sciences and in medicine: first, in genome-wide association studies, where they have contributed to the discovery of a head and neck cancer mutation association; second, in medical records analysis, which has significantly increased the statistical power of treatment/outcome models in the UK's largest psychiatric patient cohort; and third, in richer constructs for drug-related searching. We also explore the ways in which the GATE family supports the various stages of the lifecycle present in our examples. We conclude that the deployment of text mining for document abstraction or rich search and navigation is best thought of as a process, and that with the right computational tools and data collection strategies this process can be made well defined and repeatable. The GATE research programme is now 20 years old and has grown from its roots as a specialist development tool for text processing into a rather comprehensive ecosystem, bringing together software developers, language engineers and research staff from diverse fields. GATE now has a strong claim to cover a uniquely wide range of the lifecycle of text analysis systems. It forms a focal point for the integration and reuse of advances made by many people (the majority outside of the authors' own group) who work in text processing for biomedicine and other areas. GATE is available online <1> under GNU open source licences and runs on all major operating systems. Support is available from an active user and developer community and also on a commercial basis.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Figure 1. The GATE developer interface.
Figure 2. GATE embedded APIs.
GATE provides a set of Java APIs, called GATE Embedded. This figure summarises the modules provided. Language resources (LRs) are data-only resources such as lexica, corpora or ontologies. Processing Resources (PRs) are principally programmatic or algorithmic. Visual resources (VRs) allow users to interact visually with other resources.
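
The split between data-only language resources and algorithmic processing resources can be illustrated with a small toy. This is a minimal Python sketch and not the GATE Embedded Java API: the class names `Document`, `WhitespaceTokeniser` and `UppercaseTagger`, the `execute` method and the `run_pipeline` helper are all assumptions made for illustration, and visual resources are left out.

```python
import re


class Document:
    """Toy language resource: holds data only (text plus annotations)."""

    def __init__(self, text):
        self.text = text
        self.annotations = []  # list of (start, end, type) tuples


class WhitespaceTokeniser:
    """Toy processing resource: adds one Token per run of non-space characters."""

    def execute(self, doc):
        for match in re.finditer(r"\S+", doc.text):
            doc.annotations.append((match.start(), match.end(), "Token"))


class UppercaseTagger:
    """Toy processing resource: marks all-uppercase tokens as Acronym."""

    def execute(self, doc):
        # Iterate over a copy because we append to the list while scanning it.
        for start, end, ann_type in list(doc.annotations):
            if ann_type == "Token" and doc.text[start:end].isupper():
                doc.annotations.append((start, end, "Acronym"))


def run_pipeline(processing_resources, corpus):
    """Apply each processing resource, in order, to every document."""
    for doc in corpus:
        for resource in processing_resources:
            resource.execute(doc)


corpus = [Document("GATE processes text")]
run_pipeline([WhitespaceTokeniser(), UppercaseTagger()], corpus)
```

The documents carry no behaviour of their own; everything that changes them lives in the processing resources, which is the separation the caption describes.
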
Figure 3. An annotation graph.
In GATE, annotations are encoded by associating features with character offsets, indicating the text to which they pertain.
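
The idea of attaching features to character offsets can be sketched in a few lines. The following is a minimal Python illustration of such standoff annotation, not GATE's own data model: the `Annotation` class, the helper functions, the sample sentence and the feature values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """One standoff annotation: a typed character span plus features."""
    start: int  # offset of the first character covered
    end: int    # offset one past the last character covered
    type: str
    features: dict = field(default_factory=dict)


def covered_text(text, ann):
    """Return the stretch of text that an annotation points at."""
    return text[ann.start:ann.end]


def annotations_at(annotations, offset):
    """Return every annotation whose span contains the given offset."""
    return [a for a in annotations if a.start <= offset < a.end]


# Hypothetical example text and annotations; the text itself is never modified.
text = "GATE annotates biomedical text."
annotations = [
    Annotation(0, 4, "Token", {"kind": "word"}),
    Annotation(0, 4, "Software", {"source": "hypothetical lexicon"}),
    Annotation(15, 30, "Phrase", {"head": "text"}),
]
```

Because annotations only refer to offsets, several of them can cover the same span (here a `Token` and a `Software` annotation over "GATE") without interfering with each other or with the underlying text.
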
Figure 4. Chinese annotations.
In GATE's document view, annotations are shown as highlighted sections of text. This figure shows Chinese text with highlighted annotations. The annotations are listed at the bottom, showing their type, offsets and features.
Figure 5. ANNIC (ANNotations In Context).
ANNIC supports complex queries, such as the one in this figure, which searches for person annotations followed by past tense verbs followed by organisation names. The query appears in the third line from the top; the pattern described is for person annotations followed by organisation annotations. All matching text ranges appear in the lower half of the tool, with a graphical representation of the individual annotations concerned in the middle part.
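
A query of this shape (person, then past tense verb, then organisation) amounts to matching a pattern over a sequence of annotations. The sketch below is a simplified Python illustration and does not reproduce ANNIC's query syntax or implementation. It assumes one non-overlapping annotation per position, sorted by start offset; the function name `find_sequences`, the feature name `tense` and the sample sentence are all hypothetical.

```python
def find_sequences(annotations, pattern):
    """Return (start, end) text ranges where consecutive annotations match.

    annotations: list of dicts with "type", "start", "end" and "features",
                 sorted by start offset.
    pattern:     list of (type, required_features) pairs.
    """
    matches = []
    width = len(pattern)
    for i in range(len(annotations) - width + 1):
        window = annotations[i:i + width]
        if all(
            ann["type"] == want_type
            and all(ann["features"].get(k) == v for k, v in want_features.items())
            for ann, (want_type, want_features) in zip(window, pattern)
        ):
            matches.append((window[0]["start"], window[-1]["end"]))
    return matches


# Hypothetical text and annotations, for illustration only.
text = "Alice joined Acme. Bob leads Initech."
annotations = [
    {"type": "Person", "start": 0, "end": 5, "features": {}},
    {"type": "Verb", "start": 6, "end": 12, "features": {"tense": "past"}},
    {"type": "Organization", "start": 13, "end": 17, "features": {}},
    {"type": "Person", "start": 19, "end": 22, "features": {}},
    {"type": "Verb", "start": 23, "end": 28, "features": {"tense": "present"}},
    {"type": "Organization", "start": 29, "end": 36, "features": {}},
]

pattern = [("Person", {}), ("Verb", {"tense": "past"}), ("Organization", {})]
matches = find_sequences(annotations, pattern)
```

Only the first sentence matches, because the verb in the second one is in the present tense. A real index would avoid scanning every window, but the matching logic is the same in spirit.
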
Figure 6. Mímir index size.
As this figure shows, in later versions of Mímir, software improvements meant that the index could be reduced in size, allowing much larger document collections to be indexed.
Figure 7. Co-occurrence search.
Faceted search allows users to apply multiple filters. Here, Hydralazine Hydrochloride has been selected as the Active Ingredient and the user has started typing ‘AST’ in the Applicant column.
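
Combining filters in this way can be sketched as an exact match on one facet plus a prefix match on another. This is a minimal Python illustration under stated assumptions, not the search tool shown in the figure: the function `faceted_filter`, the field names and all applicant names are made up, and only the ingredient name and the typed prefix ‘AST’ come from the caption.

```python
def faceted_filter(records, exact=None, prefix=None):
    """Keep records that satisfy every exact-match and every prefix filter.

    exact:  {field: value} pairs that must match exactly.
    prefix: {field: typed_text} pairs matched case-insensitively against
            the start of the field value, as when typing into a column.
    """
    exact = exact or {}
    prefix = prefix or {}

    def keep(record):
        exact_ok = all(record.get(f) == v for f, v in exact.items())
        prefix_ok = all(
            str(record.get(f, "")).upper().startswith(p.upper())
            for f, p in prefix.items()
        )
        return exact_ok and prefix_ok

    return [r for r in records if keep(r)]


# Hypothetical records; the applicant names are invented for illustration.
records = [
    {"active_ingredient": "Hydralazine Hydrochloride", "applicant": "ASTER LABS"},
    {"active_ingredient": "Hydralazine Hydrochloride", "applicant": "BOREAL PHARMA"},
    {"active_ingredient": "Example Compound", "applicant": "ASTER LABS"},
]

hits = faceted_filter(
    records,
    exact={"active_ingredient": "Hydralazine Hydrochloride"},
    prefix={"applicant": "AST"},
)
```

Each additional facet narrows the result set further, so the filters compose by logical AND.
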

Cited by 53 articles

References

    1. Cohen KB, Hunter L (2008) Getting started in text mining. PLoS Comput Biol 4: e20.
    2. Rzhetsky A, Seringhaus M, Gerstein MB (2009) Getting started in text mining: Part two. PLoS Comput Biol 5: e1000411.
    3. Rodriguez-Esteban R (2009) Biomedical text mining and its applications. PLoS Comput Biol 5: e1000597.
    4. Cunningham H, Maynard D, Bontcheva K, Tablan V (2002) GATE: an architecture for development of robust HLT applications. In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, 7–12 July 2002. Stroudsburg, PA, USA: Association for Computational Linguistics, ACL '02, pp. 168–175. doi:10.3115/1073083.1073112. URL http://gate.ac.uk/sale/acl02/acl-main.pdf.
    5. Cunningham H, Maynard D, Bontcheva K, Tablan V, Aswani N, et al. (2011) Text Processing with GATE (Version 6). The University of Sheffield. Available: http://tinyurl.com/gatebook.
