J Am Med Inform Assoc. May-Jun 2007;14(3):253-63.
doi: 10.1197/jamia.M2233. Epub 2007 Feb 28.

Essie: A Concept-Based Search Engine for Structured Biomedical Text

Free PMC article

Nicholas C Ide et al. J Am Med Inform Assoc.

Abstract

This article describes the algorithms implemented in the Essie search engine, which currently serves several Web sites at the National Library of Medicine. Essie is a phrase-based search engine with term and concept query expansion and probabilistic relevancy ranking. Essie's design is motivated by the observation that query terms are often conceptually related to terms in a document without actually occurring in the document text. Essie's performance was evaluated using data and standard evaluation methods from the 2003 and 2006 Text REtrieval Conference (TREC) Genomics track. Essie was the best-performing search engine in the 2003 TREC Genomics track and achieved results comparable to those of the highest-ranking systems on the 2006 TREC Genomics track task. Essie shows that a judicious combination of exploiting document structure, phrase searching, and concept-based query expansion is a useful approach for information retrieval in the biomedical domain.

Figures

Figure 1
Abstract diagram of Essie’s scoring algorithm. Term occurrences are weighted by: (1) the similarity to the user’s query, and (2) the importance of the field where they are found. For example, in a search for “heart attack,” a document with “heart attack” in the title (point A) would score higher than a document with “myocardial infarction” in the abstract (point B).
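
The two weighting factors in this caption can be illustrated with a minimal sketch. The numeric weights, the multiplication of the two factors, and the use of the best occurrence as the document score are all assumptions made for illustration; the caption does not give Essie's actual probabilistic formula.

```python
# Hypothetical weights; the source gives no numeric values.
SIMILARITY_WEIGHT = {"exact": 1.0, "synonym": 0.7}   # closeness to the user's query
FIELD_WEIGHT = {"title": 1.0, "abstract": 0.5}       # importance of the field

def occurrence_score(match_kind, field):
    """Weight one term occurrence by query similarity and field importance."""
    return SIMILARITY_WEIGHT[match_kind] * FIELD_WEIGHT[field]

def document_score(occurrences):
    """Score a document by its best occurrence (an assumed aggregation rule)."""
    return max((occurrence_score(kind, field) for kind, field in occurrences),
               default=0.0)

# Point A: "heart attack" in the title; point B: "myocardial infarction" in the abstract.
doc_a = [("exact", "title")]
doc_b = [("synonym", "abstract")]
print(document_score(doc_a) > document_score(doc_b))
```

Under these assumed weights, document A outranks document B, matching the ordering described in the caption.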
Figure 2
Index building and related preprocessing. Token adjacency indexes are derived from the corpus and support efficient searches for arbitrary phrases. Word variants are extracted primarily from the Unified Medical Language System (UMLS) (additional compound words and plurals are mined from the corpus), and are used in term expansion. Synonymy is extracted from the UMLS and is used for concept expansion.
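
As a rough illustration of how a position-aware index supports arbitrary phrase searches, the sketch below stores each token with its document and position and then checks that phrase tokens occur consecutively. The layout, the whitespace tokenizer, and the names `build_index` and `phrase_hits` are assumptions; the caption does not describe the internal structure of Essie's token adjacency indexes.

```python
from collections import defaultdict

def build_index(corpus):
    """Map each token to the set of (doc_id, position) pairs where it occurs."""
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for pos, token in enumerate(text.lower().split()):
            index[token].add((doc_id, pos))
    return index

def phrase_hits(index, phrase):
    """Return sorted (doc_id, start_position) pairs where the phrase occurs."""
    tokens = phrase.lower().split()
    if not tokens:
        return []
    starts = set(index.get(tokens[0], set()))
    for offset, token in enumerate(tokens[1:], start=1):
        postings = index.get(token, set())
        # Keep a start only if the next phrase token sits at the expected offset.
        starts = {(d, p) for (d, p) in starts if (d, p + offset) in postings}
    return sorted(starts)

corpus = {1: "acute heart attack risk", 2: "attack on heart tissue"}
index = build_index(corpus)
print(phrase_hits(index, "heart attack"))  # [(1, 1)]
```

Document 2 contains both words but not adjacently, so only document 1 is returned.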
Figure 3
Search processing. Queries are parsed to extract search syntax and search texts. Syntax operators can control query expansion, but the default is relaxation expansion, which extends concept and term expansion. Expansion results in a large set of variations of the original search text, all of which are searched as phrases. Hits in the corpus are collected, and the documents containing them are scored, ranked, and returned.
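
The expansion step can be sketched as follows: concept expansion adds synonymous phrases, and term expansion then substitutes word variants token by token, yielding a set of phrases to search. The `SYNONYMS` and `VARIANTS` tables are tiny hypothetical stand-ins for the UMLS-derived data, and relaxation expansion and the search syntax operators are not modeled here.

```python
from itertools import product

# Hypothetical stand-ins for UMLS-derived synonymy and word variants.
SYNONYMS = {"heart attack": ["myocardial infarction"]}
VARIANTS = {"attack": ["attacks"], "infarction": ["infarctions"]}

def expand_terms(phrase):
    """Term expansion: substitute known word variants token by token."""
    options = [[tok] + VARIANTS.get(tok, []) for tok in phrase.split()]
    return [" ".join(combo) for combo in product(*options)]

def expand_query(text):
    """Concept expansion followed by term expansion, without duplicates."""
    text = text.lower().strip()
    concepts = [text] + SYNONYMS.get(text, [])
    expanded = []
    for phrase in concepts:
        for variant in expand_terms(phrase):
            if variant not in expanded:
                expanded.append(variant)
    return expanded

print(expand_query("heart attack"))
```

Each returned string would then be searched as a phrase, and the documents containing the resulting hits would be scored and ranked.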
Figure 4
A search expansion tree. Leaf nodes load lists of occurrences (aka hits) for tokens as found in the token adjacency indexes. Adjacent and merge nodes build up multitoken phrase hits. The stretch operation extends hits to include optional extra tokens on the right. Evaluation of the entire tree produces hits for the term expansion of “non-hodgkin’s lymphoma.”
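
The four node types named in this caption can be mimicked with small set operations over hits of the form `(doc_id, start, end)`. This is a sketch under stated assumptions: the toy tokenization, the toy corpus, the function names, and the particular tree built at the end are illustrative and are not taken from Figure 4 itself.

```python
from collections import defaultdict

def build_index(docs):
    """Toy index: token -> set of single-token hits (doc_id, start, end)."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for pos, token in enumerate(tokens):
            index[token].add((doc_id, pos, pos))
    return index

def leaf(index, token):
    """Leaf node: load the list of hits for one token."""
    return set(index.get(token, set()))

def adjacent(left, right):
    """Adjacent node: join a left hit with a right hit that starts just after it."""
    by_start = defaultdict(list)
    for d, s, e in right:
        by_start[(d, s)].append(e)
    return {(d, s, e2) for d, s, e in left for e2 in by_start.get((d, e + 1), [])}

def merge(*hit_sets):
    """Merge node: union the hits of alternative sub-expressions."""
    return set().union(*hit_sets)

def stretch(hits, optional):
    """Stretch: extend a hit over an optional token on its right, if present.
    Assumes `optional` holds single-token hits."""
    opt_end = {(d, s): e for d, s, e in optional}
    return {(d, s, opt_end.get((d, e + 1), e)) for d, s, e in hits}

docs = {
    1: ["non", "hodgkin", "s", "lymphoma"],
    2: ["nonhodgkin", "lymphomas"],
    3: ["non", "hodgkin", "lymphoma"],
    4: ["hodgkin", "lymphoma"],
}
idx = build_index(docs)

# Assumed tree: ("non" + "hodgkin" [+ optional "s"]) or "nonhodgkin",
# followed by "lymphoma" or "lymphomas".
name = merge(
    adjacent(leaf(idx, "non"), stretch(leaf(idx, "hodgkin"), leaf(idx, "s"))),
    leaf(idx, "nonhodgkin"),
)
hits = adjacent(name, merge(leaf(idx, "lymphoma"), leaf(idx, "lymphomas")))
print(sorted(hits))  # [(1, 0, 3), (2, 0, 1), (3, 0, 2)]
```

Document 4 mentions "hodgkin lymphoma" without the "non" prefix, so the tree produces no hit for it.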
