Textquest: document clustering of Medline abstracts for concept discovery in molecular biology

Pac Symp Biocomput. 2001:384-95. doi: 10.1142/9789814447362_0038.

Abstract

We present an algorithm for large-scale document clustering of biological text obtained from Medline abstracts. The algorithm is based on statistical treatment of terms, stemming, the idea of a 'go-list', unsupervised machine learning and graph layout optimization. The method is flexible and robust, and is controlled by a small number of parameters. Experiments show that the resulting document clusters are meaningful, as assessed by cluster-specific terms. Although the approach is statistical and involves minimal semantic analysis, the terms provide a shallow description of the document corpus and support concept discovery.
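The pipeline named in the abstract (stemming, a go-list, statistical term weighting, unsupervised clustering, cluster-specific terms) can be illustrated with a minimal sketch. It is not the authors' implementation, and several choices are assumptions made only for illustration: the 'go-list' is read as a whitelist of retained terms (the opposite of a stop-list) built from document-frequency thresholds; a toy suffix stripper stands in for a real stemmer; TF-IDF stands in for the "statistical treatment of terms"; and k-means with cosine similarity stands in for the unspecified unsupervised learning step. Graph layout optimization is omitted. All function names, parameters and sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def crude_stem(word):
    """Toy suffix stripper (assumption: stands in for a real stemmer)."""
    for suffix in ("ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase, keep alphabetic runs, and stem each token."""
    return [crude_stem(w) for w in re.findall(r"[a-z]+", text.lower())]


def build_go_list(docs_tokens, min_df=2, max_df_ratio=0.8):
    """Assumed go-list rule: keep stems found in at least min_df documents
    but in no more than max_df_ratio of all documents."""
    n = len(docs_tokens)
    df = Counter(t for toks in docs_tokens for t in set(toks))
    return sorted(t for t, c in df.items() if c >= min_df and c / n <= max_df_ratio)


def tfidf_vectors(docs_tokens, go_list):
    """Unit-length TF-IDF vectors restricted to go-list terms."""
    n = len(docs_tokens)
    df = Counter(t for toks in docs_tokens for t in set(toks))
    vectors = []
    for toks in docs_tokens:
        tf = Counter(toks)
        vec = [tf[t] * math.log(n / df[t]) for t in go_list]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vectors


def cosine(a, b):
    """Dot product; equals cosine similarity for unit-length inputs."""
    return sum(x * y for x, y in zip(a, b))


def kmeans_cosine(vectors, k, iterations=10):
    """k-means with cosine similarity and deterministic farthest-first seeds."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(
            min(vectors, key=lambda v: max(cosine(v, c) for c in centroids))
        )
    labels = []
    for _ in range(iterations):
        labels = [
            max(range(k), key=lambda j: cosine(v, centroids[j])) for v in vectors
        ]
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                mean = [sum(col) / len(members) for col in zip(*members)]
                norm = math.sqrt(sum(x * x for x in mean)) or 1.0
                centroids[j] = [x / norm for x in mean]
    return labels, centroids


def top_terms(centroid, go_list, n=3):
    """Highest-weighted go-list terms of a centroid (cluster-specific terms)."""
    ranked = sorted(zip(go_list, centroid), key=lambda tw: (-tw[1], tw[0]))
    return [t for t, w in ranked[:n] if w > 0]


def cluster_abstracts(texts, k):
    """Run the whole sketch: tokens -> go-list -> TF-IDF -> clusters -> terms."""
    docs_tokens = [tokenize(t) for t in texts]
    go_list = build_go_list(docs_tokens)
    labels, centroids = kmeans_cosine(tfidf_vectors(docs_tokens, go_list), k)
    return labels, [top_terms(c, go_list) for c in centroids]


# Hypothetical miniature corpus, invented for illustration only.
SAMPLE_ABSTRACTS = [
    "Drosophila embryo segmentation genes",
    "Segmentation genes in the Drosophila embryo",
    "Protein kinase signaling pathways",
    "Kinase signaling and protein phosphorylation",
]

if __name__ == "__main__":
    labels, terms = cluster_abstracts(SAMPLE_ABSTRACTS, k=2)
    for label, text in zip(labels, SAMPLE_ABSTRACTS):
        print(label, terms[label], text)
```

In this sketch the two thresholds of `build_go_list` and the cluster count `k` play the role of the "small number of parameter values" mentioned in the abstract; the actual parameters of the published method may differ.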

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Abstracting and Indexing*
  • Algorithms*
  • Animals
  • Artificial Intelligence
  • Cluster Analysis
  • Drosophila / embryology
  • MEDLINE*
  • Molecular Biology*
  • Terminology as Topic