Textquest: document clustering of Medline abstracts for concept discovery in molecular biology

I Iliopoulos; A J Enright; C A Ouzounis

doi:10.1142/9789814447362_0038

Pac Symp Biocomput. 2001:384-95. doi: 10.1142/9789814447362_0038.

Authors

I Iliopoulos¹, A J Enright, C A Ouzounis

Affiliation

¹ Computational Genomics Group, The European Bioinformatics Institute, EMBL Cambridge Outstation, Cambridge CB10 1SD, UK. ioannis@ebi.ac.uk

PMID: 11262957

Abstract

We present an algorithm for large-scale document clustering of biological text obtained from Medline abstracts. The algorithm is based on the statistical treatment of terms, stemming, the idea of a 'go-list', unsupervised machine learning and graph layout optimization. The method is flexible and robust, and is controlled by a small number of parameter values. Experiments show that the resulting document clusters are meaningful, as assessed by cluster-specific terms. Although the approach is statistical and uses minimal semantic analysis, the terms provide a shallow description of the document corpus and support concept discovery.
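
The pipeline named in the abstract (term statistics, stemming, a 'go-list', unsupervised grouping) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no parameter values or formulas, so everything below is an assumption made for illustration. The suffix-stripping `stem` function is a crude stand-in for a real stemmer, the go-list is taken to be the set of stems whose document frequency falls between the hypothetical bounds `min_df` and `max_df_ratio`, the term vectors use raw counts, and the clustering step is swapped for connected components over a cosine-similarity threshold. Graph layout optimization is omitted.

```python
import math
import re
from collections import Counter

# Hypothetical, crude suffix list; a real system would use a proper stemmer.
SUFFIXES = ("ing", "ed", "es", "s")


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase, keep alphabetic runs, and stem each token."""
    return [stem(w) for w in re.findall(r"[a-z]+", text.lower())]


def build_go_list(docs_tokens, min_df=2, max_df_ratio=0.8):
    """Assumed go-list: stems that are neither too rare nor too common."""
    n_docs = len(docs_tokens)
    doc_freq = Counter(t for tokens in docs_tokens for t in set(tokens))
    return {
        term
        for term, count in doc_freq.items()
        if count >= min_df and count / n_docs <= max_df_ratio
    }


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster(abstracts, threshold=0.3):
    """Group abstracts whose go-list vectors are linked by similarity."""
    docs_tokens = [tokenize(a) for a in abstracts]
    go_list = build_go_list(docs_tokens)
    vectors = [Counter(t for t in toks if t in go_list) for toks in docs_tokens]

    # Union-find over document indices.
    parent = list(range(len(vectors)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(vectors)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Calling `cluster` on a list of abstract strings returns lists of document indices, one list per cluster. The pairwise comparison is quadratic in the number of documents, so a sketch like this is only suitable for small collections.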

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Abstracting and Indexing*
Algorithms*
Animals
Artificial Intelligence
Cluster Analysis
Drosophila / embryology
MEDLINE*
Molecular Biology*
Terminology as Topic