A thematic analysis of the AIDS literature

Pac Symp Biocomput. 2002:386-97.

Abstract

When a large collection of objects must be made comprehensible to a human reader, a time-honored approach is to cluster the objects into groups of closely related objects. Each group is then summarized in some convenient manner to provide a more manageable view of the data. Such methods have been applied to document collections with mixed results. If a hard clustering of the data into mutually exclusive clusters is performed, documents are frequently forced into a single cluster even though they contain important information that would also make them appropriate candidates for other clusters. If a soft clustering is used, there still remains the problem of how to provide a useful summary of the data in a cluster. Here we introduce a new algorithm, based on the concept of a theme, that produces a soft clustering of document collections. A theme is conceptually a subject area that is discussed by multiple documents in the database. A theme has two potential representations that may be viewed as dual to each other. First, it is represented by the set of documents that discuss the subject or theme; second, it is represented by the set of key terms that are typically used to discuss the theme. Our algorithm is an EM algorithm in which the term representation and the document representation are explicit components, and each is used to refine the other in an alternating fashion. Upon convergence, the term representation provides a natural summary of the document representation (the cluster). We describe how to optimize the themes produced by this process and give the results of applying the method to a database of over fifty thousand PubMed documents dealing with the subject of AIDS. We also discuss how themes may improve access to a document collection.
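
The alternating refinement of a theme's term representation and document representation can be illustrated with a minimal sketch. This is not the EM algorithm of the paper, whose model and update equations are not given in the abstract. It is a simplified stand-in that assumes documents are sets of terms, weights terms by a smoothed log-odds ratio, and turns document scores into soft memberships with a logistic function. The function name `refine_theme`, its parameters, and the toy corpus are all hypothetical.

```python
import math


def refine_theme(docs, seed, n_terms=3, n_iter=10):
    """Alternately refine the terms and the documents of one theme.

    docs: list of sets of terms (an assumed, simplified document model).
    seed: set of document indices used to start the theme.
    Returns (terms, member): the selected key terms and a soft
    membership in [0, 1] for every document.
    """
    vocab = sorted(set().union(*docs))
    member = [1.0 if i in seed else 0.0 for i in range(len(docs))]
    terms = []
    for _ in range(n_iter):
        # Term step: weight each term by the smoothed log-odds of its
        # occurring inside the theme versus outside it.
        in_mass = sum(member)
        out_mass = len(docs) - in_mass
        weights = {}
        for t in vocab:
            in_t = sum(m for m, d in zip(member, docs) if t in d)
            out_t = sum(1.0 - m for m, d in zip(member, docs) if t in d)
            p = (in_t + 0.5) / (in_mass + 1.0)
            q = (out_t + 0.5) / (out_mass + 1.0)
            weights[t] = math.log(p / (1.0 - p)) - math.log(q / (1.0 - q))
        # Keep the highest-weighted terms as the theme's summary.
        terms = sorted(vocab, key=lambda t: (-weights[t], t))[:n_terms]

        # Document step: score each document by the selected terms it
        # contains, then map the score to a soft membership.
        half = sum(weights[t] for t in terms) / 2.0
        scores = [sum(weights[t] for t in terms if t in d) for d in docs]
        member = [1.0 / (1.0 + math.exp(-(s - half))) for s in scores]
    return terms, member


# Toy corpus (invented for illustration only).
docs = [
    {"hiv", "vaccine", "trial"},
    {"hiv", "vaccine", "antibody"},
    {"vaccine", "trial", "antibody"},
    {"hiv", "transmission", "condom"},
    {"transmission", "condom", "education"},
    {"condom", "education", "hiv"},
]
terms, member = refine_theme(docs, seed={0})
print(terms)
print([round(m, 2) for m in member])
```

Because membership is soft, a document can belong to several themes at once if the procedure is run from different seeds, and the selected terms serve as the summary of each resulting cluster.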

MeSH terms

  • Acquired Immunodeficiency Syndrome*
  • Algorithms
  • Cluster Analysis
  • Humans
  • MEDLINE*