Creating a text data-mining application for use in public health informatics

Conf Proc IEEE Eng Med Biol Soc. 2004;2004:3214-6. doi: 10.1109/IEMBS.2004.1403905.


Recent litigation and the Master Settlement Agreement of 1998 have made millions of internal tobacco industry documents available on the Internet. The Legacy interface, housed at the University of California, San Francisco, is based on a traditional information retrieval model in which documents are indexed and then retrieved in response to user-specified queries. One problem with the Legacy interface is information overload. To ease this problem, we are developing a text-mining interface that supports exploratory analysis and discovery of information in collections of data. By helping users uncover new patterns and concepts, text mining could lead to more targeted and specific searches, which would reduce information overload. To determine users' information needs, we conducted nine in-depth interviews with regular users of the Legacy interface. Participants identified clustering as a useful tool for identifying and extracting key concepts, and they identified a need to recognize relationships between terms and concepts within the data. We encourage researchers who are developing text-mining interfaces to survey their users to learn which aspects of their research could be enhanced by text mining.