A combined data mining approach for infrequent events: analyzing HIV mutation changes based on treatment history

Comput Syst Bioinformatics Conf. 2006;385-8.


Many biological databases contain a large number of variables, among which events of interest may be very infrequent. Using a single data mining method to analyze such databases may not find adequate predictors. The HIV Drug Resistance Database at Stanford University stores sequential HIV-1 genotype-test results on patients taking antiretroviral drugs. We have analyzed the infrequent event of gene mutation changes by combining three data mining methods. We first use association rule analysis to scan through the database and identify potentially interesting mutation patterns with relatively high frequency. Next, we use logistic regression and classification trees to further investigate these patterns by analyzing the relationship between treatment history and mutation changes. Although the AUC values for the overall predictions are not very high, our approach can effectively identify strong predictors of mutation change and thus focus the analytic efforts of researchers in verifying these results.
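The screening step and the AUC evaluation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the records, the drug labels (drugA, drugB, drugC), the thresholds, and the function names are all hypothetical, and the logistic regression and classification tree stage is replaced here by a trivial "matches any retained rule" score so that the example stays self-contained.

```python
from itertools import combinations

# Hypothetical toy records: (drugs in treatment history, mutation change observed?)
# These values are invented for illustration and are not from the Stanford database.
RECORDS = [
    ({"drugA", "drugB"}, True),
    ({"drugA", "drugB"}, True),
    ({"drugA"}, False),
    ({"drugB", "drugC"}, True),
    ({"drugC"}, False),
    ({"drugA", "drugC"}, False),
    ({"drugB"}, False),
    ({"drugC"}, False),
]


def rule_stats(records, antecedent):
    """Support and confidence of the rule: antecedent drugs -> mutation change."""
    covered = [changed for drugs, changed in records if antecedent <= drugs]
    both = sum(covered)  # records with the antecedent AND a mutation change
    support = both / len(records)
    confidence = both / len(covered) if covered else 0.0
    return support, confidence


def screen_rules(records, min_support, min_confidence, max_size=2):
    """Keep drug combinations whose rule passes both thresholds."""
    items = sorted(set().union(*(drugs for drugs, _ in records)))
    kept = []
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            support, confidence = rule_stats(records, set(combo))
            if support >= min_support and confidence >= min_confidence:
                kept.append((combo, support, confidence))
    return kept


def auc(scores, labels):
    """AUC as the chance that a random positive outranks a random negative.

    Ties count as half a win (the Mann-Whitney formulation).
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    rules = screen_rules(RECORDS, min_support=0.25, min_confidence=0.7)
    for combo, support, confidence in rules:
        print(combo, support, confidence)

    # Stand-in for the modelling stage: score 1 if any retained rule applies.
    scores = [
        int(any(set(combo) <= drugs for combo, _, _ in rules))
        for drugs, _ in RECORDS
    ]
    labels = [changed for _, changed in RECORDS]
    print("AUC:", auc(scores, labels))
```

In practice the retained rules would serve as candidate predictors for a fitted model rather than as a score in their own right; the sketch only shows how a frequency-based screen narrows the candidates before a separate evaluation step.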

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Anti-Retroviral Agents / therapeutic use
  • Area Under Curve
  • Computational Biology / methods*
  • DNA Mutational Analysis
  • Databases, Factual
  • Databases, Genetic*
  • Genotype
  • HIV / genetics*
  • HIV Infections / genetics*
  • HIV Infections / therapy*
  • HIV Seropositivity / genetics
  • HIV Seropositivity / therapy
  • Humans
  • Models, Statistical
  • Mutation*
  • Regression Analysis
  • Time Factors

Substances

  • Anti-Retroviral Agents