A pilot study on the application of statistical classification procedures to molecular epidemiological data

Toxicol Lett. 2004 Jun 15;151(1):291-9. doi: 10.1016/j.toxlet.2004.02.021.


The development of new statistical methods for use in molecular epidemiology includes the construction and application of appropriate classification rules. The aim of this study was to assess various classification methods that can potentially handle genetic interactions. A data set comprising genotypes at 25 single nucleotide polymorphism (SNP) loci from 518 breast cancer cases and 586 age-matched population-based controls from the GENICA study was used to build classification rules with the following discrimination methods: support vector machine (SVM), classification and regression tree (CART), Bagging, Random Forest, LogitBoost and k-nearest neighbours (kNN). A blind pilot analysis of the genotypic data set served as a first approach to gaining an impression of the statistical structure of the data. This analysis was also performed to explore classification methods that may be applied to molecular-epidemiological evaluation. The results showed that all blindly applied classification methods achieved a misclassification rate only slightly lower than that of a random classification. The findings nevertheless suggest that SNP data might be useful for classifying individuals into categories of high or low disease risk.
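The comparison described above, a classifier's misclassification rate set against a trivial reference rate, can be illustrated with a minimal sketch. It is not the study's analysis: the genotypes are simulated at random with no built-in association, the coding of each SNP as 0, 1 or 2, the Hamming distance, the choice of k = 5 and the five-fold cross-validation are all assumptions, and every function name is hypothetical. Only kNN is shown, in pure Python. The abstract does not say how "random classification" was defined; as one possible reference, always predicting the larger class with 518 cases and 586 controls would misclassify 518/1104, or about 46.9%, of individuals.

```python
import random
from collections import Counter


def hamming(a, b):
    """Number of loci at which two genotype vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_predict(train_X, train_y, x, k=5):
    """Majority vote among the k training samples closest to x."""
    order = sorted(range(len(train_X)), key=lambda i: hamming(train_X[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def cv_misclassification(X, y, k=5, folds=5, seed=0):
    """Estimate the kNN misclassification rate by cross-validation."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    errors = 0
    for f in range(folds):
        test = idx[f::folds]  # every folds-th shuffled index forms one fold
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        errors += sum(knn_predict(train_X, train_y, X[i], k) != y[i] for i in test)
    return errors / len(X)


def baseline_misclassification(y):
    """Error rate of always predicting the most frequent class."""
    return 1 - Counter(y).most_common(1)[0][1] / len(y)


def simulate_genotypes(n_cases, n_controls, n_loci, seed=0):
    """Purely synthetic genotypes (0, 1 or 2 per locus), unrelated to the labels."""
    rng = random.Random(seed)
    X = [[rng.choice((0, 1, 2)) for _ in range(n_loci)]
         for _ in range(n_cases + n_controls)]
    y = ["case"] * n_cases + ["control"] * n_controls
    return X, y


if __name__ == "__main__":
    # Small synthetic sample so the sketch runs quickly; not the GENICA data.
    X, y = simulate_genotypes(n_cases=52, n_controls=59, n_loci=25)
    print("kNN cross-validated error:", cv_misclassification(X, y))
    print("majority-class baseline error:", baseline_misclassification(y))
```

Because the simulated genotypes carry no signal, the kNN error here should hover around the baseline. On real SNP data, the gap between the two numbers is the quantity of interest.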

Publication types

  • Comparative Study
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Breast Neoplasms / epidemiology
  • Breast Neoplasms / genetics
  • Case-Control Studies
  • Data Interpretation, Statistical*
  • Discriminant Analysis
  • Female
  • Humans
  • Molecular Epidemiology / methods*
  • Pilot Projects
  • Polymorphism, Genetic
  • Polymorphism, Single Nucleotide / genetics