Background: This article describes the data mining analysis of a clinical exposure study of 3585 adult smokers and 1077 nonsmokers. The analysis focused on developing models for four biomarkers of potential harm (BOPH): white blood cell count (WBC), 24 h urine 8-epi-prostaglandin F2alpha (EPI8), 24 h urine 11-dehydro-thromboxane B2 (DEH11), and high-density lipoprotein cholesterol (HDL).
Methods: Random Forest was used for initial variable selection and Multivariate Adaptive Regression Spline was used for developing the final statistical models
Results: The analysis resulted in the generation of models that predict each of the BOPH as function of selected variables from the smokers and nonsmokers. The statistically significant variables in the models were: platelet count, hemoglobin, C-reactive protein, triglycerides, race and biomarkers of exposure to cigarette smoke for WBC (R-squared = 0.29); creatinine clearance, liver enzymes, weight, vitamin use and biomarkers of exposure for EPI8 (R-squared = 0.41); creatinine clearance, urine creatinine excretion, liver enzymes, use of Non-steroidal antiinflammatory drugs, vitamins and biomarkers of exposure for DEH11 (R-squared = 0.29); and triglycerides, weight, age, sex, alcohol consumption and biomarkers of exposure for HDL (R-squared = 0.39).
Conclusions: Levels of WBC, EPI8, DEH11 and HDL were statistically associated with biomarkers of exposure to cigarette smoking and demographics and life style factors. All of the predictors together explain 29%-41% of the variability in the BOPH.