Fusing Data Mining, Machine Learning and Traditional Statistics to Detect Biomarkers Associated with Depression

Joanna F Dipnall; Julie A Pasco; Michael Berk; Lana J Williams; Seetal Dodd; Felice N Jacka; Denny Meyer

doi:10.1371/journal.pone.0148195

PLoS One. 2016 Feb 5;11(2):e0148195. doi: 10.1371/journal.pone.0148195. eCollection 2016.

Authors

Joanna F Dipnall¹,²; Julie A Pasco¹,³,⁴,⁵; Michael Berk¹,⁵,⁶,⁷,⁸; Lana J Williams¹; Seetal Dodd¹,⁵,⁶; Felice N Jacka¹,⁶,⁹,¹⁰; Denny Meyer²

Affiliations

¹ IMPACT Strategic Research Centre, School of Medicine, Deakin University, Geelong, VIC, Australia.
² Department of Statistics, Data Science and Epidemiology, Swinburne University of Technology, Melbourne, VIC, Australia.
³ Department of Medicine, The University of Melbourne, St Albans, VIC, Australia.
⁴ Department of Epidemiology and Preventive Medicine, Monash University, Melbourne, VIC, Australia.
⁵ University Hospital Geelong, Barwon Health, Geelong, VIC, Australia.
⁶ Department of Psychiatry, The University of Melbourne, Parkville, VIC, Australia.
⁷ Florey Institute of Neuroscience and Mental Health, Parkville, VIC, Australia.
⁸ Orygen, the National Centre of Excellence in Youth Mental Health, Parkville, VIC, Australia.
⁹ Centre for Adolescent Health, Murdoch Children's Research Institute, Melbourne, Australia.
¹⁰ Black Dog Institute, Sydney, Australia.

Abstract

Background: Atheoretical large-scale data mining techniques using machine learning algorithms have promise in the analysis of large epidemiological datasets. This study illustrates the use of a hybrid methodology for variable selection that took account of missing data and complex survey design to identify key biomarkers associated with depression from a large epidemiological study.

Methods: The study used a three-step methodology amalgamating multiple imputation, a machine learning boosted regression algorithm and logistic regression to identify key biomarkers associated with depression in the National Health and Nutrition Examination Survey (2009-2010). Depression was measured using the Patient Health Questionnaire-9 and 67 biomarkers were analysed. Covariates in this study included gender, age, race, smoking, food security, Poverty Income Ratio, Body Mass Index, physical activity, alcohol use, medical conditions and medications. The final imputed weighted multiple logistic regression model included possible confounders and moderators.
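Fitting a logistic regression across multiple imputed data sets requires the per-data-set estimates to be combined into a single coefficient and variance. The abstract does not state the pooling rule, so the sketch below assumes the standard Rubin's rules. The function name `pool_rubin` and all input numbers are hypothetical illustrations, not values from the study, and the sketch ignores the survey-design adjustments that the study applied.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one coefficient across m imputed data sets (Rubin's rules).

    estimates: per-imputation point estimates (e.g. log-odds coefficients)
    variances: per-imputation squared standard errors
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching lengths")
    q_bar = mean(estimates)          # pooled point estimate
    within = mean(variances)         # average within-imputation variance
    between = variance(estimates)    # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return q_bar, total


# Hypothetical log-odds coefficients and squared SEs from five imputations
# (made-up numbers for illustration only).
betas = [0.10, 0.14, 0.12, 0.16, 0.13]
ses_sq = [0.004, 0.005, 0.004, 0.006, 0.005]
pooled_beta, pooled_var = pool_rubin(betas, ses_sq)
```

The total variance exceeds the average within-imputation variance whenever the imputations disagree, which is how the uncertainty from the missing data reaches the final confidence interval.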

Results: After the creation of 20 imputation data sets from multiple chained regression sequences, machine learning boosted regression initially identified 21 biomarkers associated with depression. Using traditional logistic regression methods, including controlling for possible confounders and moderators, a final set of three biomarkers was selected. The final three biomarkers from the novel hybrid variable selection methodology were red cell distribution width (OR 1.15; 95% CI 1.01, 1.30), serum glucose (OR 1.01; 95% CI 1.00, 1.01) and total bilirubin (OR 0.12; 95% CI 0.05, 0.28). Significant interactions were found between total bilirubin and both the Mexican American/Hispanic group (p = 0.016) and current smokers (p<0.001).
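The odds ratios above are exponentiated logistic regression coefficients, so they can be mapped back to the log-odds scale. The sketch below assumes a normal-approximation (Wald) interval, which may differ from the design-based interval the study computed. The function names are hypothetical, and the units of each biomarker are not stated in the abstract.

```python
import math

Z95 = 1.959963984540054  # two-sided 95% normal quantile


def odds_ratio_ci(beta, se, z=Z95):
    """Odds ratio and Wald interval from a log-odds coefficient and its SE."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


def log_scale_from_ci(odds_ratio, lower, upper, z=Z95):
    """Approximate the coefficient and SE implied by a reported OR and CI."""
    beta = math.log(odds_ratio)
    se = (math.log(upper) - math.log(lower)) / (2 * z)
    return beta, se


def or_for_change(or_per_unit, delta):
    """Odds ratio for a change of `delta` units, given a per-unit odds ratio."""
    return or_per_unit ** delta


# Red cell distribution width as reported: OR 1.15 (95% CI 1.01, 1.30).
beta, se = log_scale_from_ci(1.15, 1.01, 1.30)
or_, lo, hi = odds_ratio_ci(beta, se)

# A per-unit OR close to 1 compounds over larger changes in the biomarker,
# for example serum glucose (OR 1.01 per unit) over a 10-unit difference.
glucose_or_10 = or_for_change(1.01, 10)
```

Because the reported figures are rounded, the reconstructed interval only approximates the published one. The sketch is a reading aid, not a reanalysis.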

Conclusion: The systematic use of a hybrid methodology for variable selection, fusing data mining techniques using a machine learning algorithm with traditional statistical modelling, accounted for missing data and complex survey sampling methodology and was demonstrated to be a useful tool for detecting three biomarkers associated with depression for future hypothesis generation: red cell distribution width, serum glucose and total bilirubin.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Adolescent
Adult
Aged
Aged, 80 and over
Algorithms
Bilirubin / blood*
Biomarkers / analysis
Biomarkers / blood
Blood Glucose / physiology*
Data Mining / methods*
Depressive Disorder / blood*
Depressive Disorder / pathology
Depressive Disorder / psychology
Erythrocyte Indices / physiology*
Female
Humans
Logistic Models
Machine Learning*
Male
Middle Aged
Multivariate Analysis
Nutrition Surveys
Young Adult

Substances

Biomarkers
Blood Glucose
Bilirubin

Grants and funding

Michael Berk is supported by an NHMRC Senior Principal Research Fellowship 1059660 and Lana J Williams is supported by an NHMRC Career Development Fellowship 1064272. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.