Extending Classification Algorithms to Case-Control Studies

Bryan Stanfill; Sarah Reehl; Lisa Bramer; Ernesto S Nakayasu; Stephen S Rich; Thomas O Metz; Marian Rewers; Bobbie-Jo Webb-Robertson; TEDDY Study Group

doi:10.1177/1179597219858954

Biomed Eng Comput Biol. 2019 Jul 15;10:1179597219858954. doi: 10.1177/1179597219858954. eCollection 2019.

Authors

Bryan Stanfill¹, Sarah Reehl¹, Lisa Bramer¹, Ernesto S Nakayasu², Stephen S Rich³, Thomas O Metz², Marian Rewers⁴, Bobbie-Jo Webb-Robertson²; TEDDY Study Group

Affiliations

¹ Computing and Analytics Division, National Security Directorate, Pacific Northwest National Laboratory, Richland, WA, USA.
² Biological Sciences Division, Earth and Biological Sciences Directorate, Pacific Northwest National Laboratory, Richland, WA, USA.
³ Center for Public Health Genomics, University of Virginia, Charlottesville, VA, USA.
⁴ Barbara Davis Center for Childhood Diabetes, University of Colorado Denver, Aurora, CO, USA.

Abstract

Classification is a common technique applied to 'omics data to build predictive models and identify potential markers of biomedical outcomes. Despite the prevalence of case-control studies, very few classification methods are available to analyze the data such studies generate. Conditional logistic regression is the most commonly used technique, but its modeling assumptions limit its ability to identify a large class of sufficiently complicated 'omic signatures. We propose a data preprocessing step that makes any linear or nonlinear classification algorithm, even one not typically appropriate for matched-design data, usable for modeling case-control data and identifying relevant biomarkers in these study designs. We demonstrate on simulated case-control data that both the classification accuracy and the variable selection accuracy of each method improve after applying this preprocessing step, and that the proposed methods are comparable to or outperform existing variable selection methods. Finally, we demonstrate the impact of conditional classification algorithms on a large cohort study of children with islet autoimmunity.

Keywords: Diabetes; biomarker discovery; machine learning; support vector machines; variable selection.
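
The abstract does not spell out the preprocessing step itself. As an illustration only, the sketch below assumes the step removes matched-set effects by centering every feature within its matched set, so that a standard classifier sees only within-set differences. This assumption, the helper name `center_within_strata`, and the toy data are all hypothetical and are not taken from the paper.

```python
import numpy as np

def center_within_strata(X, strata):
    """Subtract each matched set's feature means from its members.

    Hypothetical helper: an assumed form of the preprocessing step,
    not the authors' confirmed method.
    """
    X = np.asarray(X, dtype=float)
    strata = np.asarray(strata)
    X_centered = np.empty_like(X)
    for s in np.unique(strata):
        members = strata == s
        # Remove the shared (matched-set) level of every feature.
        X_centered[members] = X[members] - X[members].mean(axis=0)
    return X_centered

# Toy 1:1 matched data: 3 case-control pairs, 2 features.
# Feature 0 is always 2 units higher in the case, but the pair-level
# baseline varies so much that the raw values do not separate the groups.
X = np.array([[5.0, 1.0], [3.0, 1.0],
              [12.0, 0.0], [10.0, 2.0],
              [-1.0, 4.0], [-3.0, 4.0]])
strata = np.array([0, 0, 1, 1, 2, 2])   # matched-set identifier
y = np.array([1, 0, 1, 0, 1, 0])        # 1 = case, 0 = control

X_c = center_within_strata(X, strata)
```

After centering, feature 0 takes the same positive value for every case and the same negative value for every control, so any off-the-shelf classifier could be fit to `X_c` and `y` without modeling the pairing explicitly.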
