Inference and Prediction Diverge in Biomedicine

Patterns (N Y). 2020 Oct 8;1(8):100119. doi: 10.1016/j.patter.2020.100119. eCollection 2020 Nov 13.

Abstract

In the 20th century, many advances in biological knowledge and evidence-based medicine were supported by p values and accompanying methods. In the early 21st century, ambitions toward precision medicine place a premium on detailed predictions for single individuals. The shift causes tension between traditional regression methods used to infer statistically significant group differences and burgeoning predictive analysis tools suited to forecast an individual's future. Our comparison applies linear models for identifying significant contributing variables and for finding the most predictive variable sets. In systematic data simulations and common medical datasets, we explored how variables identified as significantly relevant and variables identified as predictively relevant can agree or diverge. Across analysis scenarios, even small predictive performances typically coincided with finding underlying significant statistical relationships, but not vice versa. More complete understanding of different ways to define "important" associations is a prerequisite for reproducible research and advances toward personalizing medical care.
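The asymmetry described above, where a relationship can be statistically significant yet contribute little predictive performance, can be illustrated with a minimal sketch. This is not the authors' analysis pipeline. It assumes a single predictor with a weak true effect, a simple least-squares fit, a normal approximation to the two-sided p value (reasonable only for large samples), and out-of-sample R² as the measure of predictive performance. The effect size, sample sizes, seeds, and function names are hypothetical choices made for illustration.

```python
import math
import random


def simulate(n, beta, seed):
    """Draw n samples from y = beta * x + noise, with standard normal x and noise."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [beta * xi + rng.gauss(0.0, 1.0) for xi in x]
    return x, y


def fit_simple_ols(x, y):
    """Least-squares fit of y on one predictor; returns slope, intercept, p value."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    t_stat = slope / std_err
    # Assumption: normal approximation to the t distribution (large n only).
    p_value = math.erfc(abs(t_stat) / math.sqrt(2.0))
    return slope, intercept, p_value


def out_of_sample_r2(slope, intercept, x_test, y_test):
    """R^2 of the fitted line evaluated on data not used for fitting."""
    mean_y = sum(y_test) / len(y_test)
    ss_res = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x_test, y_test))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y_test)
    return 1.0 - ss_res / ss_tot


# Hypothetical setting: weak true effect, fairly large sample.
x_train, y_train = simulate(n=2000, beta=0.15, seed=0)
x_test, y_test = simulate(n=2000, beta=0.15, seed=1)

slope, intercept, p_value = fit_simple_ols(x_train, y_train)
r2_test = out_of_sample_r2(slope, intercept, x_test, y_test)

print(f"slope estimate:     {slope:.3f}")
print(f"two-sided p value:  {p_value:.2e}")
print(f"out-of-sample R^2:  {r2_test:.3f}")
```

Under these assumptions the slope is expected to pass a conventional 0.05 significance threshold by a wide margin, while the fitted line explains only a small fraction of the variance in held-out data. Significance here reflects confidence that a group-level effect exists, whereas out-of-sample R² reflects how well an individual outcome can be forecast.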

Keywords: data science; explainable AI; reproducibility; scientific discovery; variable importance.