Framework for making better predictions by directly estimating variables' predictivity

Adeline Lo; Herman Chernoff; Tian Zheng; Shaw-Hwa Lo

doi:10.1073/pnas.1616647113

Proc Natl Acad Sci U S A. 2016 Dec 13;113(50):14277-14282. doi: 10.1073/pnas.1616647113. Epub 2016 Nov 29.

Authors

Adeline Lo¹, Herman Chernoff², Tian Zheng³, Shaw-Hwa Lo³

Affiliations

¹ Department of Politics, Princeton University, Princeton, NJ 08540.
² Department of Statistics, Harvard University, Cambridge, MA 02138; chernoff@stat.harvard.edu.
³ Department of Statistics, Columbia University, New York, NY 10027; slo@stat.columbia.edu, tz33@columbia.edu.

Abstract

We propose approaching prediction from a framework that treats the theoretical correct prediction rate of a variable set as the parameter of interest. This framework lets us define a measure of predictivity with which variable sets can be assessed and those with high predictivity preferred. We first define the prediction rate for a variable set and then consider, and ultimately reject, the naive estimator, a statistic based on the observed sample data, because of its inflated bias at moderate sample sizes and its sensitivity to noisy, useless variables. We demonstrate that the I-score of the partition retention (PR) method of variable selection (VS) yields a relatively unbiased estimate of a parameter that is not sensitive to noisy variables and is a lower bound to the parameter of interest. Thus, the PR method using the I-score provides an effective approach to selecting highly predictive variables. We offer simulations and an application of the I-score to real data to demonstrate the statistic's predictive performance on sample data. We conjecture that using PR and the I-score can aid in finding variable sets with promising prediction rates; however, further research on sample-based measures of predictivity is much needed.
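
The abstract does not state the formula for the score used by the partition retention method (commonly called the I-score), so the following is only a minimal sketch under an assumed form: observations are partitioned into cells by their joint values on a set of discrete variables, and the score is the sum over cells of n_j^2 (ybar_j - ybar)^2, divided by n times the sample variance of y. The normalization, the function name `i_score`, and the toy data are all assumptions made for illustration, not the authors' implementation.

```python
from collections import defaultdict
from itertools import product


def i_score(rows, y, subset):
    """Assumed I-score form: sum_j n_j^2 (ybar_j - ybar)^2 / (n * s2).

    rows   : list of tuples of discrete variable values
    y      : list of numeric outcomes, same length as rows
    subset : indices of the variables that define the partition
    """
    n = len(y)
    y_bar = sum(y) / n
    s2 = sum((v - y_bar) ** 2 for v in y) / n  # variance of y (divisor n)
    if s2 == 0:
        return 0.0  # constant outcome: nothing to predict
    # Partition observations by their joint values on the chosen variables.
    cells = defaultdict(list)
    for row, v in zip(rows, y):
        cells[tuple(row[k] for k in subset)].append(v)
    # Large cells whose mean departs from the overall mean contribute most.
    total = sum(
        len(c) ** 2 * (sum(c) / len(c) - y_bar) ** 2 for c in cells.values()
    )
    return total / (n * s2)


# Hypothetical toy data: y depends on x0 XOR x1; x2 is pure noise.
# Every (x0, x1, x2) combination appears 5 times, so n = 40.
rows = [combo for combo in product((0, 1), repeat=3) for _ in range(5)]
y = [x0 ^ x1 for (x0, x1, x2) in rows]

score_signal = i_score(rows, y, (0, 1))          # the two informative variables
score_single = i_score(rows, y, (0,))            # one variable alone
score_with_noise = i_score(rows, y, (0, 1, 2))   # informative pair plus noise

print(score_signal, score_single, score_with_noise)
```

On this toy data the informative pair scores 10.0, a single variable from the pair scores 0.0 because the effect is a pure interaction, and adding the noise variable halves the score to 5.0. Under the assumed form, splitting cells on a useless variable lowers the score, which is the kind of behavior the abstract contrasts with the naive estimator's sensitivity to noisy variables.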

Keywords: high-dimensional data; prediction; predictivity; variable selection.

Publication types

Research Support, U.S. Gov't, Non-P.H.S.