A supervised approach for predicting patient survival with gene expression data

Proc IEEE Int Symp Bioinformatics Bioeng. 2010;2010(5521718):26-31. doi: 10.1109/BIBE.2010.14.

Abstract

Rapid developments in genomics in recent years have allowed the simultaneous measurement of the expression levels of thousands of genes using DNA microarrays. This has offered tremendous potential for growth in our understanding of the pathophysiology of many diseases. When microarray studies also contain information about an outcome variable such as time to an event or death, one of the goals of an investigator is to understand how the expression levels of genes (covariates) relate to the time-to-event (referred to as survival time) in the course of a disease. In this article, we consider the case where the number of covariates, p, exceeds the number of observations, N, a setting typical of microarray gene expression data. For a given vector of responses representing survival times of N subjects and the corresponding p × N gene expression matrix, we examine the problem of predicting the survival probability when N ≪ p. This is an ill-conditioned problem, further compounded by the presence of possibly censored survival times. We propose a model that combines the partial least squares approach for dimensionality reduction with the accelerated failure time model, a widely used log-linear model for linking censored survival time to covariates. We develop parametric methods to account for censoring as well as for predicting patient survival probabilities. We illustrate the applicability of our methods using cancer microarray data and explore the biological relevance of our results using pathway analysis. Finally, we evaluate the performance of our methods using extensive simulation studies.
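As a rough illustration of the two ingredients named above, the following minimal sketch computes a first partial least squares component and evaluates a survival probability under an accelerated failure time model. It is not the authors' implementation. The function names are hypothetical, the log-normal error distribution is an assumption (the abstract does not say which parametric family is used), the data are stored as N subject rows of p genes (the transpose of the p × N matrix), and model fitting and the handling of censoring are left out entirely.

```python
import math


def pls_first_component(X, y):
    """Return (weights, scores) for the first PLS component.

    X: list of N rows (subjects), each with p centered expression values.
    y: list of N centered responses, e.g. log survival times.
    Assumes X^T y is not the zero vector (otherwise the norm is 0).
    """
    p = len(X[0])
    # The weight vector is proportional to X^T y, i.e. each gene is
    # weighted by its covariance with the response.
    w = [sum(row[j] * yi for row, yi in zip(X, y)) for j in range(p)]
    norm = math.sqrt(sum(wj * wj for wj in w))
    w = [wj / norm for wj in w]
    # Scores are the projections of each subject onto the weight vector.
    scores = [sum(xj * wj for xj, wj in zip(row, w)) for row in X]
    return w, scores


def lognormal_aft_survival(t, eta, sigma):
    """Survival probability S(t) = 1 - Phi((log t - eta) / sigma).

    Assumes the AFT model log T = eta + sigma * eps with eps ~ N(0, 1),
    where eta is the linear predictor (here it would be built from PLS
    scores) and sigma > 0 is the scale parameter.
    """
    z = (math.log(t) - eta) / sigma
    # 1 - Phi(z) written with the complementary error function.
    return 0.5 * math.erfc(z / math.sqrt(2.0))
```

In this sketch the linear predictor `eta` for a subject would be an intercept plus a coefficient times that subject's PLS score. Those coefficients and `sigma` would have to be estimated by maximizing a likelihood that accounts for censored observations, which is not shown here.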