Background: Identification of prognostic gene expression markers from clinical cohorts might help to better understand disease etiology. A set of potentially important markers can be automatically selected when linking gene expression covariates to a clinical endpoint by multivariable regression models and regularized parameter estimation. However, such selection is hampered by instability, because the markers are chosen from a large number of measurements. Stability can be assessed by resampling techniques, which might guide modeling decisions, such as the choice of the model class or the specific endpoint definition.
Methods: We specifically propose a strategy for judging the impact of different endpoint definitions, endpoint updates, different approaches for marker selection, and exclusion of outliers. This strategy is illustrated for a study of patients with end-stage renal disease, who experience a yearly mortality of more than 20%, with almost 50% of deaths due to sudden cardiac death or myocardial infarction. The underlying etiology is poorly understood, and we specifically point out how our strategy can help to identify novel prognostic markers and targets for therapeutic interventions.
Results: For markers such as the potentially prognostic platelet glycoprotein IIb, the endpoint definition, in combination with the signature-building approach, is seen to have the largest impact. Removal of outliers, as identified by the proposed strategy, is also seen to considerably improve stability.
Conclusions: As the proposed strategy allowed us to precisely quantify the impact of modeling choices on the stability of marker identification, we suggest its routine use in other applications as well, to avoid analysis-specific results that are unstable, i.e., not reproducible.
Keywords: Clinical endpoint; Outlier; Prognostic signature; Stability.