Data reduction for prediction: a case study on robust coding of age and family history for the risk of having a genetic mutation

Stat Med. 2007 Dec 30;26(30):5545-56. doi: 10.1002/sim.3119.


Data reduction is often desired in the development of a prediction model, for example for effects of age and family history in the identification of subjects having a genetic mutation. We aimed to evaluate a strategy for model simplification by robust coding of related predictors. We considered 898 patients suspected of having Lynch syndrome, which is caused primarily by mutations in the mismatch repair genes MLH1 or MSH2. The presence of colorectal cancer (CRC) and endometrial cancer in patients and their relatives was related to mutation prevalence with logistic regression analysis. The performance of simplified and more complex models was quantified with a concordance statistic (c), which was corrected for optimism by cross-validation and bootstrapping. External validation was performed in 1016 patients. The first challenge was the coding of age at diagnosis of CRC, where we forced effects to be identical in patients, in 1st degree relatives and in 2nd degree relatives by taking the sum of the ages at diagnosis. As a further simplification, CRC diagnosis in 2nd degree relatives was weighted half that of 1st degree relatives. These data reduction approaches were also followed for endometrial cancer. The simplified model used 7 degrees of freedom (df), compared with 17 df for a more complex model incorporating individual predictor effects. The optimism-corrected c was higher for the simplified model (0.79 instead of 0.77), but the external c was similar (0.78 for both the simplified and the more complex model). A stepwise selected model performed slightly worse (external c=0.77). In conclusion, a prediction model could be developed with relatively few df that captured effects of age at diagnosis across patients and relatives per type of cancer in the family. Such robust coding may especially be relevant for modeling in relatively small data sets.
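The coding strategy described above can be illustrated with a minimal sketch. It is not the authors' implementation: the class and function names are hypothetical, the weight of 1.0 for the patient is an assumption (the abstract only states that 2nd degree relatives count half as much as 1st degree relatives), and leaving the summed age unweighted is likewise an assumption. The ages in the usage example are made up for illustration.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

# Assumed weights. Only the 1:0.5 ratio between 1st and 2nd degree
# relatives comes from the abstract; the patient weight is an assumption.
WEIGHT_PATIENT = 1.0
WEIGHT_FIRST_DEGREE = 1.0
WEIGHT_SECOND_DEGREE = 0.5


@dataclass
class CancerHistory:
    """Ages at diagnosis of one cancer type (e.g. CRC) within one family."""
    patient_ages: Sequence[float]
    first_degree_ages: Sequence[float]
    second_degree_ages: Sequence[float]


def reduced_predictors(history: CancerHistory) -> Tuple[float, float]:
    """Collapse per-person predictors into two family-level predictors.

    Returns (weighted_count, age_sum):
      weighted_count: number of diagnoses, 2nd degree relatives counted half
      age_sum: sum of ages at diagnosis over patient and relatives, so that
               a single regression coefficient applies to all of them
               (left unweighted here, which is an assumption of this sketch)
    """
    weighted_count = (
        WEIGHT_PATIENT * len(history.patient_ages)
        + WEIGHT_FIRST_DEGREE * len(history.first_degree_ages)
        + WEIGHT_SECOND_DEGREE * len(history.second_degree_ages)
    )
    age_sum = float(
        sum(history.patient_ages)
        + sum(history.first_degree_ages)
        + sum(history.second_degree_ages)
    )
    return weighted_count, age_sum


# Illustrative family: patient diagnosed at 45, two 1st degree relatives
# at 50 and 60, one 2nd degree relative at 70 (made-up values).
example = CancerHistory(
    patient_ages=[45],
    first_degree_ages=[50, 60],
    second_degree_ages=[70],
)
print(reduced_predictors(example))  # (3.5, 225.0)
```

Two predictors per cancer type would then replace separate presence and age terms for the patient, 1st degree relatives and 2nd degree relatives, which is the kind of reduction in degrees of freedom the abstract reports.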

MeSH terms

  • Adaptor Proteins, Signal Transducing / genetics
  • Age Factors*
  • Aged
  • Aged, 80 and over
  • Colorectal Neoplasms, Hereditary Nonpolyposis / epidemiology
  • Colorectal Neoplasms, Hereditary Nonpolyposis / genetics
  • DNA Mismatch Repair
  • Family Health
  • Genetic Carrier Screening / methods
  • Genetic Predisposition to Disease / etiology
  • Genetic Testing / methods*
  • Humans
  • Likelihood Functions
  • Logistic Models
  • Middle Aged
  • MutL Protein Homolog 1
  • MutL Proteins
  • Mutation*
  • Neoplasm Proteins / genetics
  • Nuclear Proteins / genetics
  • Pedigree*
  • Predictive Value of Tests*
  • Reproducibility of Results
  • Risk Factors


Substances

  • Adaptor Proteins, Signal Transducing
  • MLH1 protein, human
  • Neoplasm Proteins
  • Nuclear Proteins
  • PMS1 protein, human
  • MutL Protein Homolog 1
  • MutL Proteins