Background and objectives: The performance of a prediction model is usually worse in external validation data compared to the development data. We aimed to determine at which effective sample sizes (i.e., number of events) relevant differences in model performance can be detected with adequate power.
Methods: We used a logistic regression model to predict the probability that residual masses of patients treated for metastatic testicular cancer contained only benign tissue. We performed standard power calculations and Monte Carlo simulations to estimate the numbers of events that are required to detect several types of model invalidity with 80% power at the 5% significance level.
Results: A validation sample with 111 events was required to detect that a model predicted too high probabilities, when predictions were on average 1.5 times too high on the odds scale. A decrease in discriminative ability of the model, indicated by a decrease in the c-statistic from 0.83 to 0.73, required 81 to 106 events, depending on the specific scenario.
Conclusion: We suggest a minimum of 100 events and 100 nonevents for external validation samples. Specific hypotheses may, however, require substantially higher effective sample sizes to obtain adequate power.