Objective: Despite guidelines recommending the use of formal tests of interaction in subgroup analyses in clinical trials, inappropriate subgroup-specific analyses continue. Moreover, trials designed to detect overall treatment effects have limited power to detect treatment-subgroup interactions. This article quantifies the error rates associated with subgroup analyses.
Study design and setting: Simulations quantified the risks of misinterpreting subgroup analyses as evidence of differential subgroup effects and the limited power of the interaction test in trials designed to detect overall treatment effects.
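The kind of simulation described above can be sketched as follows. This is our own minimal illustration under stated assumptions (normal outcomes with known unit variance, two equal subgroups, a z-test per subgroup, hypothetical values for the per-subgroup sample size and effect size), not the article's actual simulation code:

```python
import random
from statistics import NormalDist

# Monte Carlo sketch: a two-arm trial with a treatment effect that is
# IDENTICAL in two equal subgroups, analyzed with a separate z-test per
# subgroup. We count how often exactly one subgroup reaches p < 0.05 --
# the pattern commonly misread as evidence of a differential subgroup effect.
random.seed(1)
z_crit = NormalDist().inv_cdf(0.975)  # two-sided 5% level
n = 50                                # hypothetical patients per arm per subgroup
effect = 0.4                          # standardized effect, same in both subgroups

def subgroup_significant() -> bool:
    """Simulate one subgroup and z-test treatment vs. control (known variance 1)."""
    treat = [random.gauss(effect, 1) for _ in range(n)]
    ctrl = [random.gauss(0, 1) for _ in range(n)]
    diff = sum(treat) / n - sum(ctrl) / n
    se = (2 / n) ** 0.5
    return abs(diff / se) > z_crit

trials = 5000
# XOR: count runs where exactly one of the two subgroups is significant
one_only = sum(subgroup_significant() != subgroup_significant()
               for _ in range(trials)) / trials
print(f"Significant in exactly one subgroup: {one_only:.0%}")
```

Even though the true effect is identical in both subgroups, a substantial fraction of simulated trials show significance in exactly one subgroup, which is the misinterpretation risk the study quantifies.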
Results: Although formal interaction tests performed as expected with respect to false positives, subgroup-specific tests were considerably less reliable: a significant effect in only one subgroup was observed in 7% to 64% of simulations, depending on trial characteristics. Regarding the power of the interaction test, a trial with 80% power for the overall effect had only 29% power to detect an interaction effect of the same magnitude. To detect an interaction of this size with the same power as the overall effect, the sample size must be inflated fourfold, and the required inflation grows dramatically for interactions smaller than 20% of the overall effect.
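The 29% power and fourfold inflation figures follow from a standard analytic argument: with two equal-sized subgroups, the interaction estimate has twice the standard error of the overall treatment-effect estimate, so its z-value is halved and the required sample size is quadrupled. A minimal check of that arithmetic (our illustration, assuming a two-sided 5% test and normal test statistics):

```python
from statistics import NormalDist

nd = NormalDist()
z_alpha = nd.inv_cdf(1 - 0.05 / 2)  # ~1.96, two-sided 5% level
z_beta = nd.inv_cdf(0.80)           # ~0.84, for 80% power

# Standardized overall effect that yields 80% power at the planned sample size
z_effect = z_alpha + z_beta         # ~2.80

# Interaction of the same magnitude: SE doubles, so the z-value halves
power_interaction = nd.cdf(z_effect / 2 - z_alpha)
print(round(power_interaction, 2))  # ~0.29

# Sample size inflation to restore 80% power: squared SE ratio, 2^2
inflation = 2 ** 2
print(inflation)                    # 4
```

An interaction only 20% the size of the overall effect would multiply the required sample size by a further factor of (1/0.2)² = 25, which is why the inflation grows so dramatically for smaller interactions.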
Conclusion: Although it is generally recognized that subgroup analyses can produce spurious results, the extent of the problem may be underestimated.