Objectives: To consider methods and related evidence for evaluating bias in non-randomised intervention studies.
Data sources: Systematic reviews and methodological papers were identified from a search of electronic databases, handsearches of key medical journals and contact with experts working in the field. New empirical studies were conducted using data from two large randomised clinical trials.
Methods: Three systematic reviews and new empirical investigations were conducted. The reviews considered, in regard to non-randomised studies, (1) the existing evidence of bias, (2) the content of quality assessment tools and (3) the ways that study quality has been assessed and addressed. The empirical investigations generated non-randomised studies from two large, multicentre randomised controlled trials (RCTs) by selectively resampling trial participants according to allocated treatment, centre and period.
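The resampling idea can be illustrated with a minimal sketch on simulated data. It is not the authors' procedure or data: the trial is synthetic, the treatment is assumed to have no true effect, and the centre-specific baseline risks, sample sizes and function names (simulate_trial, arm_risk and so on) are all hypothetical. Only concurrent controls drawn from a different centre are mimicked; resampling by period (historical controls) is omitted.

```python
import random
import statistics

N_CENTRES = 10  # hypothetical number of centres


def simulate_trial(n_per_centre=400, seed=0):
    """Synthetic multicentre RCT with NO true treatment effect (assumption).

    Baseline event risk rises from 0.2 to 0.6 across centres, standing in
    for differences in case-mix between centres.
    """
    rng = random.Random(seed)
    rows = []
    for centre in range(N_CENTRES):
        base_risk = 0.2 + 0.4 * centre / (N_CENTRES - 1)
        for _ in range(n_per_centre):
            treated = rng.random() < 0.5      # randomised allocation
            event = rng.random() < base_risk  # outcome ignores treatment
            rows.append((centre, treated, event))
    return rows


def arm_risk(rows, centre, treated):
    """Observed event risk in one arm of one centre."""
    events = [e for c, t, e in rows if c == centre and t == treated]
    return sum(events) / len(events)


def randomised_estimates(rows):
    """Risk difference within each centre, using its own randomised arms."""
    return [arm_risk(rows, c, True) - arm_risk(rows, c, False)
            for c in range(N_CENTRES)]


def concurrent_control_estimates(rows):
    """Non-randomised comparisons: treated participants from centre a
    against control participants from a different centre b."""
    return [arm_risk(rows, a, True) - arm_risk(rows, b, False)
            for a in range(N_CENTRES)
            for b in range(N_CENTRES) if a != b]


trial = simulate_trial()
sd_randomised = statistics.stdev(randomised_estimates(trial))
sd_resampled = statistics.stdev(concurrent_control_estimates(trial))
print(f"SD of randomised estimates:     {sd_randomised:.3f}")
print(f"SD of non-randomised estimates: {sd_resampled:.3f}")
```

Because the simulated treatment does nothing, any spread in the non-randomised estimates beyond that of the randomised ones reflects case-mix differences between centres rather than a treatment effect.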
Results: In the systematic reviews, eight studies compared results of randomised and non-randomised studies across multiple interventions using meta-epidemiological techniques. A total of 194 tools were identified that could be or had been used to assess non-randomised studies. Sixty tools covered at least five of six pre-specified internal validity domains. Fourteen tools covered three of four core items of particular importance for non-randomised studies. Six tools were thought suitable for use in systematic reviews. Of 511 systematic reviews that included non-randomised studies, only 169 (33%) assessed study quality. Sixty-nine reviews investigated the impact of quality on study results quantitatively. The new empirical studies estimated the bias associated with non-random allocation and found that it could lead to consistent over- or underestimation of treatment effects. The bias also increased variation in results for both historical and concurrent controls, owing to haphazard differences in case-mix between groups. The biases were large enough to lead studies to conclude falsely that there were significant findings of benefit or harm. Four strategies for case-mix adjustment were evaluated: none adequately adjusted for bias in historically and concurrently controlled studies. Logistic regression on average increased bias. Propensity score methods performed better, but were not satisfactory in most situations. Detailed investigation revealed that adequate adjustment can only be achieved in the unrealistic situation when selection depends on a single factor.
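The point about selection depending on a single factor can be illustrated with a small simulation. This is a sketch under stated assumptions, not a reproduction of the four adjustment strategies evaluated here: it uses simple stratification on one measured binary factor in place of logistic regression or propensity scores, and the data, risk values and function names are hypothetical. The simulated treatment has no true effect, so any non-zero estimate is bias.

```python
import random


def simulate(n, selection_uses_unmeasured, seed=1):
    """Synthetic non-randomised study with NO true treatment effect.

    x is a measured prognostic factor, u an unmeasured one (both assumed
    binary and independent). Treatment choice depends on x, and also on u
    when selection_uses_unmeasured is True.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.random() < 0.5
        u = rng.random() < 0.5
        p_treat = 0.1 + 0.4 * x
        if selection_uses_unmeasured:
            p_treat += 0.4 * u
        treated = rng.random() < p_treat
        event = rng.random() < 0.2 + 0.3 * x + 0.3 * u  # ignores treatment
        rows.append((x, treated, event))
    return rows


def risk_difference(rows):
    """Crude risk difference: treated minus control."""
    treated = [e for _, t, e in rows if t]
    control = [e for _, t, e in rows if not t]
    return sum(treated) / len(treated) - sum(control) / len(control)


def stratified_risk_difference(rows):
    """Risk difference adjusted for x: within-stratum differences
    averaged with weights proportional to stratum size."""
    adjusted = 0.0
    for level in (False, True):
        stratum = [r for r in rows if r[0] == level]
        adjusted += len(stratum) / len(rows) * risk_difference(stratum)
    return adjusted


for label, flag in (("selection on x only", False),
                    ("selection on x and u", True)):
    data = simulate(40000, flag)
    print(f"{label}: crude {risk_difference(data):+.3f}, "
          f"adjusted for x {stratified_risk_difference(data):+.3f}")
```

In this toy setting, adjusting for the measured factor should bring the estimate close to zero only when that factor alone drives selection; once an unmeasured factor also influences who is treated, residual confounding remains after adjustment.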
Conclusions: Results of non-randomised studies sometimes, but not always, differ from results of randomised studies of the same intervention. Non-randomised studies may still give seriously misleading results when treated and control groups appear similar in key prognostic factors. Standard methods of case-mix adjustment do not guarantee removal of bias. Residual confounding may be high even when good prognostic data are available, and in some situations adjusted results may appear more biased than unadjusted results. Although many quality assessment tools exist and have been used for appraising non-randomised studies, most omit key quality domains. Healthcare policies based upon non-randomised studies or systematic reviews of non-randomised studies may need re-evaluation if the uncertainty in the true evidence base was not fully appreciated when policies were made. The inability of case-mix adjustment methods to compensate for selection bias and our inability to identify non-randomised studies that are free of selection bias indicate that non-randomised studies should only be undertaken when RCTs are infeasible or unethical. Recommendations for further research include: applying the resampling methodology in other clinical areas to ascertain whether the biases described are typical; developing or refining existing quality assessment tools for non-randomised studies; investigating how quality assessments of non-randomised studies can be incorporated into reviews and the implications of individual quality features for interpretation of a review's results; examination of the reasons for the apparent failure of case-mix adjustment methods; and further evaluation of the role of the propensity score.