Study design: Pragmatic, cross-sectional study.
Objective: To assess the interrater reliability of 3 tools used by the Cervical Overview Group (COG) for the assessment of the internal validity of randomized controlled trials (RCTs): Jadad, van Tulder, and risk of bias (RoB).
Summary of background data: For clinicians to implement evidence-based practice, they need to critically appraise health care literature. Checklists, scales, and domain-based criteria exist to evaluate the internal validity of RCTs for rehabilitation studies, but there is a lack of research reporting the reliability of existing assessment tools.
Methods: Four members of the COG with multiprofessional and methodological backgrounds independently evaluated the internal validity of 54 RCTs using pre-piloted Jadad and van Tulder reporting forms, and 18 RCTs using RoB, from June 2003 to May 2009. The κ statistic was calculated for each combination of raters and assessment tools. Standard agreement categorizations were used.
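The abstract does not specify the computation; as a minimal sketch, Cohen's κ for a pair of raters contrasts observed agreement with the agreement expected by chance from each rater's marginal rating frequencies. The ratings below are hypothetical, for illustration only (e.g., binary "adequate"/"inadequate" judgments on an item such as allocation concealment), and the interpretive bands are assumed to follow the commonly used Landis and Koch categorization:

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Cohen's kappa for two raters scoring the same items."""
    n = len(rater1)
    # Observed agreement: proportion of items rated identically.
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Chance-expected agreement from each rater's marginal frequencies.
    c1, c2 = Counter(rater1), Counter(rater2)
    p_e = sum(c1[k] * c2[k] for k in c1) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical ratings: 1 = "adequate", 0 = "inadequate"
a = [1, 1, 0, 1, 0]
b = [1, 0, 0, 1, 0]
print(round(cohens_kappa(a, b), 3))  # → 0.615, substantial on the usual scale
```

A pairwise κ such as this would be computed for each rater pair and each item, then averaged, which matches the "mean values" and "mean range" figures reported in the Results.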
Results: For Jadad, 4 of 7 items demonstrated mean κ statistics in the moderate to substantial agreement range (mean values, 0.42-0.78), as did 8 of 11 items on the van Tulder tool (mean values, 0.44-0.77). The RoB tool demonstrated moderate to substantial agreement (mean values, 0.56-0.76) on 3 of 12 items. Consistent substantial agreement was found across all assessment tools for the domain "allocation concealment": Jadad 0.69 (mean range, 0.60-0.77); van Tulder 0.77 (mean range, 0.73-0.81); RoB 0.76 (mean range, 0.65-0.88); and moderate to substantial agreement was found across 2 tools for the domain "sequence generation": van Tulder 0.53 (mean range, 0.37-0.66) and RoB 0.66 (mean range, 0.45-0.88). Other domains demonstrated slight or fair agreement.
Conclusion: Consistent interrater agreement was found across all 3 assessment tools for allocation concealment and across 2 tools for sequence generation. However, users should be aware that moderate variation exists among other items requiring more judgment. When evaluating rehabilitation RCTs, clinicians should consider the limitations of rating certain items within the selected assessment tool.