Objective: To evaluate the interrater reproducibility of scientific abstract review.
Design: Retrospective analysis.
Setting: Review for the 1991 Society of General Internal Medicine (SGIM) annual meeting.
Subjects: 426 abstracts in seven topic categories evaluated by 55 reviewers.
Measurements: Reviewers rated abstracts on a scale from 1 (poor) to 5 (excellent), both globally and on three specific dimensions: interest to the SGIM audience, quality of methods, and quality of presentation. Each abstract was reviewed by five to seven reviewers. Each reviewer's ratings on the three dimensions were summed to give that reviewer's summary score for a given abstract. The mean of all reviewers' summary scores for an abstract, the final score, was used by SGIM to select abstracts for the meeting.
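The scoring scheme described above can be sketched as follows; the ratings here are hypothetical, not data from the study.

```python
# Each reviewer rates an abstract from 1 to 5 on three dimensions:
# interest, methods, and presentation (hypothetical example data).
reviews = [
    {"interest": 4, "methods": 3, "presentation": 5},
    {"interest": 2, "methods": 3, "presentation": 3},
    {"interest": 5, "methods": 4, "presentation": 4},
]

# A reviewer's summary score is the sum of the three dimension
# ratings, so it can range from 3 to 15.
summary_scores = [
    r["interest"] + r["methods"] + r["presentation"] for r in reviews
]

# The final score is the mean of all reviewers' summary scores,
# which SGIM used to rank abstracts for the meeting.
final_score = sum(summary_scores) / len(summary_scores)

print(summary_scores)  # → [12, 8, 13]
print(final_score)     # → 11.0
```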
Results: Final scores ranged from 4.6 to 13.6 (mean = 9.9). Although 222 abstracts (52%) were accepted for publication, for 300 (70.4%) of the 426 abstracts the 95% confidence interval around the final score overlapped the acceptance threshold, so these abstracts were potentially misclassified. Only 36% of the variance in summary scores was associated with the abstract's identity and 12% with the reviewer's identity; the remainder reflected idiosyncratic reviews of individual abstracts. Global ratings were more reproducible than summary scores.
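The potential-misclassification criterion can be illustrated with a minimal sketch: an abstract counts as potentially misclassified when the 95% confidence interval around its final score overlaps the acceptance threshold. The scores and the threshold value below are hypothetical, not the study's data.

```python
import statistics

def potentially_misclassified(summary_scores, threshold, z=1.96):
    """Return True if the 95% CI around the mean of the reviewers'
    summary scores overlaps the acceptance threshold."""
    mean = statistics.mean(summary_scores)
    # Standard error of the mean from the sample standard deviation.
    sem = statistics.stdev(summary_scores) / len(summary_scores) ** 0.5
    lo, hi = mean - z * sem, mean + z * sem
    return lo <= threshold <= hi

# Hypothetical threshold near the reported mean final score.
threshold = 9.9

# Widely scattered reviews: the CI straddles the threshold.
print(potentially_misclassified([12, 8, 13, 10, 9], threshold))   # → True

# Consistently high reviews: clearly above the threshold.
print(potentially_misclassified([14, 15, 14, 15, 14], threshold)) # → False
```

Under this criterion, the less the reviewers agree, the wider the interval and the more likely an abstract near the cutoff is to be flagged.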
Conclusion: Reviewers disagreed substantially when evaluating the same abstracts. Future meeting organizers may wish to rank abstracts by global ratings and to experiment with structured review criteria and other ways to improve interrater agreement.