Studies using kappa statistics have been conducted with a varied but limited number of observers. The aim of this study was to evaluate the effect of the number of observers on kappa as a measure of observer variation. One hundred orthopedic specialists were asked to assess a random sample of ten sets of standard radiographs, drawn from 94 consecutive patients with ankle fractures. The observers were randomly allocated to four groups, which in turn were divided into subgroups with an increasing number of observers. Random subgroups of three observers yielded kappa values ranging from 0.20 to 0.64 for the Lauge-Hansen and from 0.27 to 0.90 for the Weber classification system. With an increasing number of observers per subgroup, kappa stabilized around a mean value, indicating that the sampling variation and standard error decrease. The standard error found in this study makes kappa questionable as a measure of agreement among a small number of observers. Thus, kappa values obtained for a given diagnostic tool at one department are not directly comparable with results from other departments. We conclude that kappa cannot stand alone as a simple measure of observer variation.
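Kappa corrects the observed agreement among observers for the agreement expected by chance; for more than two observers, the standard multi-rater form is Fleiss' kappa. The following is a minimal sketch of that computation, not the study's own analysis; the function name and the toy rating tables are illustrative assumptions.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa from a subjects-by-categories table of rating counts.

    counts[i][j] = number of observers who assigned subject i to category j.
    Every subject must be rated by the same number of observers.
    """
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    # Chance agreement: overall proportion of assignments per category.
    totals = [sum(row[j] for row in counts) for j in range(len(counts[0]))]
    p_j = [t / (n_subjects * n_raters) for t in totals]
    p_e = sum(p * p for p in p_j)
    # Observed agreement: pairwise agreement within each subject, averaged.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_bar = sum(p_i) / n_subjects
    return (p_bar - p_e) / (1 - p_e)

# Perfect agreement among 3 observers on 4 subjects gives kappa = 1.
print(fleiss_kappa([[3, 0], [3, 0], [0, 3], [0, 3]]))  # → 1.0
```

The study's point can be seen in this form: the numerator depends on the sampled observers' ratings, so with few observers per subgroup the estimate carries a large sampling variation, which shrinks as the number of observers grows.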