Histologic grading systems are used to guide diagnosis, therapy, and audit on an international basis. The reproducibility of grading systems is usually tested within small groups of pathologists who have previously worked or trained together. This may underestimate the international variation of scoring systems. We therefore evaluated the reproducibility of an established system, the Banff classification of renal allograft pathology, throughout Europe. We also sought to improve reproducibility by providing individual feedback after each of 14 small groups of cases. Kappa values for all features studied were lower than any previously published, confirming that international variation is greater than interobserver variation as previously assessed. A prolonged attempt to improve reproducibility, using numeric or graphical feedback, failed to produce any detectable improvement. We then asked participants to grade selected photographs, to eliminate variation induced by pathologists viewing different areas of the slide. This produced improved kappa values only for some features. Improvement was influenced by the nature of the grade definitions. Definitions based on "area affected" by a process were not improved. The results indicate the danger of basing decisions on grading systems that may be applied very differently in different institutions.