In a previous study we demonstrated that automatic retrospective registration algorithms can frequently register magnetic resonance (MR) and computed tomography (CT) images of the brain with an accuracy of better than 2 mm, but in that same study we found that such algorithms sometimes fail, leading to errors of 6 mm or more. Before these algorithms can be used routinely in the clinic, methods must be provided for distinguishing between registration solutions that are clinically satisfactory and those that are not. One approach is to rely on a human observer to inspect the registration results and reject images that have been registered with insufficient accuracy. In this paper, we present a methodology for evaluating the efficacy of the visual assessment of registration accuracy. Since the clinical requirements for level of registration accuracy are likely to be application dependent, we have evaluated the accuracy of the observer's estimate relative to six thresholds: 1-6 mm. The performance of the observers was evaluated relative to the registration solution obtained using external fiducial markers that are screwed into the patient's skull and that are visible in both MR and CT images. This fiducial marker system provides the gold standard for our study. Its accuracy is shown to be approximately 0.5 mm. Two experienced, blinded observers viewed five pairs of clinical MR and CT brain images, each of which had been misregistered with respect to the gold standard solution. Fourteen misregistrations were assessed for each image pair, with misregistration errors distributed approximately uniformly between 0 and 10 mm. For each misregistered image pair, each observer estimated the registration error (in millimeters) at each of five locations distributed around the head using each of three assessment methods. 
These estimated errors were compared with the errors as measured by the gold standard to determine agreement relative to each of the six thresholds, where agreement means that the two errors lie on the same side of the threshold. The effect of error in the gold standard itself is taken into account in the analysis of the assessment methods. The results were analyzed by means of the Kappa statistic, the agreement rate, and the area under receiver-operating-characteristic (ROC) curves. No assessment method performed well at 1 mm, but all methods performed well at thresholds of 2 mm and higher. For these five thresholds (2-6 mm), two methods agreed with the standard at least 80% of the time and exhibited mean ROC areas greater than 0.84. One of these same methods exhibited Kappa statistics that indicated good agreement relative to chance (Kappa > 0.6) between the pooled observers and the standard for these same five thresholds. Further analysis demonstrates that the results depend strongly on the choice of the distribution of misregistration errors presented to the observers.
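
The three summary measures named above can be illustrated with a minimal Python sketch. It is not the analysis code used in the study: the function names and the sample values are hypothetical, the Kappa shown is the standard two-rater Cohen form, the ROC area is computed with the rank-based (Mann-Whitney) estimate, and the correction for error in the gold standard is not modeled.

```python
from typing import List, Sequence


def dichotomize(errors_mm: Sequence[float], threshold_mm: float) -> List[bool]:
    """Label each error True when it lies above the threshold."""
    return [e > threshold_mm for e in errors_mm]


def agreement_rate(estimated: Sequence[float], gold: Sequence[float],
                   threshold_mm: float) -> float:
    """Fraction of cases where both errors lie on the same side of the threshold."""
    est = dichotomize(estimated, threshold_mm)
    ref = dichotomize(gold, threshold_mm)
    return sum(a == b for a, b in zip(est, ref)) / len(ref)


def cohen_kappa(estimated: Sequence[float], gold: Sequence[float],
                threshold_mm: float) -> float:
    """Agreement corrected for chance: (p_obs - p_chance) / (1 - p_chance)."""
    est = dichotomize(estimated, threshold_mm)
    ref = dichotomize(gold, threshold_mm)
    n = len(ref)
    p_obs = sum(a == b for a, b in zip(est, ref)) / n
    p_est = sum(est) / n  # fraction the observer places above the threshold
    p_ref = sum(ref) / n  # fraction the gold standard places above the threshold
    p_chance = p_est * p_ref + (1 - p_est) * (1 - p_ref)
    if p_chance == 1.0:
        return float("nan")  # Kappa is undefined when only one class occurs
    return (p_obs - p_chance) / (1 - p_chance)


def roc_area(estimated: Sequence[float], gold: Sequence[float],
             threshold_mm: float) -> float:
    """Probability that a truly-above-threshold case receives a larger
    estimate than a truly-below-threshold case (ties count one half)."""
    ref = dichotomize(gold, threshold_mm)
    pos = [s for s, r in zip(estimated, ref) if r]
    neg = [s for s, r in zip(estimated, ref) if not r]
    if not pos or not neg:
        return float("nan")  # ROC area needs both classes present
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical example values in millimeters (not data from the study).
gold_mm = [0.5, 1.5, 2.5, 3.5, 5.0, 8.0]
estimated_mm = [1.0, 2.5, 2.0, 4.0, 4.5, 7.0]

for threshold in (1.0, 2.0, 3.0, 4.0, 5.0, 6.0):
    print(threshold,
          agreement_rate(estimated_mm, gold_mm, threshold),
          cohen_kappa(estimated_mm, gold_mm, threshold),
          roc_area(estimated_mm, gold_mm, threshold))
```

Dichotomizing both the observer's estimate and the gold-standard error at the same threshold is what makes the agreement rate and Kappa threshold dependent, whereas the ROC area uses the observer's estimate as a continuous score and dichotomizes only the gold standard.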