Background: Performance assessments rely on human judgment, and are vulnerable to rater effects (e.g. leniency or harshness). Making valid inferences from performance ratings for high-stakes decisions requires the management of rater effects. A simple method for detecting extreme raters that does not require sophisticated statistical knowledge or software has been developed as part of the quality assurance process for objective structured clinical examinations (OSCEs). We believe it is applicable to a range of examinations that rely on human raters.
Methods: The method has three steps. First, extreme raters are identified by comparing individual rater means with the mean of all raters. A rater is deemed extreme if their mean is three standard deviations below (hawks) or above (doves) the overall mean. This criterion is adjustable. Second, the distribution of an extreme rater's scores is compared with the overall distribution for the station. This step mitigates a station effect. Third, the cohort of candidates seen by the rater is examined to ensure that any cohort effect is ruled out.
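The first step can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on assumptions the abstract does not settle: the "overall mean" is taken as the mean of the individual rater means, the standard deviation is the sample standard deviation of those rater means, and a rater exactly at the threshold is flagged. The function name `flag_extreme_raters`, the input layout and the parameter `k` are hypothetical.

```python
from statistics import mean, stdev


def flag_extreme_raters(scores_by_rater, k=3.0):
    """Flag raters whose mean score lies at least k SDs from the overall mean.

    scores_by_rater: dict mapping a rater ID to a list of that rater's scores
                     (hypothetical layout, chosen for illustration).
    k: adjustable threshold in standard deviations (3 in the abstract).

    Assumption: the overall mean and SD are computed over the rater means,
    not over the pooled individual scores.
    """
    rater_means = {rater: mean(scores) for rater, scores in scores_by_rater.items()}
    if len(rater_means) < 2:
        return {}  # an SD cannot be computed from fewer than two raters

    overall_mean = mean(rater_means.values())
    spread = stdev(rater_means.values())
    if spread == 0:
        return {}  # all raters have identical means, so none is extreme

    flagged = {}
    for rater, rater_mean in rater_means.items():
        z = (rater_mean - overall_mean) / spread
        if z <= -k:
            flagged[rater] = "hawk"  # unusually harsh
        elif z >= k:
            flagged[rater] = "dove"  # unusually lenient
    return flagged
```

Calling `flag_extreme_raters(scores, k=3.0)` returns only the candidates for review. Under the method described above, a flagged rater would still need the second and third steps (comparison with the station distribution and inspection of the candidate cohort) before being treated as extreme.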
Results and implications: Of more than 3000 raters, fewer than 0.3% have been identified as extreme using the proposed criteria. Rater performance is monitored on a regular basis, and the impact of these raters on candidate results will be considered before results are finalised. Extreme raters are contacted by the organisation to review their rating style. If this intervention fails to modify the rater's scoring pattern, the rater is no longer invited back. As more data are collected, the organisation will assess them to inform the development of approaches to improve extreme rater performance.
© Blackwell Publishing Ltd 2013.