Background: Bias of clinical examiners against some types of candidate, based on characteristics such as sex or ethnicity, would represent a threat to the validity of an examination, since sex and ethnicity are 'construct-irrelevant' characteristics. In this paper we report a novel method for assessing sex and ethnic bias in over 2000 examiners who had taken part in the PACES and nPACES (new PACES) examinations of the MRCP(UK).
Method: PACES and nPACES are clinical skills examinations that have two examiners at each station who mark candidates independently. Because both examiners observe the same performance, differences between their marks cannot be due to differences in candidate performance, and hence may result from bias or unreliability on the part of the examiners. By comparing each examiner against a 'basket' of all of their co-examiners, it is possible to identify examiners whose behaviour is anomalous. The method assessed hawkishness-doveishness, sex bias, ethnic bias and, as a control condition to assess the statistical method, 'even-number bias' (i.e. treating candidates with odd and even exam numbers differently). Significance levels were Bonferroni corrected because of the large number of examiners being considered.
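The comparison described above can be illustrated with a minimal sketch. It is not the paper's actual statistical model: it assumes that each encounter is reduced to the examiner's mark minus the co-examiner's mark, that candidates fall into two groups coded 0 and 1, and that a large-sample normal approximation is adequate for the p-values. The function names (`examiner_stats`, `bonferroni_alpha`) and the data layout are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def two_sided_p(estimate, se):
    """Two-sided p-value under a normal approximation (an assumption of this sketch)."""
    if se == 0:
        return 1.0 if estimate == 0 else 0.0
    return 2.0 * (1.0 - NormalDist().cdf(abs(estimate / se)))


def examiner_stats(encounters):
    """Summarise one examiner against the 'basket' of co-examiners.

    encounters: list of (own_mark, co_examiner_mark, group), group in {0, 1}.
    Returns (mean_diff, p_mean_diff, group_bias, p_group_bias).
    A negative mean_diff means the examiner marks lower than co-examiners.
    """
    diffs = [own - co for own, co, _ in encounters]
    g0 = [own - co for own, co, g in encounters if g == 0]
    g1 = [own - co for own, co, g in encounters if g == 1]
    if len(g0) < 2 or len(g1) < 2:
        raise ValueError("need at least two encounters in each candidate group")

    # Overall stringency: does the examiner differ from co-examiners on average?
    mean_diff = mean(diffs)
    se_mean = sqrt(variance(diffs) / len(diffs))

    # Group bias: does that difference depend on the candidate's group?
    group_bias = mean(g1) - mean(g0)
    se_bias = sqrt(variance(g0) / len(g0) + variance(g1) / len(g1))

    return (mean_diff, two_sided_p(mean_diff, se_mean),
            group_bias, two_sided_p(group_bias, se_bias))


def bonferroni_alpha(alpha, n_examiners):
    """Per-examiner significance threshold after Bonferroni correction."""
    return alpha / n_examiners
```

Under these assumptions an examiner would be flagged only if the relevant p-value fell below `bonferroni_alpha(0.05, n_examiners)`, where `n_examiners` is the number of examiners tested.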
Results: The results of 26 diets of PACES and six diets of nPACES were examined statistically to assess the extent of hawkishness, as well as sex bias and ethnicity bias in individual examiners. The control (even-number) condition suggested that about 5% of examiners were significant at an (uncorrected) 5% level, and that the method therefore worked as expected. As in a previous study (BMC Medical Education, 2006, 6:42), some examiners were hawkish or doveish relative to their peers. No examiners showed significant sex bias, and only a single examiner showed evidence consistent with ethnic bias. A re-analysis of the data considering only one examiner per station, as would be the case for many clinical examinations, showed that analysis with a single examiner runs a serious risk of false positive identifications, probably due to differences in case-mix and content-specificity.
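The logic of the control condition can be illustrated with a small simulation. This is a sketch on synthetic data, not a re-analysis of the examination data: it assumes examiner-minus-co-examiner differences that are normally distributed and unrelated to exam-number parity, and it uses a normal approximation for the test. The function names and default sizes are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, variance


def parity_p_value(diffs_odd, diffs_even):
    """Two-sided p-value (normal approximation) for the odd vs even mean difference."""
    se = sqrt(variance(diffs_odd) / len(diffs_odd)
              + variance(diffs_even) / len(diffs_even))
    z = (mean(diffs_even) - mean(diffs_odd)) / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def simulate_control(n_examiners=1000, n_encounters=100, alpha=0.05, seed=0):
    """Fraction of simulated unbiased examiners flagged, uncorrected and corrected."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(n_examiners):
        # Mark differences that, by construction, ignore exam-number parity.
        diffs = [rng.gauss(0.0, 1.0) for _ in range(n_encounters)]
        p_values.append(parity_p_value(diffs[0::2], diffs[1::2]))
    uncorrected = sum(p < alpha for p in p_values) / n_examiners
    # Bonferroni correction divides the threshold by the number of examiners.
    corrected = sum(p < alpha / n_examiners for p in p_values) / n_examiners
    return uncorrected, corrected
```

With no true effect, the uncorrected fraction should sit close to the nominal 5% and the Bonferroni-corrected fraction close to zero, which is the pattern the control condition is meant to confirm.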
Conclusions: In examinations where there are two independent examiners at a station, our method can assess the extent of bias against candidates with particular characteristics. The method would be far less sensitive in examinations with only a single examiner per station, as examiner variance would be confounded with candidate performance variance. The method, however, works well when there is more than one examiner at a station, and in the case of the current MRCP(UK) clinical examination, nPACES, it found possible sex bias in no examiners and possible ethnic bias in only one.