An Evaluation of Interrater Reliability Measures on Binary Tasks Using d-Prime

Appl Psychol Meas. 2017 Jun;41(4):264-276. doi: 10.1177/0146621616684584. Epub 2016 Dec 29.

Abstract

Many indices of interrater agreement on binary tasks have been proposed to assess reliability, but none has escaped criticism. In a series of Monte Carlo simulations, five such indices were evaluated against d-prime, an unbiased indicator of raters' ability to distinguish between the true presence or absence of the characteristic being judged. Phi and, to a lesser extent, Kappa coefficients performed best across variations in characteristic prevalence, raters' expertise, and raters' bias. Correlations with d-prime for Percentage Agreement, Scott's Pi, and Gwet's AC1 were markedly lower. In situations where two raters make a series of binary judgments, the findings suggest that researchers should choose Phi or Kappa to assess interrater agreement, because the performance of these indices was least affected by variations in the decision environment and in the characteristics of the decision makers.
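To make the compared quantities concrete, the following is a minimal Python sketch, not code from the paper, showing the standard formulas for the five agreement indices computed from a two-rater 2x2 agreement table, plus a signal-detection d-prime for a single rater judged against ground truth. The function names, the log-linear correction in d_prime, and the example counts are illustrative assumptions.

```python
import math
from statistics import NormalDist

def agreement_indices(a, b, c, d):
    """Agreement indices for two raters on a binary task, from the 2x2
    agreement table: a = both 'yes', b = rater1 'yes'/rater2 'no',
    c = rater1 'no'/rater2 'yes', d = both 'no'."""
    n = a + b + c + d
    po = (a + d) / n                    # observed (percentage) agreement

    p1, p2 = (a + b) / n, (a + c) / n   # each rater's 'yes' rate
    pbar = (p1 + p2) / 2                # average 'yes' rate across raters

    pe_kappa = p1 * p2 + (1 - p1) * (1 - p2)   # chance agreement, Cohen's Kappa
    pe_pi = pbar ** 2 + (1 - pbar) ** 2        # chance agreement, Scott's Pi
    pe_ac1 = 2 * pbar * (1 - pbar)             # chance agreement, Gwet's AC1

    phi = (a * d - b * c) / math.sqrt(
        (a + b) * (c + d) * (a + c) * (b + d))

    return {
        "percentage_agreement": po,
        "kappa": (po - pe_kappa) / (1 - pe_kappa),
        "scotts_pi": (po - pe_pi) / (1 - pe_pi),
        "gwets_ac1": (po - pe_ac1) / (1 - pe_ac1),
        "phi": phi,
    }

def d_prime(hits, misses, false_alarms, correct_rejections):
    """Signal-detection d' for one rater against ground truth; a log-linear
    correction keeps the z-scores finite when a rate is 0 or 1."""
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)

if __name__ == "__main__":
    # Hypothetical counts: 40 joint 'yes', 5 and 10 disagreements, 45 joint 'no'.
    print(agreement_indices(40, 5, 10, 45))
    # Hypothetical rater vs. ground truth: 45 hits, 5 misses, 8 FAs, 42 CRs.
    print(round(d_prime(45, 5, 8, 42), 3))
```

Note that the agreement indices use only the two raters' judgments, whereas d-prime additionally requires the true state of each case, which is why the simulations could treat it as the criterion against which the indices were evaluated.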

Keywords: Kappa; Percentage Agreement; Phi correlation; interrater agreement; reliability; research methods.