How well do workplace-based assessments support summative entrustment decisions? A multi-institutional generalisability study

Med Educ. 2024 Jan 2. doi: 10.1111/medu.15291. Online ahead of print.


Background: Assessment of the Core Entrustable Professional Activities for Entering Residency (Core EPAs) requires direct observation through workplace-based assessments (WBAs). Single-institution studies have reported mixed findings on the reliability of WBAs developed to measure student progression towards entrustment. Factors such as faculty development, rater engagement and scale selection have been proposed as ways to improve reliability. The purpose of this investigation was to conduct a multi-institutional generalisability study to determine how such factors influence the reliability of WBAs.

Methods: The authors analysed WBA data obtained for clerkship-level students across seven institutions from 2018 to 2020. Institutions implemented a variety of strategies, including the use of designated assessors (DAs), altered scales and different EPAs, and data were aggregated by these factors. Generalisability theory was then used to examine the internal structure validity evidence of the data, with an unbalanced cross-classified random-effects model used to decompose variance components. A phi coefficient of >0.7 was used as the threshold for acceptable reliability.
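For orientation (standard generalisability-theory notation, not reproduced from the paper itself), the phi coefficient for absolute decisions based on n observations per learner takes the form

Φ = σ²_learner / (σ²_learner + σ²_abs / n),

where σ²_abs pools all non-learner (absolute error) variance components. When the learner accounts for little of the total variance, n must grow substantially before Φ crosses a threshold such as 0.7.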

Results: Data from 53 565 WBAs were analysed, and a total of 77 generalisability studies were performed. Most data came from EPAs 1 (n = 17 118, 32%), 2 (n = 10 237, 19.1%) and 6 (n = 6000, 18.5%). Low variance attributed to the learner (<10%) was found in most analyses (59/77, 76%), resulting in a relatively large number of observations required to reach acceptable reliability (range = 3 to >560, median = 60). Factors such as designated assessor use, scale or EPA were not consistently associated with improved reliability.
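The link between low learner variance and the large number of required observations can be sketched as a simple decision-study calculation. This is an illustrative sketch using the standard phi formula, not the authors' analysis code, and the variance figures below are hypothetical:

```python
# Decision-study sketch: given variance components from a G-study, estimate
# the phi (dependability) coefficient for n observations per learner and the
# minimum n needed to reach a chosen reliability threshold.

def phi(var_learner: float, var_error: float, n_obs: int) -> float:
    """Phi coefficient with absolute error variance averaged over n_obs observations."""
    return (var_learner * n_obs) / (var_learner * n_obs + var_error)

def min_observations(var_learner: float, var_error: float,
                     threshold: float = 0.7, max_n: int = 1000):
    """Smallest number of observations with phi >= threshold, or None if above max_n."""
    for n in range(1, max_n + 1):
        if phi(var_learner, var_error, n) >= threshold:
            return n
    return None

# Hypothetical components: learner variance is 10% of the learner-plus-error
# total, i.e. the paper's low-variance boundary.
print(min_observations(var_learner=1.0, var_error=9.0))  # prints 21
```

With learner variance at the 10% boundary, roughly 21 observations per learner are needed to reach Φ ≥ 0.7; as the learner's share of variance shrinks further, the required number of observations climbs quickly, consistent with the wide range reported above.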

Conclusion: The results of this study describe relatively low reliability for the WBAs obtained across seven sites, and the generalisability of these instruments appeared less dependent than expected on factors such as faculty development, rater engagement or scale selection. Data from these instruments may be useful for formative feedback. However, the instruments do not consistently provide the reliability needed to justify their use in high-stakes summative entrustment decisions.