Purpose: To determine the reliability of 3 scales for assessing soft tissue inflammatory and congestive signs associated with thyroid eye disease.
Methods: This multicenter prospective observational study recruited 55 adults with thyroid eye disease from 9 international centers. Six thyroid eye disease soft tissue features were measured, and each sign was graded using 3 scales (presence/absence [0-1], 3-point scale [0-2], and percentage [0-100]). Each eye was graded twice by 2 independent raters. For each inflammatory/congestive feature, accuracy (fraction of agreement) was calculated between the 2 trials for each rater (intrarater reliability) and between raters for all trials (interrater reliability) to determine the most sensitive scale for each feature that maintained a threshold of agreement greater than 0.70.
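As an illustration of the agreement measure described in the Methods, the sketch below computes a fraction of agreement between two sets of grades and then picks the finest scale that still meets a threshold. It is a minimal sketch, not the study's analysis code: the function names, the ordering of scales from most to least sensitive (0-100, then 0-2, then 0-1), and the use of "at or above 0.70" as the cutoff are assumptions made here for illustration.

```python
from typing import Mapping, Optional, Sequence

# Assumed ordering from most to least sensitive (finest to coarsest grading).
SCALES_BY_SENSITIVITY = ("0-100", "0-2", "0-1")


def fraction_of_agreement(first: Sequence[int], second: Sequence[int]) -> float:
    """Share of paired grades that are identical (exact agreement)."""
    if len(first) != len(second) or len(first) == 0:
        raise ValueError("grade lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(first, second))
    return matches / len(first)


def most_sensitive_reliable_scale(
    agreement_by_scale: Mapping[str, float], threshold: float = 0.70
) -> Optional[str]:
    """Return the finest scale whose agreement meets the threshold, else None."""
    for scale in SCALES_BY_SENSITIVITY:
        # Assumption: a scale counts as reliable at or above the threshold.
        if agreement_by_scale.get(scale, 0.0) >= threshold:
            return scale
    return None
```

For intrarater reliability the two lists would hold one rater's grades from trial 1 and trial 2; for interrater reliability they would hold the two raters' grades for the same eyes. A return value of `None` corresponds to a feature that no scale measures reliably.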
Results: Fifty-five patients had 218 assessments for 6 thyroid eye disease metrics. Intrarater reliability was consistently better than interrater reliability for each feature. With an agreement threshold of 0.70 or better on the interrater tests, conjunctival and eyelid edema could be reliably measured using the 0-1 or 0-2 scale, while conjunctival and eyelid redness could be reliably measured only with the binary 0-1 scale. Caruncular edema and superior conjunctival redness could not be measured reliably between 2 raters with any scale. The percentage scale had poor agreement unless slippage intervals of >20% were allowed on either side of the measurements.
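The effect of a slippage interval on the percentage scale can be sketched as agreement within a tolerance: two grades count as agreeing when they differ by no more than a chosen number of percentage points on either side. This is a hedged illustration only. The function name is hypothetical, the grades in the usage lines are invented for the example and are not study data, and the study's exact handling of the interval boundary is not stated in the abstract.

```python
from typing import Sequence


def agreement_within_tolerance(
    first: Sequence[float], second: Sequence[float], tolerance: float
) -> float:
    """Share of paired percentage grades that differ by at most `tolerance` points."""
    if len(first) != len(second) or len(first) == 0:
        raise ValueError("grade lists must be non-empty and of equal length")
    if tolerance < 0:
        raise ValueError("tolerance must be non-negative")
    within = sum(abs(a - b) <= tolerance for a, b in zip(first, second))
    return within / len(first)


# Invented grades for illustration only (percent of tissue affected).
rater_a = [10, 50, 80, 30]
rater_b = [15, 75, 60, 30]

exact = agreement_within_tolerance(rater_a, rater_b, tolerance=0)
loose = agreement_within_tolerance(rater_a, rater_b, tolerance=20)
```

With a tolerance of 0 this reduces to exact agreement; widening the tolerance raises agreement at the cost of the scale's effective resolution.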
Conclusions: Among the periocular soft tissue inflammatory features of the Clinical Activity Score and Vision, Inflammation, Strabismus, Appearance scales that were compared between raters, edema of the eyelids and conjunctiva could be measured reliably with both the 0-1 and 0-2 scales, and erythema of the eyelid and bulbar conjunctiva could be measured reliably only with the 0-1 scale. Superior bulbar erythema and caruncular edema were not measured reliably with any scale.