Background: Item response theory (IRT) scoring of health status questionnaires offers many advantages. However, to ensure 'backwards comparability' and to facilitate interpretations of results, we need the ability to express the IRT score in the metrics of the traditional scales.
Objectives: To develop procedures to calibrate IRT-based scores on the Headache Impact Test (HIT) into the metrics of the traditional headache scales. To assess the degree to which the calibrated HIT scores agree with the observed traditional scores and lead to the same conclusions in group comparisons.
Methods: We used telephone interview data (n = 1016) and Internet data (n = 1103) from general population surveys of recent headache sufferers. Analyses were conducted in four steps: (1) develop IRT models for all items; (2) for each IRT score level, calculate the expected score on each of the traditional scales (calibration); (3) adjust this calibrated score for measurement error in the IRT score; and (4) for each of the traditional scales, assess agreement between calibrated HIT scores and observed scores using the intraclass correlation (ICC), and evaluate the agreement of mean scores and the relative validity (RV) in discriminating among groups differing in migraine diagnosis, headache severity, and change in impact over time.
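Steps (2) and (3) can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not confirm: the items follow a generalized partial credit model, the traditional scale is an unweighted sum of item categories scored 0, 1, 2, ..., and the measurement-error adjustment averages the expected score over a normal distribution centred on the IRT score estimate. The item parameters and all function names are hypothetical.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Hypothetical item parameters: (discrimination a, list of step thresholds b).
ITEMS = [(1.0, [-1.0, 1.0]), (1.5, [-0.5, 0.0, 0.5])]


def gpcm_probs(theta, a, b):
    """Category probabilities for one item under an assumed
    generalized partial credit model."""
    b = np.asarray(b, dtype=float)
    # Cumulative logits: 0 for category 0, then running sums of a*(theta - b_j).
    z = np.concatenate(([0.0], np.cumsum(a * (theta - b))))
    z -= z.max()  # subtract the maximum for numerical stability
    p = np.exp(z)
    return p / p.sum()


def expected_scale_score(theta, items):
    """Step (2): expected sum score on the traditional scale at a given theta."""
    total = 0.0
    for a, b in items:
        p = gpcm_probs(theta, a, b)
        total += float(np.dot(np.arange(len(p)), p))
    return total


def error_adjusted_score(theta_hat, se, items, n_nodes=21):
    """Step (3), one possible reading: average the expected score over
    N(theta_hat, se^2) using Gauss-Hermite quadrature."""
    nodes, weights = hermegauss(n_nodes)
    weights = weights / weights.sum()  # normalize to standard normal weights
    return float(sum(w * expected_scale_score(theta_hat + se * x, items)
                     for x, w in zip(nodes, weights)))
```

Calling `expected_scale_score` over a grid of theta values yields a lookup table from IRT score to calibrated scale score; `error_adjusted_score` pulls calibrated scores toward the middle of the scale when the standard error `se` is large.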
Results: For the traditional categorical questionnaire items (the Migraine Specific Questionnaire (MSQ) and the Headache Disability Inventory (HDI)), the calibrated HIT scores agreed with the observed traditional scores: ICCs were between 0.80 and 0.94. In RV analyses, the maximum mean difference between the observed and expected scores was 1.7 points on a 0-100 scale for comparisons at one point in time. Analyses of change over time and analyses calibrating scores from the fixed-form HIT-6 to the metric of other questionnaires were also satisfactory, although less precise. Analysis of non-standard questionnaire items (e.g. 'On how many days in the past 3 months did you have a headache?', from the HIMQ and the MIDAS) required special IRT models. Agreement was lower for these items: ICCs were between 0.56 and 0.61, and the maximum mean differences were 2.9 (on a 0-270 scale) and 3.8 (on a 0-450 scale) in RV analyses at one point in time. The ability of the calibrated scale scores to discriminate between groups was at least as good as the ability of the observed sum scales and often remarkably better.
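The agreement and discrimination statistics reported above can be sketched as follows. The abstract does not state which ICC form was used, so this example assumes the two-way, absolute-agreement, single-measure form, and it assumes the common definition of RV as the ratio of one-way ANOVA F statistics (candidate measure over reference measure). All function names are hypothetical.

```python
import numpy as np


def icc_agreement(x, y):
    """Two-way, absolute-agreement, single-measure ICC between two sets
    of scores for the same respondents (assumed ICC form)."""
    data = np.column_stack([x, y]).astype(float)
    n, k = data.shape
    grand = data.mean()
    ss_rows = k * np.sum((data.mean(axis=1) - grand) ** 2)  # between respondents
    ss_cols = n * np.sum((data.mean(axis=0) - grand) ** 2)  # between measures
    ss_err = np.sum((data - grand) ** 2) - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


def f_statistic(scores, groups):
    """One-way ANOVA F statistic for scores across known groups."""
    scores = np.asarray(scores, dtype=float)
    groups = np.asarray(groups)
    labels = np.unique(groups)
    grand = scores.mean()
    ss_between = sum(
        np.sum(groups == g) * (scores[groups == g].mean() - grand) ** 2
        for g in labels
    )
    ss_within = sum(
        np.sum((scores[groups == g] - scores[groups == g].mean()) ** 2)
        for g in labels
    )
    df_between = len(labels) - 1
    df_within = len(scores) - len(labels)
    return (ss_between / df_between) / (ss_within / df_within)


def relative_validity(candidate, reference, groups):
    """RV > 1 means the candidate separates the groups better than the reference."""
    return f_statistic(candidate, groups) / f_statistic(reference, groups)
```

Under these assumptions, a constant offset between calibrated and observed scores lowers the ICC even when the two are perfectly correlated, which is why an absolute-agreement form suits a calibration check.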
Conclusion: The theoretical advantage of IRT models in scale calibration is supported by our results. This approach to achieving comparability of new and widely used scales and accelerating the accumulation of interpretation guidelines based on previous work warrants testing for measures of other generic and disease-specific concepts.