Rationale and objectives: To examine, in a simulation study, the effects of the number of categories in the rating scale used in an observer experiment on the results of ROC analysis.
Materials and methods: We have previously evaluated the effects of computer-aided diagnosis on radiologists' characterization of malignant and benign breast masses in serial mammograms. The evaluation of the likelihood of malignancy was performed on a quasi-continuous (0-100 points) confidence rating scale. In this study, we simulated the use of discrete confidence rating scales with fewer categories and analyzed the results with receiver operating characteristic (ROC) methodology. The observers' estimates of the likelihood of malignancy were also mapped to BI-RADS assessments with five and seven categories, and ROC analysis was performed. The area under the ROC curve and the partial area index obtained from ROC analysis of the different confidence rating scales were compared.
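As an illustrative sketch of the kind of mapping described above, the following bins quasi-continuous 0-100 ratings into a k-category discrete scale. The equal-width bin boundaries here are an assumption for illustration only; the study's actual category boundaries (including the BI-RADS mappings) are not specified in this abstract.

```python
def to_discrete_scale(ratings, k):
    """Map 0-100 confidence ratings to categories 1..k.

    Uses equal-width bins (an illustrative assumption, not the
    study's actual mapping); a rating of exactly 100 falls in
    the top category.
    """
    width = 100.0 / k
    return [min(int(r // width) + 1, k) for r in ratings]

# Example: six ratings mapped onto a 5-category scale.
ratings = [0, 12, 37, 50, 88, 100]
print(to_discrete_scale(ratings, 5))  # -> [1, 1, 2, 3, 5, 5]
```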
Results: The fitted ROC curves and the performance indices did not change significantly when the confidence rating scale was varied from 6 to 101 points, provided that the estimated operating points obtained directly from the data were distributed relatively evenly over the entire range of true-positive fraction (TPF) and false-positive fraction (FPF). Mapping the likelihood-of-malignancy observer data to the seven-category BI-RADS assessment scale allowed reliable ROC analysis, whereas mapping to the five-category BI-RADS scale could cause erratic ROC curve fitting because of the lack of operating points in the mid-range, or failure in ROC curve fitting because of data degeneration for some observers.
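The empirical operating points referred to above are the (FPF, TPF) pairs obtained by thresholding the discrete ratings. A minimal sketch, assuming a k-category scale where each threshold t calls a case positive when its rating is at least t (the function name and sample ratings are hypothetical, not from the study):

```python
def operating_points(pos_ratings, neg_ratings, k):
    """Empirical (FPF, TPF) operating points for a k-category scale.

    One point per decision threshold t = k, k-1, ..., 2, where a case
    is called positive if its rating >= t. Ratings are integers 1..k.
    """
    pts = []
    for t in range(k, 1, -1):
        tpf = sum(r >= t for r in pos_ratings) / len(pos_ratings)
        fpf = sum(r >= t for r in neg_ratings) / len(neg_ratings)
        pts.append((fpf, tpf))
    return pts

# Hypothetical ratings on a 5-category scale: points spread over the
# FPF/TPF range, the condition the abstract ties to reliable fitting.
pos = [3, 4, 5, 5, 2]  # ratings of actually malignant cases
neg = [1, 2, 2, 3, 1]  # ratings of actually benign cases
print(operating_points(pos, neg, 5))
```

With a five-category scale only four operating points exist; if they cluster at the extremes (no mid-range points), curve fitting can become erratic, which is the failure mode the results describe.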
Conclusion: ROC analysis of discrete confidence rating scales with few but relatively evenly distributed data points over the entire FPF and TPF range is comparable to that of a quasi-continuous rating scale. However, ROC analysis of discrete confidence rating scales with few and unevenly distributed data points may yield unreliable estimates.