Inter-observer variability of manual contour delineation of structures in CT

Eur Radiol. 2019 Mar;29(3):1391-1399. doi: 10.1007/s00330-018-5695-5. Epub 2018 Sep 7.

Abstract

Purpose: To quantify the inter-observer variability of manual delineation of lesions and organ contours in CT to establish a reference standard for volumetric measurements for clinical decision making and for the evaluation of automatic segmentation algorithms.

Materials and methods: Eleven radiologists manually delineated 3193 contours of liver tumours (896), lung tumours (1085), kidney contours (434) and brain hematomas (497) on 490 slices of clinical CT scans. A comparative analysis of the delineations was then performed to quantify the inter-observer delineation variability with standard volume metrics and with new group-wise metrics for delineations produced by groups of observers.

Results: The mean volume overlap variability values and ranges (in %) between the delineations of two observers were: liver tumours 17.8 [-5.8,+7.2]%, lung tumours 20.8 [-8.8,+10.2]%, kidney contours 8.8 [-0.8,+1.2]% and brain hematomas 18 [-6.0,+6.0]%. For any two randomly selected observers, the mean delineation volume overlap variability was 5-57%. The mean variability captured by groups of two, three and five observers was 37%, 53% and 72%; eight observers accounted for 75-94% of the total variability. For all cases, 38.5% of the delineation non-agreement was due to parts of the delineation of a single observer disagreeing with the others. No statistically significant difference in delineation variability was found between observers based on their expertise.
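
As an illustration of the pairwise comparison reported above, the following is a minimal sketch of how a volume overlap variability between observers could be computed from binary delineation masks. The abstract does not give the exact metric definition, so the sketch assumes the common volume overlap difference, 100 × (1 − |A∩B| / |A∪B|); the function names `overlap_variability` and `mean_pairwise_variability` are hypothetical and not taken from the study.

```python
from itertools import combinations

import numpy as np


def overlap_variability(a, b):
    """Volume overlap difference (in %) between two binary masks.

    Assumed definition: 100 * (1 - |A and B| / |A or B|).
    Returns 0.0 when both masks are empty.
    """
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0
    intersection = np.logical_and(a, b).sum()
    return 100.0 * (1.0 - intersection / union)


def mean_pairwise_variability(masks):
    """Mean overlap variability (in %) over all pairs of observer masks."""
    values = [overlap_variability(a, b) for a, b in combinations(masks, 2)]
    if not values:
        raise ValueError("at least two masks are required")
    return float(np.mean(values))


# Toy example: two observers delineate the same structure on a 4x4 slice.
observer_1 = np.zeros((4, 4), dtype=bool)
observer_1[0:2, 0:2] = True   # 4 pixels
observer_2 = np.zeros((4, 4), dtype=bool)
observer_2[0:2, 0:3] = True   # 6 pixels, fully containing observer_1's region

pair_value = overlap_variability(observer_1, observer_2)
group_value = mean_pairwise_variability([observer_1, observer_2, observer_1])
```

Under this assumed definition, two identical delineations give 0% and two disjoint delineations give 100%; the group-wise metrics mentioned in the Methods are not reproduced here because their definitions are not stated in the abstract.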

Conclusion: The variability in manual delineations for different structures and observers is large and spans a wide range across a variety of structures and pathologies. Two and even three observers may not be sufficient to establish the full range of inter-observer variability.

Key points:

  • This study quantifies the inter-observer variability of manual delineation of lesions and organ contours in CT.
  • The variability of manual delineations between two observers can be significant. Two and even three observers capture only a fraction of the full range of inter-observer variability observed in common practice.
  • Inter-observer manual delineation variability is necessary to establish a reference standard for radiologist training and evaluation and for the evaluation of automatic segmentation algorithms.

Keywords: Humans; Observer variation; Reproducibility of results.

MeSH terms

  • Algorithms*
  • Brain Neoplasms / diagnosis*
  • Humans
  • Kidney Neoplasms / diagnosis*
  • Liver / diagnostic imaging*
  • Liver Neoplasms / diagnosis*
  • Lung Neoplasms / diagnosis*
  • Multidetector Computed Tomography / methods*
  • ROC Curve