This work focuses on generating and using a reliable ground truth (GT) segmentation for a large medical repository of digital cervicographic images (cervigrams) collected by the National Cancer Institute (NCI). NCI invited twenty experts to manually segment a set of 939 cervigrams into regions of medical and anatomical interest. Based on this unique data set, the objectives of the current work are to: (1) automatically generate a multi-expert GT segmentation map; (2) use the GT map to automatically assess the complexity of a given segmentation task; and (3) use the GT map to evaluate the performance of an automated segmentation algorithm. The multi-expert GT map is generated via the STAPLE (Simultaneous Truth and Performance Level Estimation) algorithm, a well-known method for generating a GT segmentation from multiple observations. A new measure of segmentation complexity, which relies on the inter-observer variability within the GT map, is defined. This measure is used to identify images that the experts found difficult to segment and to compare the complexity of different segmentation tasks. An accuracy measure that evaluates the performance of automated segmentation algorithms is also presented. Two algorithms for cervix boundary detection are compared using the proposed accuracy measure, and the measure is shown to reflect the actual segmentation quality achieved by the algorithms. The methods and conclusions presented in this work are general and can be applied to different images and segmentation tasks. Here they are applied to the cervigram database, along with a thorough analysis of the available data.
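
To make the role of STAPLE concrete, the following is a minimal sketch of the standard binary expectation-maximization formulation, in which each expert is modeled by a sensitivity and a specificity and the output is a per-pixel probability of belonging to the region. It is an illustration under stated assumptions, not the implementation used in this work: the function name `staple_binary`, the initial values of 0.9, the default prior (the mean of all expert labels), and the stopping tolerance are all hypothetical choices.

```python
import numpy as np


def staple_binary(decisions, prior=None, n_iter=50, tol=1e-6):
    """Binary STAPLE via EM (illustrative sketch, not the authors' code).

    decisions: array of shape (n_pixels, n_experts) with 0/1 labels.
    Returns (W, p, q): per-pixel foreground probability, and per-expert
    sensitivity p and specificity q.
    """
    D = np.asarray(decisions, dtype=float)
    n_pixels, n_experts = D.shape
    if prior is None:
        # Assumption: global foreground prior taken from the expert labels.
        prior = D.mean()
    eps = 1e-12  # guards the logarithms against log(0)
    p = np.full(n_experts, 0.9)  # assumed initial sensitivities
    q = np.full(n_experts, 0.9)  # assumed initial specificities
    W = np.full(n_pixels, prior)

    for _ in range(n_iter):
        # E-step: posterior probability that each pixel is truly foreground.
        log_a = (np.log(prior + eps)
                 + D @ np.log(p + eps)
                 + (1.0 - D) @ np.log(1.0 - p + eps))
        log_b = (np.log(1.0 - prior + eps)
                 + (1.0 - D) @ np.log(q + eps)
                 + D @ np.log(1.0 - q + eps))
        diff = np.clip(log_b - log_a, -700.0, 700.0)  # avoid exp overflow
        W_new = 1.0 / (1.0 + np.exp(diff))

        # M-step: re-estimate each expert's performance parameters.
        p = (W_new @ D) / (W_new.sum() + eps)
        q = ((1.0 - W_new) @ (1.0 - D)) / ((1.0 - W_new).sum() + eps)

        converged = np.max(np.abs(W_new - W)) < tol
        W = W_new
        if converged:
            break

    return W, p, q
```

Calling `staple_binary` on a flattened stack of expert masks (one column per expert) returns a soft GT map `W`; thresholding it at 0.5 gives a binary consensus, and values of `W` far from 0 and 1 mark pixels on which the experts disagree. A multi-region segmentation such as the one described here would require either one binary run per region or a multi-label variant, which this sketch does not cover.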