Copenhagen, Denmark
Onsite/Online

ESTRO 2022

Session Item

Saturday
May 07
09:00 - 10:00
Poster Station 1
01: Image processing & analysis
René Winter, Norway
1180
Poster Discussion
Physics
Multicenter comparison of measures for quantitative evaluation of automatic contouring
Ellen Brunenberg, The Netherlands
PD-0064

Abstract

Multicenter comparison of measures for quantitative evaluation of automatic contouring
Authors:

Ellen Brunenberg1, Jan Derks van de Ven1, Mark J Gooding2, Djamal Boukerroui2, Yong Gan3, Edward Henderson4, Gregory C Sharp5, Femke Vaassen6, Eliana Vasquez Osorio4, Jinzhong Yang7, René Monshouwer1

1Radboudumc, Radiation Oncology, Nijmegen, The Netherlands; 2Mirada Medical Ltd, Science, Oxford, United Kingdom; 3University of Groningen, University Medical Center Groningen, Radiation Oncology, Groningen, The Netherlands; 4University of Manchester, Division of Cancer Studies, School of Medical Sciences, Faculty of Biology, Medicine and Health, Manchester, United Kingdom; 5Massachusetts General Hospital, Harvard Medical School, Radiation Oncology, Boston, USA; 6Maastricht University Medical Centre, Department of Radiation Oncology (MAASTRO), GROW - School for Oncology and Developmental Biology, Maastricht, The Netherlands; 7The University of Texas MD Anderson Cancer Center, Radiation Physics, Houston, USA

Purpose or Objective

Automatic contouring performance can be evaluated quantitatively using geometric measures. Overlap measures such as the Dice Similarity Coefficient (DSC) are computationally straightforward and yield consistent results across implementations. However, the definition and implementation of distance measures such as the Hausdorff distance (HD) strongly influence the results, hindering comparison between centers [1]. To assess this, we performed a multicenter benchmark study using both synthetic and real data.
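
To make the contrast concrete, a minimal sketch of one possible implementation of both kinds of measures is given below (hypothetical code, not any contributor's pipeline; it assumes boolean voxel masks for DSC and (N, 3) point arrays in mm for HD, and exactly such representation choices are what the survey probes):

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def dsc(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Dice Similarity Coefficient: 2|A intersect B| / (|A| + |B|) on boolean masks."""
    overlap = np.logical_and(mask_a, mask_b).sum()
    return 2.0 * overlap / (mask_a.sum() + mask_b.sum())

def hd100(points_a: np.ndarray, points_b: np.ndarray) -> float:
    """Maximum (symmetric) Hausdorff distance between (N, 3) point sets in mm.
    This version is point-based, 3D, and does not interpolate between points;
    each of these is an implementation choice that can change the result."""
    return max(directed_hausdorff(points_a, points_b)[0],
               directed_hausdorff(points_b, points_a)[0])
```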

Material and Methods

In our survey, contributors were first asked to list the measures used in their contour evaluation pipeline, including definitions, implementation methods, and (if applicable) the source. They were then asked to process two datasets. The first contained synthetic shapes (squares, spheres, octahedra) with differing size, position, and control point spacing between reference and test contours. The second consisted of publicly available clinical CT data with contouring ground truth and test contours [2]. Both datasets had an in-plane resolution of 0.977 mm and a slice thickness of 2 mm.
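
As an illustration of how such a synthetic pair can be constructed (a hypothetical sketch; the study's actual shapes, sizes, and spacings are not reproduced here), consider the same square boundary sampled at two different control point spacings:

```python
import numpy as np

def square_contour(side_mm: float, spacing_mm: float) -> np.ndarray:
    """Sample the boundary of an axis-aligned square as (N, 2) points
    with approximately the requested control point spacing."""
    n_side = max(1, int(round(side_mm / spacing_mm)))
    t = np.linspace(0.0, side_mm, n_side, endpoint=False)
    s = side_mm
    edges = [np.column_stack([t, np.zeros_like(t)]),        # bottom edge
             np.column_stack([np.full_like(t, s), t]),      # right edge
             np.column_stack([s - t, np.full_like(t, s)]),  # top edge
             np.column_stack([np.zeros_like(t), s - t])]    # left edge
    return np.vstack(edges)

reference = square_contour(50.0, spacing_mm=1.0)  # finely sampled reference
test = square_contour(50.0, spacing_mm=5.0)       # same shape, coarser sampling
```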

Results

The survey was completed by 7 institutes, using 8 different implementations for DSC, 10 for the maximum Hausdorff distance (HD100), 9 for the 95th percentile Hausdorff distance (HD95), 12 for the average distance (AD), 4 for surface DSC, and 3 for added path length (APL). Figure 1 shows how the contributions varied in their implementation choices regarding dimensionality and contour model.


Because most DSC results agreed well, and because the definitions of AD, surface DSC, and APL already varied widely, we focused on the HD results. As Figure 2 shows, the variation in results is large. Most of the HD100 outliers (Figure 2.i and 2.iii) came from a mesh-based implementation that measured distance along the surface normal vector. For synthetic shapes B and C, the control point spacing differed between reference and test contours; the deviating measurements around 40-50 mm (Figure 2.i and 2.ii) were due to implementations that did not interpolate between test contour points, an effect illustrated by the sketch below.
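
The interpolation effect is easy to reproduce with a toy example (hypothetical code, unrelated to any contributor's pipeline): the same circle sampled at roughly 1 mm and 5 mm control point spacing yields a clearly nonzero point-to-point HD100, which nearly vanishes once both contours are linearly interpolated:

```python
import numpy as np
from scipy.spatial import cKDTree

def circle(n_points: int, radius_mm: float = 25.0) -> np.ndarray:
    """Sample a circle as an (N, 2) contour."""
    t = np.linspace(0.0, 2.0 * np.pi, n_points, endpoint=False)
    return np.column_stack([radius_mm * np.cos(t), radius_mm * np.sin(t)])

def densify(contour: np.ndarray, step_mm: float = 0.1) -> np.ndarray:
    """Linearly interpolate along each segment of a closed contour so that
    point-to-point distances approximate point-to-curve distances."""
    closed = np.vstack([contour, contour[:1]])
    pieces = []
    for p, q in zip(closed[:-1], closed[1:]):
        n = max(1, int(np.ceil(np.linalg.norm(q - p) / step_mm)))
        frac = np.linspace(0.0, 1.0, n, endpoint=False)[:, None]
        pieces.append(p + frac * (q - p))
    return np.vstack(pieces)

def hd100(a: np.ndarray, b: np.ndarray) -> float:
    """Maximum symmetric Hausdorff distance between two point sets."""
    return max(cKDTree(b).query(a)[0].max(), cKDTree(a).query(b)[0].max())

ref, test = circle(157), circle(31)         # ~1 mm vs ~5 mm point spacing
print(hd100(ref, test))                     # inflated by the coarse sampling
print(hd100(densify(ref), densify(test)))   # close to the true value of 0
```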

Conclusion

While some differences in HD definition and implementation between institutes might be expected, this study highlights the magnitude of the variation. Future work should focus on accuracy, with the aim of developing a public benchmarking dataset that can be used to drive agreement on the definition and implementation of contouring evaluation measures. Because HD100 is more sensitive to outliers than HD95, differences in implementation are amplified in HD100 results; it is therefore advisable to (also) report HD95. When implementing an evaluation pipeline, the definition and implementation of the chosen measures should be considered carefully, and the pipeline should be validated on synthetic data.
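
The outlier sensitivity is also easy to demonstrate on stand-in data (a toy sketch, not the study's data): adding a single stray point to a test point cloud changes HD100 drastically while HD95 barely moves.

```python
import numpy as np
from scipy.spatial import cKDTree

def surface_distances(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Nearest-neighbour distances pooled over both directions; note that
    pooled vs per-direction percentiles is itself a definition choice."""
    return np.concatenate([cKDTree(b).query(a)[0], cKDTree(a).query(b)[0]])

rng = np.random.default_rng(0)
ref = rng.normal(size=(500, 3))                     # stand-in contour points
test = ref + rng.normal(scale=0.05, size=ref.shape)

d = surface_distances(ref, test)
print(d.max(), np.percentile(d, 95))                # HD100 and HD95 agree

test = np.vstack([test, [[10.0, 0.0, 0.0]]])        # one stray outlier point
d = surface_distances(ref, test)
print(d.max(), np.percentile(d, 95))                # HD100 jumps; HD95 barely moves
```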


[1] Yang, Sharp & Gooding. Auto-Segmentation for Radiation Oncology. 2021. https://doi.org/10.1201/9780429323782


[2] TCIA Lung CT Segmentation Challenge 2017. https://wiki.cancerimagingarchive.net/display/Public/Lung+CT+Segmentation+Challenge+2017