ESTRO 2023

Session Item

Saturday

May 13

16:45 - 17:45

Business Suite 3-4

Automation and machine learning

Chair: Dietmar Georg, Austria

Overview: Poster Discussions are presented in one of the sessions scheduled at the two poster discussion theatres. Each author will present a digital poster orally for 2 minutes, followed by 2 minutes for discussion. Sessions will not be recorded.

Session Type: Poster Discussion

Track: Physics

Clinical generalisability of a custom auto-contouring model for Prostate radiotherapy

Marina Khan, United Kingdom

Presentation Number: PD-0330

Abstract

Authors:

Yasmin McQuinlan¹, Teresa Guerrero Urbano², David Eaton², Michael Battye¹, Mark Gooding¹, Marina Khan²

¹Mirada Medical, Science and Research, Oxford, United Kingdom; ²Guy's and St Thomas' NHS Foundation Trust, Radiotherapy, London, United Kingdom

Purpose or Objective

The performance of Artificial Intelligence (AI) based contouring solutions depends on the quality of the data provided, and assessment is often performed using the development set. Within a public healthcare setting, this makes it difficult to understand generalisability beyond a given population. The purpose of the study was to evaluate the generalisability of a clinic-specific AI autocontouring model on an independent test set.

Material and Methods

Computed Tomography (CT) scans from 200 Prostate patients were retrospectively collected from a National Health Service (NHS) Trust. A single observer outlined Prostate, Seminal Vesicles, Rectum, Bladder, Penile Bulb and Femoral Heads on each CT according to consensus guidelines. The contours were peer-reviewed by a Consultant Oncologist specialising in Prostate radiotherapy. The contours used in the training data were compliant with consensus guidelines. The Research Autosegmentation Model (RAM) was trained on 160 of those cases and evaluated on a test set of 20 cases. The outputs of the model were assessed quantitatively using Added Path Length (APL), 2D 95% Hausdorff Distance (HD2D95) and 3D Dice Similarity Coefficient (DSC). A commercial deep learning contouring model (DLC), trained on another population, was evaluated on the RAM test set. The DLC model was developed to comply with consensus guidelines. Both models were then assessed for performance on a third, external dataset, sourced from a United Kingdom (UK) population. This external dataset had reference contours outlined to consensus guidelines. A Wilcoxon signed-rank test was used to determine statistical significance. This paired test was chosen to determine whether the outputs of RAM and DLC, from a single group of shared patients, are significantly different from each other.
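As an illustration of two of the quantities named above, the following is a minimal Python sketch of a 3D Dice Similarity Coefficient and an exact two-sided Wilcoxon signed-rank test on paired per-patient scores. It is not the implementation used in the study: the function names `dice` and `wilcoxon_signed_rank`, the representation of a contour as a set of voxel indices, and the choice of exact enumeration (practical only for small samples) are all assumptions made for this example.

```python
from itertools import product


def dice(mask_a, mask_b):
    """Dice Similarity Coefficient between two collections of voxel indices.

    Assumption: each contour is given as an iterable of (z, y, x) tuples
    marking the voxels inside the structure.
    """
    a, b = set(mask_a), set(mask_b)
    if not a and not b:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * len(a & b) / (len(a) + len(b))


def wilcoxon_signed_rank(x, y):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Returns (W, p) with W = min(W+, W-). Zero differences are dropped and
    tied absolute differences receive average ranks. The null distribution
    is enumerated over all 2**n sign assignments, so keep n small.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0

    # Rank the absolute differences, averaging ranks across ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    total = n * (n + 1) / 2.0
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w = min(w_plus, total - w_plus)

    # Count sign assignments at least as extreme as the observed statistic.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        s = sum(r for r, positive in zip(ranks, signs) if positive)
        if min(s, total - s) <= w:
            extreme += 1
    return w, extreme / 2 ** n


if __name__ == "__main__":
    # Hypothetical toy data, not taken from the study.
    reference = {(0, 0, 0), (0, 0, 1), (0, 1, 0)}
    predicted = {(0, 0, 1), (0, 1, 0), (1, 1, 1)}
    print("DSC:", dice(reference, predicted))

    model_a_scores = [5, 6, 7, 8, 9]
    model_b_scores = [4, 4, 4, 4, 4]
    print("W, p:", wilcoxon_signed_rank(model_a_scores, model_b_scores))
```

In a setup like the one described, each structure would contribute one pair of scores per shared patient (one from each model), and the test would be run per structure and per metric. For larger samples, a library routine with a normal approximation would be the more practical choice.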

Results

As expected, each model performed more favourably on the dataset population from which the model was derived. On the independent UK external dataset, performance was comparable. For DSC, most structures showed no statistically significant difference in performance, except for Prostate (p=0.05). For HD2D95, only Femoral Head Left and Right showed statistically significant differences, with p<0.01 and p<0.05, respectively. For APL, normalised to reference contour length, all structures showed a statistically significant difference with p<0.05, except Seminal Vesicles and Penile Bulb.

Conclusion

As expected, both models perform favourably on data that is reflective of their training population. Each model performed comparably on the external UK dataset. The results suggest that clinical utility can be found in both bespoke and externally developed models. However, to better understand performance and generalisability, independent testing should be recommended for institutions or vendors developing autosegmentation models for radiotherapy. Model evaluation on the test set alone is insufficient to assess performance and generalisability, particularly in a public health setting.