Clinical generalisability of a custom auto-contouring model for Prostate radiotherapy
Marina Khan,
United Kingdom
PD-0330
Abstract
Authors: Yasmin McQuinlan1, Teresa Guerrero Urbano2, David Eaton2, Michael Battye1, Mark Gooding1, Marina Khan2
1Mirada Medical, Science and Research, Oxford, United Kingdom; 2Guy's and St Thomas' NHS Foundation Trust, Radiotherapy, London, United Kingdom
Purpose or Objective
The performance of Artificial Intelligence (AI) based contouring solutions depends on the quality of the data provided, and assessment is often performed using the development set alone. Within a public healthcare setting, this makes it difficult to understand generalisability beyond a given population. The purpose of this study was to evaluate the generalisability of a clinic-specific AI autocontouring model on an independent test set.
Material and Methods
Computed Tomography (CT) scans from 200 Prostate patients were retrospectively collected from a National Health Service (NHS) Trust. A single observer outlined the Prostate, Seminal Vesicles, Rectum, Bladder, Penile Bulb and Femoral Heads on each CT, according to consensus guidelines. The contours were peer-reviewed by a Consultant Oncologist specialising in Prostate radiotherapy. The contours used in the training data were compliant with consensus guidelines. The Research Autosegmentation Model (RAM) was trained on 160 of those cases and evaluated on a test set of 20 cases. The outputs of the model were assessed quantitatively using Added Path Length (APL), 2D 95% Hausdorff Distance (HD2D95) and 3D Dice Similarity Coefficient (DSC). A commercial deep learning contouring model (DLC), trained on another population, was also evaluated on the RAM test set. The DLC model was developed to comply with consensus guidelines. Both models were then assessed on a third, external dataset, sourced from a United Kingdom (UK) population. This external dataset had reference contours outlined to consensus guidelines. A Wilcoxon signed-rank test was used to determine statistical significance. This paired test was chosen to determine whether the outputs of RAM and DLC, generated for the same group of patients, differ significantly from each other.
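As an illustration of the quantitative comparison described above, the following is a minimal sketch of a 3D DSC calculation and an exact two-sided Wilcoxon signed-rank test on paired per-patient scores. It is not the authors' implementation: the function names `dice` and `wilcoxon_signed_rank`, the representation of a contour as a set of voxel indices, and the handling of zero differences (dropped) and ties (average ranks) are all assumptions made for this example. In practice a library routine such as `scipy.stats.wilcoxon` would normally be used.

```python
def dice(mask_a, mask_b):
    """3D Dice Similarity Coefficient between two collections of voxel indices."""
    a, b = set(mask_a), set(mask_b)
    if not a and not b:
        return 1.0  # convention assumed here: two empty masks agree perfectly
    return 2.0 * len(a & b) / (len(a) + len(b))


def wilcoxon_signed_rank(x, y):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Returns (W, p), where W = min(W+, W-). Zero differences are dropped
    and tied absolute differences share their average rank (assumptions).
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0  # no non-zero differences: nothing to test

    # Rank the absolute differences, averaging ranks within ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)

    # Exact null distribution of the rank sum: count, for every achievable
    # sum, how many of the 2**n sign assignments produce it. Ranks are
    # doubled so that averaged (half-integer) ranks become integers.
    total = int(round(2 * sum(ranks)))
    counts = [0] * (total + 1)
    counts[0] = 1
    for r in ranks:
        step = int(round(2 * r))
        for s in range(total, step - 1, -1):
            counts[s] += counts[s - step]

    tail = sum(counts[: int(round(2 * w)) + 1])  # assignments with sum <= W
    p = min(1.0, 2.0 * tail / 2 ** n)
    return w, p
```

Here `x` and `y` would hold one score per patient for each model (for example, the DSC of RAM and of DLC for a given structure), in the same patient order, so that each pair comes from a shared patient as the test requires.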
Results
As expected, each model performed more favourably on the dataset population from which the model was derived. On the independent UK external dataset, performance was comparable. For DSC, most structures showed no statistically significant difference in performance, except for Prostate (p=0.05). For HD2D95, only Femoral Head Left and Right showed a statistically significant difference, with p<0.01 and p<0.05, respectively. For APL, normalised to reference contour length, all structures showed a statistically significant difference with p<0.05, except Seminal Vesicles and Penile Bulb.
Conclusion
As expected, both models perform favourably on data that is reflective of their training population. Each model performed comparably on the external UK dataset. The results suggest that clinical utility can be found in both bespoke and externally developed models. However, to better understand performance and generalisability, independent testing should be recommended for institutions or vendors developing autosegmentation models for radiotherapy. Model evaluation on the test set alone is insufficient to assess performance and generalisability, particularly in a public health setting.