Statistical discrepancies in GTV delineation for H&N cancer across expert centers
MO-0476
Abstract
Authors: Amaury Leroy1,5, Nikos Paragios1, Eric Deutsch2, Vincent Grégoire3, Diana Mitrea4, Adeline Pêtre3, Roger Sun5, Yun Gan Tao4
1Therapanacea, Artificial Intelligence, Paris, France; 2Gustave Roussy, Paris-Saclay University, Inserm 1030, Molecular Radiotherapy and Therapeutic Innovation, Villejuif, France; 3Centre Léon Bérard, Radiation Oncology, Lyon, France; 4Gustave Roussy, Radiation Oncology, Villejuif, France; 5Gustave Roussy, Paris-Saclay University, Inserm 1030, Molecular Radiotherapy and Therapeutic Innovation, Villejuif, France
Purpose or Objective
Accurate delineation of the primary tumor GTV is a decisive
early step in radiotherapy, since it affects dose prescription, overall
treatment toxicity, patient outcome and long-term sequelae. The aim of our work
is to assess variability in GTV definition for H&N cancer through a
statistical study involving two independent centers, each with observers of
different levels of experience. We also examine the benefit of a consensus in
clinical routine and the need to incorporate multimodal imaging to add
biological and functional insight to target volume delineation.
Material and Methods
We assembled a retrospective cohort of 45 patients, for each
of whom a contrast-enhanced CT acquisition and an endoscopy report with
photographic images and clinical data were provided. In each center, junior
and senior radiotherapists independently delineated the GTV following
standardized rules. Initial statistical comparisons (volume, Dice score and
Hausdorff distance) were conducted to assess inter-observer variability both
across centers and across levels of experience. Next, we asked the senior
practitioners to review each patient with a view to reaching a consensus.
Based on their discussion, we updated the statistics: the observers either
agreed on a common target volume or kept their original assessment, thus
confirming the disagreement.
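As an illustration of the three comparison metrics named above, the following is a minimal sketch, not the pipeline used in this study. It assumes each delineation is available as a binary 3D voxel mask on a common grid with known voxel spacing. The function names, the toy masks and the spacing are hypothetical. The Hausdorff distance here is the maximum symmetric distance computed over all mask voxels by brute force; clinical tools often use surface points or a 95th-percentile variant, and the abstract does not state which definition was used.

```python
import numpy as np

def dice_score(mask_a, mask_b):
    """Dice overlap 2|A∩B| / (|A| + |B|) between two binary masks."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    total = a.sum() + b.sum()
    if total == 0:
        return 1.0  # convention chosen here: two empty masks agree perfectly
    return 2.0 * np.logical_and(a, b).sum() / total

def hausdorff_distance_mm(mask_a, mask_b, spacing_mm):
    """Maximum symmetric Hausdorff distance (mm) over all mask voxels.

    Brute force: memory grows with |A| * |B|, so only suitable for small masks.
    """
    spacing = np.asarray(spacing_mm, dtype=float)
    pts_a = np.argwhere(np.asarray(mask_a, dtype=bool)) * spacing
    pts_b = np.argwhere(np.asarray(mask_b, dtype=bool)) * spacing
    if len(pts_a) == 0 or len(pts_b) == 0:
        raise ValueError("Hausdorff distance is undefined for an empty mask")
    # Pairwise Euclidean distances between every voxel of A and every voxel of B
    dists = np.linalg.norm(pts_a[:, None, :] - pts_b[None, :, :], axis=-1)
    a_to_b = dists.min(axis=1).max()  # farthest A voxel from its nearest B voxel
    b_to_a = dists.min(axis=0).max()  # farthest B voxel from its nearest A voxel
    return float(max(a_to_b, b_to_a))

def volume_cm3(mask, spacing_mm):
    """Mask volume in cm³ (voxel count × voxel volume in mm³ / 1000)."""
    voxel_mm3 = float(np.prod(spacing_mm))
    return float(np.asarray(mask, dtype=bool).sum() * voxel_mm3 / 1000.0)

# Toy example (hypothetical data): two 4x4x4 cubes shifted by one voxel
spacing = (1.0, 1.0, 1.0)  # assumed isotropic 1 mm voxels
observer_1 = np.zeros((10, 10, 10), dtype=bool)
observer_2 = np.zeros((10, 10, 10), dtype=bool)
observer_1[2:6, 2:6, 2:6] = True
observer_2[3:7, 2:6, 2:6] = True

toy_dice = dice_score(observer_1, observer_2)
toy_hausdorff = hausdorff_distance_mm(observer_1, observer_2, spacing)
toy_volume = volume_cm3(observer_1, spacing)
print(toy_dice, toy_hausdorff, toy_volume)
```

On real CT-sized masks the pairwise distance matrix would be too large; a distance-transform or surface-based implementation would be the usual choice there.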
Results
Table 1 reports an initial Dice score of 0.68 and a Hausdorff
distance of 12.1 mm between senior observers. This strong disagreement points
to a lack of standardization in treatment. Within the same center, the lower
variability between junior and senior observers (Dice of 0.71 for center A and
0.73 for center B) highlights a bias in routine practice characteristic of
each institution. The main difference between juniors and seniors lies in the
tumor volume, which is larger for juniors (≈31 cm³ against ≈24 cm³ for
seniors), who usually prefer to avoid false negatives. The consensus
discussions led to three main observations. For 33% of patients, one observer
aligned with the other's decision. For 44% of patients, the observers remained
in disagreement, the main explanation being that one center often excluded
peritumoral edema from the GTV. Finally, 23% of patients had similar
delineations that became equal when extended to the CTV. Statistics computed
on the updated volumes gave a new Dice score of 0.78 and a Hausdorff distance
of 7.4 mm. Figure 1 shows a typical example of disagreement.
Conclusion
Significant, deleterious inter-observer variability appears
in GTV delineation. It can be explained by differences in the interpretation
of the endoscopy, in the level of experience, or in working practices specific
to each institution. Agreement improved after consensus, as the discussions
acted as a sanity check and showed a benefit for clinical routine. This study
reinforces the need for multimodal input, such as multiparametric functional
imaging or biopsies, in target volume definition. Moreover, the development
of artificial intelligence solutions for standardization and treatment
automation could also be of great help.