We collected data from 301 HNSCC patients with larynx, pharynx, oral, sinonasal, and salivary gland carcinomas. The dataset included treatment planning CT, PET, and MRI (T1w mDixon and T2w) images, as well as clinical delineations of the primary tumor (GTV-T) and nodal metastases (GTV-N). MRIs were deformably registered to PET/CT. The union of GTV-T and GTV-N was treated as the ground truth (GTV-Clinic) for the DL prediction (GTV-DL).
We trained a 3D UNet for 1000 epochs using five-fold cross-validation. At test time, 50 stochastic samples per patient were drawn from snapshot-saved models of the UNet with Monte Carlo dropout (p = 0.1). The mean of all output softmax probability maps was used to derive the GTV-DL and the uncertainty map.
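The aggregation step can be illustrated with a minimal sketch. It assumes the stochastic samples are already stacked into one array of softmax probabilities, and it uses the predictive entropy of the mean probability map as the uncertainty measure. The entropy is an assumption made for illustration, since the text does not name the measure, and the function name and the random input are hypothetical.

```python
import numpy as np


def aggregate_mc_samples(prob_samples, eps=1e-12):
    """Aggregate stochastic softmax outputs into a segmentation and an uncertainty map.

    prob_samples: array of shape (T, C, ...) holding softmax probabilities
    from T stochastic forward passes over C classes.
    """
    # Mean softmax probability over the T stochastic samples, shape (C, ...).
    mean_prob = prob_samples.mean(axis=0)
    # Segmentation: most probable class per voxel.
    segmentation = mean_prob.argmax(axis=0)
    # ASSUMPTION: predictive entropy of the mean map is used as uncertainty.
    uncertainty = -(mean_prob * np.log(mean_prob + eps)).sum(axis=0)
    return segmentation, uncertainty


# Hypothetical stand-in for 50 Monte Carlo dropout samples of a 2-class model.
rng = np.random.default_rng(0)
logits = rng.normal(size=(50, 2, 4, 4, 4))
exp = np.exp(logits - logits.max(axis=1, keepdims=True))
probs = exp / exp.sum(axis=1, keepdims=True)

seg, unc = aggregate_mc_samples(probs)
```

For a two-class problem the entropy is bounded by log 2, which gives a natural scale when a threshold is later applied to the uncertainty map.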
The uncertainty map is a heatmap of voxel-wise prediction uncertainty. We correlated the geometric location of the uncertainty regions (UR), obtained by thresholding the uncertainty map, with the false predictions of the GTV-DL to locate potential prediction error regions (ER). We used the Dice similarity coefficient (Dice) to quantify the degree of overlap between UR and ER.
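The overlap between UR and ER can be sketched as follows. The sketch assumes that ER is the set of voxels where the prediction and the ground truth disagree (false positives and false negatives). The threshold value, the function names, and the toy arrays are hypothetical and chosen only for illustration.

```python
import numpy as np


def dice(a, b):
    """Dice similarity coefficient between two binary masks."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        # Convention chosen here: two empty masks overlap perfectly.
        return 1.0
    return 2.0 * np.logical_and(a, b).sum() / denom


def uncertainty_error_overlap(uncertainty, pred, truth, threshold):
    """Dice between thresholded uncertainty (UR) and prediction errors (ER)."""
    ur = np.asarray(uncertainty) > threshold
    # ASSUMPTION: ER is every voxel where prediction and ground truth differ.
    er = np.logical_xor(np.asarray(pred, dtype=bool), np.asarray(truth, dtype=bool))
    return dice(ur, er)


# Toy 1D example with a hypothetical threshold of 0.4.
truth = np.array([0, 1, 1, 1, 0, 0])
pred = np.array([0, 0, 1, 1, 1, 0])
uncertainty = np.array([0.1, 0.6, 0.2, 0.1, 0.7, 0.5])

score = uncertainty_error_overlap(uncertainty, pred, truth, threshold=0.4)
print(round(float(score), 3))  # 0.8
```

In the toy example UR covers three voxels, ER covers two, and both ER voxels lie inside UR, so Dice is 2·2 / (3 + 2) = 0.8.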
To detect patient-level segmentation failure, we computed overlap metrics between the UR and the GTV-DL, including False Omission Rate, False Negative Rate, and Surface Dice, and used them to estimate the Dice of the GTV-DL. A Gradient Boosting Regressor was used for the Dice estimation. We evaluated the regression using Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and the Coefficient of Determination (R²).
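The three regression metrics follow their standard definitions, which the sketch below computes for a handful of patients. The Dice values are made up for illustration and are not results from this study. The function name is hypothetical.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return (RMSE, MAE, R2) for estimated versus true values.

    R2 is undefined when y_true is constant, so at least two distinct
    true values are required.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residual = y_true - y_pred
    rmse = float(np.sqrt(np.mean(residual ** 2)))
    mae = float(np.mean(np.abs(residual)))
    ss_res = float(np.sum(residual ** 2))                     # residual sum of squares
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))     # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return rmse, mae, r2


# Hypothetical true and estimated Dice values for four patients.
true_dice = [0.80, 0.60, 0.90, 0.70]
estimated_dice = [0.75, 0.65, 0.85, 0.75]

rmse, mae, r2 = regression_metrics(true_dice, estimated_dice)
```

Here every estimate is off by 0.05, so RMSE and MAE are both 0.05, while R² is 1 − 0.01 / 0.05 = 0.8.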