Heleen Bollen1, Sandra Nuyts1, Siri Willems2, Frederik Maes2
1KU Leuven, Laboratory of Experimental Radiotherapy, Leuven, Belgium; 2KU Leuven, Processing Speech and Images (PSI), Leuven, Belgium
Accurate radiotherapy (RT) of head and neck cancer (HNC) requires precise delineation of target volumes (TVs). Delineation is performed manually using several imaging modalities, e.g. CT and PET. Since delineation is highly dependent on experience and perception, there is growing interest in automating the delineation process. The literature on automated delineation in HNC is largely limited to unimodal networks. The goal of our research was to create a 3D convolutional neural network (CNN) that uses information from multiple modalities to improve segmentation performance compared to unimodal approaches.
The dataset consists of 70 patients with oropharyngeal cancer. For each patient, a planning CT image (pCT), a PET image and manual delineations of the primary (GTVp) and nodal (GTVn) gross tumor volumes, drawn by two radiation oncologists, were available. The PET image was rigidly registered to the pCT using Eclipse (Varian Medical Systems, Palo Alto, CA). A 3D CNN was developed with two separate input pathways, one per modality, so that each pathway can focus on learning patterns specific to that modality. At several points in the network, connecting layers transfer information between the two pathways. At the end of the network, the pathways are concatenated and a final classification layer predicts the segmentation label from the combined features. The performance of this multimodal approach was compared to unimodal approaches (a pCT model and a PET model) using the Dice similarity coefficient (DSC), the mean surface distance (MSD) and the 95% Hausdorff distance (HD95).
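A minimal sketch of such a dual-pathway architecture is given below (PyTorch). The layer counts, channel widths, and the use of 1x1x1 convolutions with element-wise addition for the connecting layers are illustrative assumptions; the abstract does not specify these design details.

import torch
import torch.nn as nn


def conv_block(in_ch, out_ch):
    """Two 3x3x3 convolutions with batch normalization and ReLU."""
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm3d(out_ch),
        nn.ReLU(inplace=True),
        nn.Conv3d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm3d(out_ch),
        nn.ReLU(inplace=True),
    )


class DualPathwayCNN(nn.Module):
    """Two modality-specific pathways (pCT and PET) with connecting layers
    that exchange information, followed by concatenation and a final
    classification layer (background / GTVp / GTVn)."""

    def __init__(self, channels=(16, 32, 64), n_classes=3):
        super().__init__()
        chs = [1] + list(channels)
        self.ct_blocks = nn.ModuleList(
            [conv_block(chs[i], chs[i + 1]) for i in range(len(channels))])
        self.pet_blocks = nn.ModuleList(
            [conv_block(chs[i], chs[i + 1]) for i in range(len(channels))])
        # Connecting layers: 1x1x1 convolutions mapping features from one
        # pathway into the other (added element-wise; an assumed design).
        self.ct_to_pet = nn.ModuleList(
            [nn.Conv3d(c, c, kernel_size=1) for c in channels])
        self.pet_to_ct = nn.ModuleList(
            [nn.Conv3d(c, c, kernel_size=1) for c in channels])
        # Final classifier applied to the concatenated pathway outputs.
        self.classifier = nn.Conv3d(2 * channels[-1], n_classes, kernel_size=1)

    def forward(self, ct, pet):
        for ct_block, pet_block, c2p, p2c in zip(
                self.ct_blocks, self.pet_blocks,
                self.ct_to_pet, self.pet_to_ct):
            ct_feat = ct_block(ct)
            pet_feat = pet_block(pet)
            # Transfer information between the two pathways.
            ct = ct_feat + p2c(pet_feat)
            pet = pet_feat + c2p(ct_feat)
        return self.classifier(torch.cat([ct, pet], dim=1))


# Example: one registered pCT/PET patch of 64^3 voxels.
model = DualPathwayCNN()
ct = torch.randn(1, 1, 64, 64, 64)
pet = torch.randn(1, 1, 64, 64, 64)
logits = model(ct, pet)  # shape: (1, 3, 64, 64, 64)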
The multimodal approach performs best on all metrics for both the GTVp and the GTVn, as shown in Table 1. For the GTVp, the DSC improves from 48.0% (pCT model) and 48.9% (PET model) to 59.1% (pCT+PET model), while the GTVn reaches an average DSC of 62.8% with the multimodal model. Adding PET information reduced the small false-positive regions in the delineation result compared to the pCT model and the PET model. A reduction of the absolute volume difference was observed for both GTVp and GTVn, as shown in Figure 1.

Table 1: 5-fold cross-validation results for the pCT model, the PET model and the multimodal approach.

Figure 1: Absolute volume differences (ml) between manual and automatic delineation for the pCT model, the PET model and the multimodal approach, with GTVp in purple and GTVn in green.
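The abstract does not state how the evaluation metrics were implemented; the sketch below shows one common way to compute DSC, MSD and HD95 from binary masks using NumPy and SciPy. The symmetric pooling of surface distances is an assumption, since exact definitions of MSD and HD95 vary between implementations.

import numpy as np
from scipy import ndimage


def dice(a, b):
    """Dice similarity coefficient between two boolean masks."""
    return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())


def surface_distances(a, b, spacing):
    """Distances from each surface voxel of one mask to the other's surface."""
    # Surface voxels: the mask minus its binary erosion.
    a_surf = a ^ ndimage.binary_erosion(a)
    b_surf = b ^ ndimage.binary_erosion(b)
    # Euclidean distance (in mm, via voxel spacing) to the nearest
    # surface voxel of the other mask.
    dt_b = ndimage.distance_transform_edt(~b_surf, sampling=spacing)
    dt_a = ndimage.distance_transform_edt(~a_surf, sampling=spacing)
    return dt_b[a_surf], dt_a[b_surf]


def msd_hd95(a, b, spacing=(1.0, 1.0, 1.0)):
    """Mean surface distance and 95% Hausdorff distance (symmetric pooling)."""
    d_ab, d_ba = surface_distances(a, b, spacing)
    d = np.concatenate([d_ab, d_ba])
    return d.mean(), np.percentile(d, 95)


# Example with two random boolean volumes standing in for manual and
# automatic delineations of the same structure.
manual = np.zeros((64, 64, 64), dtype=bool)
manual[20:40, 20:40, 20:40] = True
auto = np.zeros_like(manual)
auto[22:42, 22:42, 22:42] = True
print(dice(manual, auto), msd_hd95(manual, auto, spacing=(1.0, 1.0, 3.0)))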
Adding functional PET information improves the overall segmentation result compared to a unimodal network based on pCT input only. Automated segmentation in HNC would enable more advanced RT techniques, e.g. adaptive RT and proton therapy. However, the performance of existing unimodal networks has been insufficient for clinical implementation. Multimodal networks could provide a solution for automated delineation of TVs in HNC. We expect to add MRI as an additional modality to the multimodal CNN by the start of the ESTRO conference.