Tasks and Evaluation

Tasks

There are two tasks in this data challenge. Participants are free to engage with one or both tasks (participation in both is not required).

  • Task 1: Fully-automatic segmentation of tumor volumes (GTVp and GTVn) on pre-RT MRI
  • Task 2: Fully-automatic segmentation of tumor volumes (GTVp and GTVn) on mid-RT MRI

Task 1 (Pre-RT Segmentation) Specific Details

Test data for Task 1 will be composed of unseen pre-RT scans and will not contain any annotations. The goal is to predict the GTVp and GTVn tumor segmentations on new, unseen pre-RT images. This task is analogous to previous conventional tumor segmentation challenges, such as Task 1 of the 2022 HECKTOR Challenge and Task 2 of the 2023 SegRap Challenge.

When training their algorithms for Task 1, participants may use only pre-RT data or incorporate mid-RT data as well. Initially, we planned to limit participants to pre-RT data for training their Task 1 algorithms. On reflection, however, we recognized that in a practical setting, anyone developing an auto-segmentation algorithm could train on any data at their disposal. Based on the current literature, we genuinely don't know what the best solution is! Would incorporating mid-RT data when training a pre-RT segmentation model actually be helpful, or would it merely introduce harmful noise? The answer remains unclear, so we leave this choice to the participants. Remember, though, that during testing you will ONLY have the pre-RT image as an input to your model (naturally, since this is a pre-RT segmentation task and you won't know what a patient's mid-RT data will look like).

Note: For the challenge results, we have developed a baseline model for Task 1 using the nnU-Net framework (3d_fullres configuration) with its default settings. The model was trained using only pre-RT images as input, employing a 5-fold cross-validation ensemble with 1000 training epochs per fold.
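For reference, here is a minimal sketch of how an equivalent nnU-Net v2 baseline could be launched from Python (the dataset ID 500 is hypothetical; nnU-Net's defaults already match the description above, i.e., 1000 epochs per fold):

```python
import subprocess

# Hypothetical dataset ID (500); substitute your own nnU-Net dataset.
DATASET_ID = "500"

# Plan and preprocess once, then train each of the 5 cross-validation folds
# with the 3d_fullres configuration (default: 1000 epochs per fold).
subprocess.run(["nnUNetv2_plan_and_preprocess", "-d", DATASET_ID,
                "--verify_dataset_integrity"], check=True)
for fold in range(5):
    subprocess.run(["nnUNetv2_train", DATASET_ID, "3d_fullres", str(fold)],
                   check=True)
```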

Task 2 (Mid-RT Segmentation) Specific Details

Test data for Task 2 will be composed of an unseen mid-RT image, the corresponding pre-RT image with its segmentation, and a registered pre-RT image with its registered segmentation. In other words, you will only have annotations for the pre-RT image, mimicking a real-world scenario for adaptive RT. The goal is to predict the GTVp and GTVn segmentations on new, unseen mid-RT images. This task is somewhat analogous to previous challenges that use multiple image inputs, such as the 2023 SegRap Challenge (non-contrast CT + contrast CT) and the 2023 HaN-Seg Challenge (CT + MRI).

For training, in addition to the original images, we have also provided a registered pre-RT MRI volume (deformably registered where the mid-RT scan serves as the fixed image and pre-RT scan serves as the moving image) and the corresponding registered pre-RT segmentation for each patient. Details on how registrations were performed and an example are provided on our GitHub. We offer this data for participants who opt not to integrate any image registration techniques into their algorithms but still wish to use the two images as a joint input to their model. Moreover, in a real-world adaptive RT context, such registered scans are typically readily accessible. Naturally, participants are also free to incorporate their own image registration processes into their pipelines if they wish, as they will have access to the original images. 
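To make the setup concrete, below is a minimal SimpleITK sketch of one possible deformable registration (mid-RT fixed, pre-RT moving). This is only an illustration with hypothetical file names; it is not the pipeline used to generate the provided registrations, which is documented on our GitHub.

```python
import SimpleITK as sitk

# Hypothetical file names for one patient.
fixed = sitk.ReadImage("midRT_T2.nii.gz", sitk.sitkFloat32)   # fixed: mid-RT
moving = sitk.ReadImage("preRT_T2.nii.gz", sitk.sitkFloat32)  # moving: pre-RT

# Coarse B-spline transform defined over the fixed-image domain.
tx = sitk.BSplineTransformInitializer(fixed, transformDomainMeshSize=[8, 8, 8])

reg = sitk.ImageRegistrationMethod()
reg.SetMetricAsMattesMutualInformation(numberOfHistogramBins=50)
reg.SetMetricSamplingStrategy(reg.RANDOM)
reg.SetMetricSamplingPercentage(0.1)
reg.SetInterpolator(sitk.sitkLinear)
reg.SetOptimizerAsLBFGSB(gradientConvergenceTolerance=1e-5,
                         numberOfIterations=100)
reg.SetInitialTransform(tx, inPlace=True)
final_tx = reg.Execute(fixed, moving)

# Resample the moving image (linear) and its mask (nearest neighbor, to
# preserve the integer labels) onto the mid-RT grid.
registered_img = sitk.Resample(moving, fixed, final_tx, sitk.sitkLinear, 0.0)
mask = sitk.ReadImage("preRT_mask.nii.gz")
registered_mask = sitk.Resample(mask, fixed, final_tx,
                                sitk.sitkNearestNeighbor, 0)
```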

Participants will be free to use any combination of input images/masks to develop their mid-RT auto-segmentation algorithms. In practice, this means you could do any of the following (a concrete sketch follows the list):

  • Train using only pre-RT images as input (would be kind of odd, but more power to you if it works).
  • Train using only mid-RT images as input (the "conventional" approach).
  • Train using pre-RT images and mid-RT images as individual separate inputs.
  • Train using registered pre-RT and mid-RT images as joint input.
  • Train using registered pre-RT images with pre-RT segmentation and mid-RT images as joint input.
  • Or anything you can think of that leverages the data! 
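As one concrete example of the joint-input options above, the sketch below stacks the mid-RT image, the registered pre-RT image, and the registered pre-RT mask into a single multi-channel input (file names are hypothetical). In nnU-Net, the same idea is expressed by saving each channel with the _0000/_0001/_0002 suffix convention instead.

```python
import numpy as np
import SimpleITK as sitk

# Hypothetical file names for a single training case. All three volumes
# live on the mid-RT grid, so their voxel arrays line up channel-wise.
mid      = sitk.GetArrayFromImage(sitk.ReadImage("case_midRT_T2.nii.gz"))
pre_reg  = sitk.GetArrayFromImage(sitk.ReadImage("case_preRT_T2_registered.nii.gz"))
mask_reg = sitk.GetArrayFromImage(sitk.ReadImage("case_preRT_mask_registered.nii.gz"))

# Stack along a new leading channel axis -> shape (3, Z, Y, X), ready to
# feed a multi-channel 3D segmentation network.
x = np.stack([mid, pre_reg, mask_reg], axis=0).astype(np.float32)
```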

To reiterate, during testing for Task 2, you will have the original pre-RT image, original pre-RT segmentation, registered pre-RT image, registered pre-RT segmentation, and mid-RT image as possible inputs to your model (i.e., everything except the mid-RT mask). You can ignore any of these provided pieces of data (except the mid-RT image of course) if so desired. 

Note: For the challenge results, we have developed a baseline model for Task 2 using the nnU-Net framework (3d_fullres configuration) with its default settings. The model was trained using only mid-RT images as input, employing a 5-fold cross-validation ensemble with 1000 training epochs per fold.

Evaluation Metric

Both tasks will be evaluated in the same general manner using the aggregated Dice Similarity Coefficient (DSCagg). DSCagg was employed by Andrearczyk et al. for the segmentation task of the 2022 edition of the HECKTOR Challenge (doi: 10.1007/978-3-031-27420-6_1).

Specifically, the DSCagg metric is defined as:

$$\mathrm{DSC}_{\mathrm{agg}} = \frac{2\sum_{i} |A_i \cap B_i|}{\sum_{i} \left(|A_i| + |B_i|\right)}$$

where $A_i$ and $B_i$ are the ground truth and predicted segmentations for image $i$, with $i$ spanning the entire test set. DSCagg was initially described in detail in this paper (doi: 10.1109/EMBC48229.2022.9871907).
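As a reference, here is a minimal NumPy sketch of DSCagg following the definition above (the official evaluation code and a worked example are on our GitHub):

```python
import numpy as np

def dsc_agg(gt_masks, pred_masks, label):
    """Aggregated Dice over all test cases for one label (1=GTVp, 2=GTVn)."""
    intersection, denominator = 0.0, 0.0
    for gt, pred in zip(gt_masks, pred_masks):
        a = gt == label
        b = pred == label
        intersection += np.logical_and(a, b).sum()  # sum_i |A_i ∩ B_i|
        denominator += a.sum() + b.sum()            # sum_i (|A_i| + |B_i|)
    return 2.0 * intersection / denominator

# Final ranking metric: average of the per-label aggregated Dice values, e.g.
# dsc_agg_mean = 0.5 * (dsc_agg(gts, preds, 1) + dsc_agg(gts, preds, 2))
```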

Conceptually, the 2022 edition of the HECKTOR Challenge had segmentation outputs (GTVp and GTVn for head and neck cancer patients) similar to those of our proposed challenge, so we deem this an appropriate metric. Since the presence of GTVp and GTVn will not be consistent across all cases, the proposed DSCagg metric is well-suited to this task. Unlike the conventional per-case volumetric DSC, which can be disproportionately affected by a single false-negative or false-positive case (yielding a DSC of 0 when a structure is absent), this metric is designed to accommodate such occurrences more effectively. Notably, a secondary analysis of the HECKTOR 2021 results showed DSCagg to be a stable metric with respect to final ranking (doi: 10.1016/j.media.2023.102972).

The metric will be computed individually for GTVp (DSCagg-GTVp) and GTVn (DSCagg-GTVn), and the average of the two (DSCagg-mean) will be used for the final challenge ranking (similar to HECKTOR 2022). The metric will be calculated separately for Task 1 (pre-RT segmentations) and Task 2 (mid-RT segmentations). We have provided an example of how DSCagg will be calculated for this challenge on our GitHub.

Note: The predicted segmentation masks should have the same size, spacing, origin, and direction as the corresponding input MRI (i.e., the pre-RT image for Task 1, the mid-RT image for Task 2). The evaluation script will throw an error if sitk.GetSpacing() is not equivalent for the prediction and ground truth. Predicted segmentation masks will be resampled to the physical domain of the ground truth mask in the evaluation script. The expected label values are 1 for the predicted GTVp, 2 for GTVn, and 0 for the background.
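For instance, the geometry expectation can be checked, and a prediction brought onto the ground-truth grid, with a few lines of SimpleITK (file names here are hypothetical, and this is an illustration rather than the official evaluation script):

```python
import SimpleITK as sitk

pred = sitk.ReadImage("prediction.nii.gz")
gt   = sitk.ReadImage("ground_truth.nii.gz")

# Spacing must match the input MRI exactly, or evaluation will error out.
assert pred.GetSpacing() == gt.GetSpacing(), "Spacing mismatch with ground truth"

# Nearest-neighbor resampling onto the ground-truth grid preserves the
# integer labels (0 = background, 1 = GTVp, 2 = GTVn).
pred_on_gt = sitk.Resample(pred, gt, sitk.Transform(),
                           sitk.sitkNearestNeighbor, 0, pred.GetPixelID())
```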

Docker Submission

Test data will not be made public, and participants will be required to submit Docker containers of their solutions. More details are provided on the Submission Instructions page.