Procedure to work on dataset

Hi @neheller, I recently decided to submit my work to the KiTS 21 challenge. I downloaded the dataset (300 cases) and have the following questions before I start:

  1. In each case folder (e.g., case_00000), there are 2 directories (raw, segmentations) and 4 other files (3 aggregated segmentations using AND, OR, and MAJ, plus 1 image file). Question: which of these segmentations should I use for training, one specific file or all of them?

  2. Do I need to use the files in the segmentations folder? Why are there three annotations for each instance? I understand that the instances correspond to the two kidneys and that the three annotations come from three different people. My confusion is: how can we combine the labels produced by different annotators?

  3. For the experiment, do I need to do anything with the files under the raw folder?

  4. As per the requirements of this challenge, we need to segment kidney, tumor, and cyst. However, the segmentations directory only seems to contain files for kidney and tumor (e.g., kidney_instance-1_annotation-1.nii.gz, tumor_instance-1_annotation-3.nii.gz). What about cyst?

I am not familiar with this dataset, so my questions may be basic. I would appreciate a response soon. Thanks in advance.

Hi @sandeep,

  1. There are multiple approaches that you could take, but just using aggregated_MAJ_seg.nii.gz is a simple approach that will have you training on data that is representative of the segmentations that you will be evaluated on. Another reasonable approach would be to use what I describe in 2).

  2. kits21/annotation/ can be used to generate a collection of aggregated segmentations for each case that you could randomly sample during training.

  3. The raw files can be safely ignored. These are an intermediate result in the pipeline from human annotation → segmentation. They’re public just for the sake of transparency.

  4. Cyst data is in the segmentations folder too, but only for cases that have at least one instance of a cyst (e.g., case_00021). Many of the cases do not have cysts, so they won’t have any segmentation files for cyst instances.
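
To see which cases actually contain cysts, something like the following minimal sketch could be used. It assumes that cyst files follow the same naming pattern as the kidney and tumor files (cyst_instance-*_annotation-*.nii.gz), which is an assumption made for illustration, and the helper name cases_with_cysts is hypothetical.

```python
from pathlib import Path


def cases_with_cysts(data_dir):
    """Return sorted case IDs whose segmentations folder holds at least one cyst file."""
    found = []
    for case_dir in sorted(Path(data_dir).glob("case_*")):
        seg_dir = case_dir / "segmentations"
        if not seg_dir.is_dir():
            continue
        # Assumed naming pattern, mirroring the kidney/tumor file names
        if any(seg_dir.glob("cyst_instance-*_annotation-*.nii.gz")):
            found.append(case_dir.name)
    return found
```

Calling cases_with_cysts("/path/to/kits21/kits21/data") would then list only the cases that have cyst annotations; the path is a placeholder.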

Please let me know if any of this is unclear.


Thank you very much, @neheller, for your quick response. Most of the points are clear now, but I still have some confusion:

  1. I am going to start with the aggregated_MAJ_seg.nii.gz label file for training. Is that fine?
    Each case is in its own folder. Is there any code available to copy all the imaging.nii.gz files (from the different folders) into a single folder (and the same for the label files)? Is that what you explained in point 2?

  2. If I use aggregated_MAJ_seg.nii.gz as the label file, then there is no need to handle kidney, tumor, and cyst separately, because this file already contains all of them. Right?

I hope my questions are clear.

Hi @sandeep,

  1. Yes, I think that should be just fine. We haven’t released any code to copy everything to a single folder but it should be straightforward. Do you want me to send you a snip that would accomplish this?

  2. You are correct. Every label should be represented in the aggregated files.


Thanks again. If possible, please share a snip (it will save me time). :blush:

Sure! Here’s a python3 approach:

from pathlib import Path
from shutil import copy
from tqdm import tqdm

# TODO replace these with your paths
SRC = Path("/path/to/kits21").resolve(strict=True)
DST = Path("/path/to/destination")

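def main():
  # The destination's parent folder must already exist; the folder itself may not
  dst_d = DST.parent.resolve(strict=True) / DST.name

  # Create the destination folder if it is not there yet
  if not dst_d.exists():
    dst_d.mkdir()

  for i in tqdm(range(300)):
    cid = "case_{:05d}".format(i)
    src_img = SRC / "kits21" / "data" / cid / "imaging.nii.gz"
    src_seg = SRC / "kits21" / "data" / cid / "aggregated_MAJ_seg.nii.gz"

    # Fail early if a case is missing its image or its label
    assert src_img.exists() and src_seg.exists()

    copy(src_img, str(dst_d / "img__{}.nii.gz".format(cid)))
    copy(src_seg, str(dst_d / "seg__{}.nii.gz".format(cid)))

if __name__ == "__main__":
  main()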
Just be sure to change SRC and DST to the appropriate paths on your filesystem.
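
After copying, it may be worth checking that every image has a matching label. The following is a minimal sketch that relies only on the img__ and seg__ file name prefixes used above; the helper name unpaired_cases is hypothetical.

```python
from pathlib import Path


def unpaired_cases(dst_dir):
    """Return case IDs that have an image or a label in dst_dir, but not both."""
    dst = Path(dst_dir)
    suffix = ".nii.gz"
    # Strip the prefix and the suffix to recover the case ID from each file name
    imgs = {p.name[len("img__"):-len(suffix)] for p in dst.glob("img__*" + suffix)}
    segs = {p.name[len("seg__"):-len(suffix)] for p in dst.glob("seg__*" + suffix)}
    # Symmetric difference: IDs present on one side only
    return sorted(imgs ^ segs)
```

An empty list means that every copied image has a label and vice versa.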


Hi @neheller, could you please explain the following line in the How to Win section:

“These will be computed for every HEC of every case of the test set on a random sample of aggregated segmentations created using the logic implemented in [/kits21/annotation/].”

As discussed earlier, I am using the aggregated_MAJ_seg.nii.gz label file for training. However, the line above says that every case of the test set will be evaluated on a random sample of aggregated segmentations. My confusion is: does aggregated_MAJ_seg.nii.gz follow the same approach as described above, or will I need to follow a different approach for the test set later, such as the one in [/kits21/annotation/]? Of course, I plan to use aggregated_MAJ_seg.nii.gz for all the test cases later.

Yes, they use the same approach to producing a composite segmentation. The only difference is that the “MAJ” aggregation uses multiple annotations per instance (aggregated with majority voting), whereas the sampling approach uses just one annotation per instance at a time.

Does that make sense?
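
To make the difference concrete, here is a toy sketch of the two ideas. It is not the code in kits21/annotation/. It works on file names and flat lists of 0/1 values instead of real volumes, the function names are hypothetical, and treating cyst files as following the same naming pattern as kidney and tumor files is an assumption.

```python
import random
import re
from collections import defaultdict

# Matches names such as "tumor_instance-1_annotation-3.nii.gz"
# (the "cyst" prefix is assumed by analogy with kidney and tumor)
NAME_RE = re.compile(r"^(kidney|tumor|cyst)_instance-(\d+)_annotation-(\d+)\.nii\.gz$")


def sample_one_per_instance(filenames, rng=random):
    """Sampling approach: pick one annotation file per (region, instance) pair."""
    groups = defaultdict(list)
    for name in filenames:
        m = NAME_RE.match(name)
        if m:
            groups[(m.group(1), int(m.group(2)))].append(name)
    return {key: rng.choice(sorted(names)) for key, names in sorted(groups.items())}


def majority_vote(masks):
    """MAJ-style aggregation: a voxel is 1 if more than half of the masks mark it."""
    n = len(masks)
    return [1 if 2 * sum(votes) > n else 0 for votes in zip(*masks)]
```

With three annotations of one tumor instance, sample_one_per_instance keeps a single file for that instance, while majority_vote merges all three masks into one.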