About tumor evaluation


#1

In evaluation.py, the tumor Dice is computed as follows. But if there is no tumor and the prediction is also empty, this code returns 0 instead of 1. Is the final evaluation code the same as this code, or does it assume that a tumor exists in every case?

    try:
        # Compute tumor Dice
        tu_pd = np.greater(predictions, 1)
        tu_gt = np.greater(gt, 1)
        tu_dice = 2*np.logical_and(tu_pd, tu_gt).sum()/(
            tu_pd.sum() + tu_gt.sum()
        )
    except ZeroDivisionError:
        return tk_dice, 0.0

#2

Yes, I also think so.


#3

Good find. What Dice to award if the ground truth is empty is an interesting problem. Two cases occur:

  1. the prediction is not empty (so there are false positives). This is definitely a Dice score of 0 (as also follows from the definition of Dice).
  2. both ground truth and prediction are empty. This case is undefined, but ideally there should be a reward for algorithms that correctly predict an empty segmentation mask.

In BraTS, case 2) is awarded a Dice score of 1. I am not sure this is the correct way to do it, but in my opinion it is better than what is happening here, because here we don’t distinguish between 1) and 2) (both get a Dice of 0) and thus we don’t distinguish between a bad (1) and a good (2) segmentation.
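For concreteness, the BraTS-style convention could be sketched like this (the function name `dice_brats_style` is illustrative, and binary masks are assumed; this is not the official evaluation code of either challenge):

```python
import numpy as np

def dice_brats_style(pred, gt):
    """Dice with the BraTS convention: empty prediction + empty GT scores 1."""
    p = pred.astype(bool)
    g = gt.astype(bool)
    if not p.any() and not g.any():
        return 1.0  # case 2: both masks empty -> rewarded with Dice 1
    # case 1 falls through naturally: FPs on an empty GT give Dice 0
    return 2 * np.logical_and(p, g).sum() / (p.sum() + g.sum())
```

With this convention, a prediction of `[1, 0]` against a ground truth of `[0, 0]` scores 0, while an all-empty prediction against an all-empty ground truth scores 1, so cases 1) and 2) are distinguished.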

In the end, it all comes down to whether there is tumor in every case in the test set or not.


#4

Hi, Isensee. Do you know how long it will take to get an official reply if I want to view the approximate score of my submission?


#5

So the problem is whether a tumor exists in every case? Are there any measures to deal with this?


#6

@Jianhong_Cheng neheller typically responds very quickly.

> So the problem is whether a tumor exists in every case? Are there any measures to deal with this?

I don’t think this is relevant. There may be a tumor in all test cases, but there may also be cases without a tumor. Our algorithms should be able to deal with both. The question is how the Dice is computed if neither the prediction nor the ground truth contains tumor. If you set the Dice to 1 in that case, then you are dealing with it effectively. Dice is the only metric that is used for evaluation.

Another idea that just came up is the following:
(using the 2 cases from above)

  1. for this case leave dice at 0, as before
  2. for this case set the Dice to np.nan. If you then aggregate the Dice scores with np.nanmean, this case will be excluded from the mean. I think this is the best way to deal with this issue. Setting the Dice to 1 like I mentioned above is kind of arbitrary and encourages bad behavior (exactly what we did in our BraTS 2018 submission; it works for the challenge but is not what you want in real life).
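A minimal sketch of this nan-based scheme (the function name `tumor_dice` is hypothetical, and the label > 1 binarization is borrowed from the snippet in #1; this is not the official evaluation code):

```python
import numpy as np

def tumor_dice(predictions, gt):
    """Per-case tumor Dice; illustrative sketch only."""
    tu_pd = np.greater(predictions, 1)  # tumor is label > 1, as in #1
    tu_gt = np.greater(gt, 1)
    if not tu_gt.any() and tu_pd.any():
        return 0.0     # case 1: false positives on an empty GT
    if not tu_gt.any() and not tu_pd.any():
        return np.nan  # case 2: undefined, excluded from the mean
    return 2 * np.logical_and(tu_pd, tu_gt).sum() / (tu_pd.sum() + tu_gt.sum())

# Aggregation: np.nanmean drops the undefined cases.
scores = [
    tumor_dice(np.array([2, 2]), np.array([2, 2])),  # perfect -> 1.0
    tumor_dice(np.array([2, 0]), np.array([0, 0])),  # FP on empty GT -> 0.0
    tumor_dice(np.array([0, 0]), np.array([0, 0])),  # both empty -> nan
]
mean_dice = np.nanmean(scores)  # -> 0.5, mean over defined cases only
```

This way an empty-vs-empty case neither rewards nor punishes the algorithm; it simply doesn't count.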

Best,
Fabian


#7

@Jianhong_Cheng I typically process submissions three times per day – morning, late afternoon, and late evening in central daylight time. If I processed submissions as they came in at night I would never get any sleep :slight_smile:

@koncle this is a great point, and I agree with @FabianIsensee that assigning a Dice of 0 in these cases is not desired behavior. However, for this cohort, having a renal tumor is part of the inclusion criteria (as stated in our manuscript) so the situation will never arise.

We debated about including some healthy controls in both the training and test sets, but we figured that since most patients have a healthy kidney contralaterally, the models would have to be able to recognize those situations regardless. Perhaps we’ll revisit this for next year.

This is really a question of study precision and recall vs world PPV and NPV, or at least some analog of those for segmentation. Since tumor prevalence is so much higher in the study population than the real world, you would expect a world PPV << study precision. If you were ever trying to apply such a model in the clinic, you would need to account for this either by making your clinical population more similar to your study population (e.g. only run the model for patients suspected of having a kidney tumor) and/or by incentivizing high-precision models by optimizing for something like a Tversky index.
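As a rough illustration of that last point: the Tversky index generalizes Dice with separate penalties for false positives and false negatives, so choosing alpha > beta rewards high-precision models (the weights below are illustrative, not values used by the challenge):

```python
import numpy as np

def tversky_index(pred, gt, alpha=0.7, beta=0.3):
    # TP / (TP + alpha*FP + beta*FN).
    # alpha = beta = 0.5 recovers Dice; alpha > beta penalizes
    # false positives more, i.e. favors precision.
    p = pred.astype(bool)
    g = gt.astype(bool)
    tp = np.logical_and(p, g).sum()
    fp = np.logical_and(p, ~g).sum()
    fn = np.logical_and(~p, g).sum()
    return tp / (tp + alpha * fp + beta * fn)
```

For example, a prediction `[1, 1]` against ground truth `[1, 0]` gets Dice 2/3 ≈ 0.667, but with alpha = 0.7 the Tversky index is 1/1.7 ≈ 0.588, a harsher score for the same false positive.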