Results are out!


The Link

Statistical Significance

We used bootstrap sampling to estimate the statistical significance of each pair of rankings. The blue/gray bars on the right side extend from each submission down to the lowest submission that was not statistically significantly different from it (p < 0.05). When you hover over a particular submission, all submissions that were not statistically significantly different stay sharp while the rest are shown at reduced opacity. My apologies if the UI is clunky, I built it in a hurry. Please let me know if you find bugs.
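The post doesn't spell out the exact procedure, but a minimal sketch of one common way to bootstrap a pairwise significance test is below. The function name, case counts, and per-case scores are all hypothetical; the idea is just to resample test cases with replacement and see how often one submission fails to beat the other.

```python
import numpy as np

def bootstrap_p_value(scores_a, scores_b, n_boot=10000, seed=0):
    """Estimate the probability that submission A's mean score does not
    exceed submission B's, by bootstrap resampling of per-case differences.

    scores_a, scores_b: per-case metric values (e.g. composite Dice),
    aligned so index i refers to the same test case in both lists.
    """
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a) - np.asarray(scores_b)
    n = len(diffs)
    # Resample cases with replacement; count resamples where the mean
    # difference is <= 0, i.e. where A does not outperform B.
    samples = rng.choice(diffs, size=(n_boot, n), replace=True)
    return float(np.mean(samples.mean(axis=1) <= 0.0))

# Hypothetical per-case composite Dice for two submissions on 5 cases.
a = [0.91, 0.88, 0.95, 0.90, 0.93]
b = [0.85, 0.80, 0.90, 0.82, 0.88]
p = bootstrap_p_value(a, b)  # small p: A is significantly better here
```

A pair would then count as "not significantly different" when this p-value is at or above the 0.05 threshold mentioned above.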

Is the problem solved?

Theoretically these scores are bounded by the label noise of the test set. While you all were doing inference, we went back and re-annotated 30 randomly selected cases from the training set (we chose training cases so that we could release them, which we'll do soon), and our inter-annotator agreement was as follows:

Kidney Dice: 0.983
Tumor Dice: 0.923
Composite Dice: 0.953

So the winning teams (composite Dice 0.90-0.91) did get pretty close! But it looks like there's still a ways to go, especially on the tumor. I think next year we're going to have to come up with a more difficult segmentation problem to keep you all from getting bored… :slight_smile:
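For reference, the per-class scores above can be computed with the standard Dice overlap between binary masks; a sketch is below (the toy masks are hypothetical, and treating the composite as the mean of the two class scores is an assumption that is consistent with the numbers reported above, since (0.983 + 0.923) / 2 = 0.953).

```python
import numpy as np

def dice(pred, gt):
    """Dice coefficient between two binary masks (1 = foreground)."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    inter = np.logical_and(pred, gt).sum()
    denom = pred.sum() + gt.sum()
    # Convention: two empty masks agree perfectly.
    return 1.0 if denom == 0 else 2.0 * inter / denom

# Toy 1-D "masks" for one hypothetical case.
kidney_dice = dice([1, 1, 1, 0, 0], [1, 1, 0, 0, 0])
tumor_dice = dice([0, 0, 0, 1, 1], [0, 0, 1, 1, 1])
composite = (kidney_dice + tumor_dice) / 2
```

In practice each mask would be a 3-D volume and scores would be averaged over all test cases, but the per-case arithmetic is the same.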


Hi neheller, when will submissions reopen?


Thanks for your efforts. I really enjoy this challenge.
For post-challenge submissions, I recommend that each one be accompanied by a publication on arXiv or another official preprint server, to discourage training on the test set, which would make the leaderboard less meaningful.


BTW, how long does it take the radiologists to annotate a test case, on average?


@JML, sorry for the delay. Submissions have been reopened for the open leaderboard.

@junma, I agree that that would be ideal, but unfortunately we don’t have the resources to enforce it moving forward. On average, annotation was taking us roughly an hour and a half per case.