We used bootstrap sampling to estimate the statistical significance of each pair of rankings. The blue/gray bars on the right side extend from each submission down to the lowest-ranked submission that was not statistically significantly different from it (p < 0.05). When you hover over a particular submission, all submissions that were not statistically significantly different from it stay sharp, while the rest are shown with reduced opacity. My apologies if the UI is clunky; I built it in a hurry. Please let me know if you find bugs.
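For anyone curious what that bootstrap looks like in practice, here is a minimal sketch. It assumes you have aligned per-case metric values (e.g. composite Dice) for two submissions; the function name and the choice of one-sided "ranking flip" frequency as the p-value estimate are my illustration, not necessarily the exact procedure used for the leaderboard:

```python
import numpy as np

def bootstrap_rank_p_value(scores_a, scores_b, n_boot=10_000, seed=0):
    """Estimate how often the observed ranking between two submissions flips.

    scores_a, scores_b: per-case metric values, aligned by test case.
    Returns the fraction of bootstrap resamples (cases drawn with
    replacement) in which the lower-ranked submission's mean beats the
    higher-ranked one's -- a simple one-sided p-value estimate.
    """
    rng = np.random.default_rng(seed)
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    n = len(a)
    # Resample the same case indices for both submissions (paired bootstrap).
    idx = rng.integers(0, n, size=(n_boot, n))
    mean_a = a[idx].mean(axis=1)
    mean_b = b[idx].mean(axis=1)
    if a.mean() >= b.mean():
        flips = mean_b > mean_a
    else:
        flips = mean_a > mean_b
    return flips.mean()
```

A pair of submissions would then be called "not statistically significantly different" when this value exceeds 0.05.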
Is the problem solved?
Theoretically, these scores are bounded by the label noise of the test set. While you all were doing inference, we went back and re-annotated 30 randomly selected cases from the training set (that way we can release them; we'll do that soon), and our inter-annotator agreement was as follows:
Kidney Dice: 0.983
Tumor Dice: 0.923
Composite Dice: 0.953
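For reference, the numbers above are consistent with the composite being the simple mean of the two per-class Dice scores, since (0.983 + 0.923) / 2 = 0.953. A minimal sketch of that computation (function names are mine; the both-empty convention in `dice` is an assumption):

```python
import numpy as np

def dice(pred, gt):
    """Dice coefficient between two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    denom = pred.sum() + gt.sum()
    if denom == 0:
        # Assumption: two empty masks count as perfect agreement.
        return 1.0
    return 2.0 * np.logical_and(pred, gt).sum() / denom

def composite_dice(kidney_dice, tumor_dice):
    """Average of the kidney and tumor Dice scores."""
    return (kidney_dice + tumor_dice) / 2.0
```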
So the winning teams (composite Dice 0.90-0.91) did get pretty close! But it looks like there's still a ways to go, especially on the tumor. I think next year we're going to have to come up with a more difficult segmentation problem to keep you all from getting bored…