Reproducing the paper results
Dear authors,
Thank you for your exciting work and very clean code. I am having trouble reproducing the results mentioned in the paper and would appreciate it if you could help me.
1. Reproducing the UNO results from Table 4. I was trying to get the scores on the samples of the novel classes from the test split (Table 4 in the paper).
I have executed the commands for CIFAR10, CIFAR80-20, and CIFAR50-50, and used Wandb for logging. However, the results on all datasets did not match the ones reported in the paper. I took the results from `incremental/unlabel/test/acc`.
| Dataset | Paper (avg/best) | Reproduced (avg/best) |
|---|---|---|
| CIFAR10 | 93.3 / 93.3 | 90.8 / 90.8 |
| CIFAR80-20 | 72.7 / 73.1 | 65.3 / 65.3 |
| CIFAR50-50 | 50.6 / 50.7 | 44.9 / 45.7 |
Potential issues:
- I am not using the exact package versions mentioned in your README. For that reason, I ran the CIFAR80-20 experiment twice with a manually set seed (as in the RankStats repo) and obtained very similar results, so I would not expect a ~7% difference on CIFAR80-20 to come just from the package versions (see the seeding sketch after this list).
- I may be using the wrong metric from Wandb (I have used `incremental/unlabel/test/acc`). However, if you check my screenshot, for CIFAR80-20 all the other metrics are significantly different anyway (a value close to 72.7/73.1 does not appear anywhere).
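For completeness, this is roughly how I fixed the seed (a minimal plain-PyTorch sketch, not the exact code from the RankStats repo; the `set_seed` helper is just illustrative):

```python
import random

import numpy as np
import torch


def set_seed(seed: int = 0) -> None:
    # Seed the Python, NumPy, and PyTorch RNGs.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Trade speed for determinism in cuDNN kernels.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


set_seed(0)
```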
2. How exactly was the RankStats algorithm evaluated on CIFAR50-50?
Could you please share whether you performed any hyperparameter tuning for the CIFAR50-50 dataset when running the RankStats algorithm on it? I ran multiple experiments, but my training was very unstable and the algorithm always ended up scoring ~20/17 on known/novel classes.
Thanks a lot for your time.
Top GitHub Comments
Ok, now I understand. You are right, this is a potential problem. However, the assignments are quite stable (they are computed on the whole validation set) and, as you said, the potential issue never happens in practice. I remember I once tried removing the unwanted assignments (the ones that contradicted the labeled head), but the results were exactly the same and the code was more complicated, so I just removed it. Also, if I remember correctly, Ranking Statistics uses the same evaluation procedure, so I just stuck to that.
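For reference, the assignment I am talking about is the standard Hungarian matching used for clustering accuracy; a minimal sketch (using `scipy.optimize.linear_sum_assignment`, not the exact metric code in the repository) looks like this:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def cluster_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Find the cluster-to-class mapping that maximizes accuracy over the whole set."""
    n_classes = int(max(y_true.max(), y_pred.max())) + 1
    # Count matrix between predicted clusters (rows) and true classes (columns).
    count = np.zeros((n_classes, n_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        count[p, t] += 1
    # Pick the assignment that maximizes the total number of correctly mapped samples.
    row_ind, col_ind = linear_sum_assignment(count, maximize=True)
    return count[row_ind, col_ind].sum() / y_true.size
```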
Happy to help! I added a note in the README that warns about package versions.
Regarding the evaluation, I think the procedure I am following is correct because I first concatenate the logits (`preds_inc`) and then take the max of those concatenated logits. By doing this I lose the information about the task. Then, in the `compute()` method of the metric class, I compute the best mapping on all classes (and not separately).
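In other words, the evaluation roughly looks like the following sketch (variable names and shapes are illustrative, not necessarily those used in the repository; `cluster_accuracy` is the Hungarian-matching helper sketched above):

```python
import torch

# Dummy shapes for illustration only (e.g. CIFAR80-20-like split).
N, num_lab, num_unlab = 8, 80, 20
logits_lab = torch.randn(N, num_lab)      # output of the labeled head
logits_unlab = torch.randn(N, num_unlab)  # output of the unlabeled head
targets = torch.randint(0, num_lab + num_unlab, (N,))

# Concatenate the logits of both heads: the argmax is taken over all classes,
# so the task identity (labeled vs. novel) is deliberately discarded.
preds_inc = torch.cat([logits_lab, logits_unlab], dim=-1)  # [N, num_lab + num_unlab]
pred_classes = preds_inc.argmax(dim=-1)

# In compute(), the best cluster-to-class mapping is then found over *all*
# classes jointly (not separately per head), e.g. via Hungarian matching.
acc = cluster_accuracy(targets.numpy(), pred_classes.numpy())
```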