Reproducibility Issue
I have run your code 5 times in the environment below.
- Two V100 GPUs
- Python 3.6.7
- PyTorch 1.0.0
- CUDA 9.0
The command I used is:
python train.py \
--net_type pyramidnet \
--dataset cifar100 \
--depth 200 \
--alpha 240 \
--batch_size 64 \
--lr 0.25 \
--expname PyraNet200 \
--epochs 300 \
--beta 1.0 \
--cutmix_prob 0.5 \
--no-verbose
For the baseline, I set cutmix_prob=0.0 so that CutMix is not used.
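For reference, here is a minimal sketch of how I understand cutmix_prob to gate the augmentation on each batch (my paraphrase of the train.py logic; the rand_bbox helper and the cutmix_step wrapper are my own names, not necessarily the exact repo code):

```python
import numpy as np
import torch


def rand_bbox(size, lam):
    # Sample a random box covering roughly (1 - lam) of the image area.
    W, H = size[3], size[2]
    cut_rat = np.sqrt(1. - lam)
    cut_w, cut_h = int(W * cut_rat), int(H * cut_rat)
    cx, cy = np.random.randint(W), np.random.randint(H)
    x1, x2 = np.clip(cx - cut_w // 2, 0, W), np.clip(cx + cut_w // 2, 0, W)
    y1, y2 = np.clip(cy - cut_h // 2, 0, H), np.clip(cy + cut_h // 2, 0, H)
    return x1, y1, x2, y2


def cutmix_step(model, criterion, inputs, targets, beta=1.0, cutmix_prob=0.5):
    # With probability cutmix_prob, paste a patch from a shuffled batch and mix
    # the labels; with cutmix_prob=0.0 this reduces to plain training (baseline).
    if beta > 0 and np.random.rand() < cutmix_prob:
        lam = np.random.beta(beta, beta)
        perm = torch.randperm(inputs.size(0), device=inputs.device)
        x1, y1, x2, y2 = rand_bbox(inputs.size(), lam)
        inputs[:, :, y1:y2, x1:x2] = inputs[perm, :, y1:y2, x1:x2]
        # Adjust lambda to the exact area ratio of the pasted patch.
        lam = 1 - ((x2 - x1) * (y2 - y1) / (inputs.size(-1) * inputs.size(-2)))
        outputs = model(inputs)
        loss = criterion(outputs, targets) * lam \
             + criterion(outputs, targets[perm]) * (1 - lam)
    else:
        outputs = model(inputs)
        loss = criterion(outputs, targets)
    return outputs, loss
```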
| Model & Augmentations | try1 | try2 | try3 | try4 | try5 | Average – | – | – | – | – | – | – | – cutmix p=0.0 | Pyramid200(Converged) | 17.14 | 16.32 | 16.15 | 16.29 | 16.61 | 16.502 | Pyramid200(Best) | 17.01 | 16.02 | 16.01 | 16.17 | 16.35 | 16.312 cutmix p=0.5 | CutMix(Converged) | 16.27 | 15.55 | 16.18 | 16.19 | 15.38 | 15.914 | CutMix(Best) | 15.29 | 14.66 | 15.28 | 15.04 | 14.52 | 14.958
The baseline reaches a top-1 result similar to the one reported in your paper (16.45), but with CutMix (p=0.5) the result is somewhat worse than the reported value (14.23).
Also, I conducted experiments with ShakeDrop (after bringing the ShakeDrop regularization code from https://github.com/owruby/shake-drop_pytorch).
| | | try1 | try2 | try3 | try4 | try5 | Average |
| --- | --- | --- | --- | --- | --- | --- | --- |
| cutmix p=0.5 | ShakeDrop+CutMix (Converged) | 14.06 | 14 | 14.16 | 13.86 | 14 | 14.016 |
| | ShakeDrop+CutMix (Best) | 13.67 | 13.81 | 13.8 | 13.69 | 13.62 | 13.718 |
As you can see, the top-1 result claimed in the paper can be reached only by taking the maximum top-1 validation accuracy during training, not the converged top-1 validation accuracy after training.
So, here are my questions.
- How can I reproduce your result? In particular, with your provided code and sample commands, I should be able to reproduce the reported 14.23% top-1 result with PyramidNet+CutMix. It would be great if you could share the specific environment and command to reproduce the result, or if this report helps you find a problem in this repo.
- Did you report the 'last validation accuracy' after training, or the 'best (peak) validation accuracy' during training? I saw some code tracking the best validation accuracy during training and printing it before terminating, so I assume you used the best (peak) validation accuracy. (See the sketch below for what I mean by the two numbers.)
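To be precise about what I am comparing, this is a minimal sketch of 'best (peak)' vs 'last (converged)' accuracy; train_one_epoch and validate are hypothetical placeholders, not code from this repo:

```python
import torch


def train_and_report(model, train_loader, val_loader, optimizer, epochs,
                     train_one_epoch, validate):
    # train_one_epoch / validate are hypothetical callables, not repo code.
    best_acc, last_acc = 0.0, 0.0
    for epoch in range(epochs):
        train_one_epoch(model, train_loader, optimizer)
        last_acc = validate(model, val_loader)   # accuracy of the current model
        if last_acc > best_acc:
            best_acc = last_acc                  # peak accuracy over the whole run
            torch.save(model.state_dict(), 'best.pth')
    # "best" is the peak validation accuracy; "last" is the converged accuracy.
    return best_acc, last_acc
```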
Thanks. I look forward to hearing from you.
@ildoonet Thank you for your reply. I understand your concerns, but I do not agree that reporting the best performance is cheating. As I said, the best model can reasonably be treated as representing the performance of the method. The difference between the best and the last model comes from the step-decay learning rate schedule. In our case, using a cosine learning rate on CIFAR-100, the best and last models are almost identical (within ±0.1% accuracy), as sketched below. All the experiments we re-implemented were conducted in the same setting, and the best model was selected for every other method as well, so there are no cheating or fair-comparison issues. Our best model's performance is not an instantaneous peak value, because we ran each experiment several times and report the mean of the best performances.
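For instance, a cosine schedule can be set up in PyTorch like this (a minimal sketch; the placeholder model and the SGD hyperparameters here are assumptions, not necessarily the exact settings we used):

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Linear(32 * 32 * 3, 100)  # placeholder for PyramidNet-200

optimizer = torch.optim.SGD(model.parameters(), lr=0.25,
                            momentum=0.9, weight_decay=1e-4, nesterov=True)

# Cosine annealing drives the LR smoothly towards zero, so the weights barely
# move during the final epochs and the best and last checkpoints nearly coincide.
# (A step-decay schedule keeps the LR relatively high until the last drop, which
# is where a best-vs-last gap tends to appear.)
scheduler = CosineAnnealingLR(optimizer, T_max=300)

for epoch in range(300):
    # ... train and validate one epoch ...
    scheduler.step()
```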
‘Cheating’ is a rather harsh word. However, comparing peak values does tend to favor methods whose validation accuracy oscillates or is otherwise unstable.