Unable to reproduce classification accuracy using the reference scripts
🐛 Bug
I have been trying to reproduce the reported 79.312% accuracy on ImageNet of resnext101_32x8d
using the reference scripts, but I could obtain only 75%-76%. I tried two different training runs on 64 GPUs:
- 16 nodes of 4 V100 GPUs
- 8 nodes of 8 V100 GPUs
but obtained similar results.
To Reproduce
Clone the master branch of torchvision, then cd vision/references/classification and submit a training to 64 GPUs with arguments --model resnext101_32x8d --epochs 100.
The training logs (including std logs) are attached for your information: log.txt and resnext101_32x8d_reproduced.log
Expected behavior
Final top-1 accuracy should be around 79%.
Environment
- PyTorch / torchvision Version (e.g., 1.0 / 0.4.0): 1.8.1
- OS (e.g., Linux): Linux
- How you installed PyTorch / torchvision (conda, pip, source): pip
- Python version: 3.8
- CUDA/cuDNN version: 10.2
- GPU models and configuration: V100
cc @vfdev-5
Issue Analytics
- State:
- Created 2 years ago
- Comments:16 (16 by maintainers)
Top GitHub Comments
I was able to reproduce the results:
Here is my output log: resnext101_32x8d_logs.txt
Comparing the logs shared above with mine, I see that workers and world size differ:
workers=16, world_size=8 vs workers=10, world_size=64
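To make the world_size difference concrete, here is a minimal sketch of how the global batch size changes between the two runs, and what learning rate the common linear scaling heuristic would suggest. The per-GPU batch size of 32 and base learning rate of 0.1 are assumptions for illustration, not values taken from the logs above, and the thread does not confirm that the learning rate is the cause of the accuracy gap.

```python
# Sketch only: PER_GPU_BATCH and BASE_LR are assumed values, not read from the logs.
PER_GPU_BATCH = 32      # assumed per-GPU batch size
BASE_LR = 0.1           # assumed learning rate for the 8-GPU run
BASE_WORLD_SIZE = 8     # world_size of the run that reproduced the result


def effective_batch_size(per_gpu_batch: int, world_size: int) -> int:
    """Global batch size when every process handles per_gpu_batch samples."""
    return per_gpu_batch * world_size


def linearly_scaled_lr(base_lr: float, base_batch: int, batch: int) -> float:
    """Linear scaling heuristic: scale the LR in proportion to the batch size."""
    return base_lr * batch / base_batch


base_batch = effective_batch_size(PER_GPU_BATCH, BASE_WORLD_SIZE)

for world_size in (8, 64):
    batch = effective_batch_size(PER_GPU_BATCH, world_size)
    lr = linearly_scaled_lr(BASE_LR, base_batch, batch)
    print(f"world_size={world_size}: global batch={batch}, suggested lr={lr:.2f}")
```

Under these assumptions, going from 8 to 64 processes multiplies the global batch size by 8, so a run that keeps the 8-GPU learning rate is training with a different effective configuration.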
What did you pass for --gpus-per-node? At the top of the log file, it says 4 GPUs per node. I guess the reported results are with gpus-per-node=8.
This is the command I ran:
@datumbox Thank you so much for the detailed response and for your transparency! In the issue that you mentioned, there appears to be enough information to reproduce the results (except perhaps one detail; let me post a question there).