Unable to reproduce classification accuracy using the reference scripts

🐛 Bug

I have been trying to reproduce the reported 79.312% ImageNet accuracy of resnext101_32x8d using the reference scripts, but I could only obtain 75%-76%. I tried two different training runs on 64 GPUs:

  • 16 nodes of 4 V100 GPUs
  • 8 nodes of 8 V100 GPUs

but obtained similar results.

To Reproduce

Clone the master branch of torchvision, then cd vision/references/classification and submit a training job on 64 GPUs with the arguments --model resnext101_32x8d --epochs 100.

The training logs (including std logs) are attached for your information: log.txt and resnext101_32x8d_reproduced.log

Expected behavior

Final top-1 accuracy should be around 79%.

Environment

  • PyTorch / torchvision Version (e.g., 1.0 / 0.4.0): 1.8.1
  • OS (e.g., Linux): Linux
  • How you installed PyTorch / torchvision (conda, pip, source): pip
  • Python version: 3.8
  • CUDA/cuDNN version: 10.2
  • GPU models and configuration: V100

cc @vfdev-5

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 16 (16 by maintainers)

Top GitHub Comments

prabhat00155 commented, Sep 10, 2021

I was able to reproduce the results:

Acc@1 79.314 Acc@5 94.566 

Here is my output log: resnext101_32x8d_logs.txt

Comparing the logs shared above with mine, I see that the number of workers and the world size differ: workers=16, world_size=8 in my run vs workers=10, world_size=64 in yours.

What did you pass for --gpus-per-node? At the top of the log file, it says 4 GPUs per node. I guess the reported results are with gpus-per-node=8.
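
To see why the world size matters, here is a rough sketch of how the global batch size grows with the number of processes when the per-GPU batch size is held fixed. The per-GPU batch size of 32 and the base learning rate of 0.1 are assumptions used for illustration (check them against the arguments printed at the top of your own log), and the linear learning-rate scaling shown is a common heuristic, not something the reference scripts are known to apply automatically.

```python
def global_batch_size(per_gpu_batch_size: int, world_size: int) -> int:
    """Samples consumed per optimizer step, summed over all processes."""
    return per_gpu_batch_size * world_size


def linearly_scaled_lr(base_lr: float, base_world_size: int, world_size: int) -> float:
    """Linear scaling heuristic: scale the learning rate with the global batch size."""
    return base_lr * world_size / base_world_size


# Hypothetical values for illustration only; read the real ones from your log.
PER_GPU_BATCH_SIZE = 32
BASE_LR = 0.1
BASE_WORLD_SIZE = 8

if __name__ == "__main__":
    for world_size in (8, 64):
        batch = global_batch_size(PER_GPU_BATCH_SIZE, world_size)
        lr = linearly_scaled_lr(BASE_LR, BASE_WORLD_SIZE, world_size)
        print(f"world_size={world_size}: global batch={batch}, scaled lr={lr:.3f}")
```

Under these assumptions, a 64-process run sees a global batch eight times larger than an 8-process run, so reusing the same learning rate and schedule is not an equivalent training setup.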

This is the command I ran:

srun -p train --cpus-per-task=16 -t 110:00:00 --gpus-per-node=8 python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py --model resnext101_32x8d --epochs 100 --output-dir logs/run2 > logs/run2/resnext101_32x8d_logs.txt 2>&1
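
To confirm which world size a run actually used, one option is to print the values the launcher exports to each process. This is a minimal sketch, assuming the launcher sets the RANK, LOCAL_RANK and WORLD_SIZE environment variables (which, to my understanding, torch.distributed.launch does when called with --use_env); `describe_launch` is a hypothetical helper, not part of the reference scripts.

```python
import os
from typing import Dict, Mapping


def describe_launch(env: Mapping[str, str] = os.environ) -> Dict[str, int]:
    """Read the distributed settings exported by the launcher.

    Falls back to single-process defaults when a variable is missing.
    """
    return {
        "rank": int(env.get("RANK", 0)),
        "local_rank": int(env.get("LOCAL_RANK", 0)),
        "world_size": int(env.get("WORLD_SIZE", 1)),
    }


if __name__ == "__main__":
    # Each process prints its own view; world_size should match the GPU count.
    print(describe_launch())
```

Running this under the same srun and launcher options as the training command shows whether each process reports world_size=8 or world_size=64.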
netw0rkf10w commented, Dec 21, 2021

@datumbox Thank you so much for the detailed response and for your transparency! The issue that you mentioned appears to have enough information to reproduce the results (except perhaps for one detail; let me post a question there).
