Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Unable to reproduce classification accuracy using the reference scripts

See original GitHub issue

🐛 Bug

I have been trying to reproduce the reported 79.312% accuracy on ImageNet of resnext101_32x8d using the reference scripts, but I could obtain only 75%-76%. I try two different trainings on 64 GPUs:

16 nodes of 4 V100 GPUs
8 nodes of 8 V100 GPUs

but obtained similar results.

To Reproduce

Clone the master branch of torchvision, then cd vision/references/classification and submit a training to 64 GPUs with arguments --model resnext101_32x8d --epochs 100.

The training logs (including std logs) are attached for your information: log.txt and resnext101_32x8d_reproduced.log

Expected behavior

Final top-1 accuracy should be around 79%.

Environment

PyTorch / torchvision Version (e.g., 1.0 / 0.4.0): 1.8.1
OS (e.g., Linux): Linux
How you installed PyTorch / torchvision (conda, pip, source): pip
Python version: 3.8
CUDA/cuDNN version: 10.2
GPU models and configuration: V100

cc @vfdev-5

Issue Analytics

State:
Created 2 years ago
Comments:16 (16 by maintainers)

Top GitHub Comments

2reactions

prabhat00155commented, Sep 10, 2021

I was able to reproduce the results:

Acc@1 79.314 Acc@5 94.566

Here is my output log: resnext101_32x8d_logs.txt

Comparing the logs shared above with mine, I see workers and world size being different. workers=16, world_size=8 vs workers=10, world_size=64

What did you pass for --gpus-per-node? At the top of the log file, it says 4 GPUs per node. I guess the reported results are with gpus-per-node=8.

This is the command I ran:

srun -p train --cpus-per-task=16 -t 110:00:00 --gpus-per-node=8 python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py --model resnext101_32x8d --epochs 100 --output-dir logs/run2 > logs/run2/resnext101_32x8d_logs.txt 2>&1

1reaction

netw0rkf10wcommented, Dec 21, 2021

@datumbox Thank you so much for the detailed response and for your transparency! In the issue that you mentioned, there appear to have enough information to reproduce the results (may except one detail, let me post a question there).

Top Results From Across the Web

Failure of Classification Accuracy for Imbalanced Class ...

Classification accuracy involves first using a classification model to make a prediction for each example in a test dataset. The predictions are ...

My dogs vs cats models always have 0.5 accuracy

Confirming issue is occurring: Method 1: accuracy for model stays around 0.5 while training (or 1/n where n is number of classes). Method...

Troubleshoot designer component errors - Azure Machine ...

Select the failed component, go to the Outputs+logs tab, ... trying to compare the accuracy of a linear regressor with a binary classifier....

Choosing the "Correct" Seed for Reproducible Research/Results

The whole point of the seeds is that you've got a script that someone can use to completely reproduce the exact results you...

Understanding Confusion Matrix | by Sarang Narkhede

Well, it is a performance measurement for machine learning classification problem where output can be two or more classes. It is a table...

Troubleshoot Live Code

Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.

Start Free

Top Related Reddit Thread

No results found

Top Related Tweet

No results found

Top Related Dev.to Post

No results found

Unable to reproduce classification accuracy using the reference scripts

🐛 Bug

To Reproduce

Expected behavior

Environment

Issue Analytics

Top GitHub Comments

Top Results From Across the Web

Top Related Medium Post

Top Related StackOverflow Question

Troubleshoot Live Code

Top Related Reddit Thread

Top Related Hackernoon Post

Top Related Tweet

Top Related Dev.to Post

Top Related Hashnode Post

Installing from conda with official installation command latest PyTorch 1.9.0 installs torchvision 0.2.2

`torchvision` breaks in official `pytorch` Docker image: `RuntimeError: Couldn't load custom C++ ops.`