multi-gpu ddp calls validation and testing loops too many times

When using DDP with multiple GPUs, every GPU runs each validation and test loop over the entire dataset.

The expected behavior is that the dataset is divided across the GPUs.

I am using current master (cloned Mar 14), Ubuntu 19.10, CUDA 10.1, Python 3.7.5, PyTorch 1.4, in a venv environment.

The problem appears to be in auto_add_sampler() in data_loading.py. It does not create a DistributedSampler for validation or test datasets.
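To make the difference concrete, here is a minimal pure-Python sketch of the reported behavior versus the expected one. It is not the Lightning code in question: `shard_indices` is a hypothetical helper that imitates the non-shuffled, padded, strided split that `torch.utils.data.DistributedSampler` performs as far as I understand it, and the dataset size and GPU count are made-up numbers.

```python
import math


def shard_indices(dataset_len, num_replicas, rank):
    """Hypothetical helper: indices one process evaluates under a
    non-shuffled, padded, strided split across num_replicas processes."""
    if dataset_len <= 0:
        raise ValueError("dataset_len must be positive")
    if not 0 <= rank < num_replicas:
        raise ValueError("rank must be in [0, num_replicas)")
    per_rank = math.ceil(dataset_len / num_replicas)
    total = per_rank * num_replicas
    indices = list(range(dataset_len))
    # Pad by repeating indices from the start so every rank gets the same count.
    indices = (indices * math.ceil(total / dataset_len))[:total]
    # Each rank takes every num_replicas-th index, starting at its own rank.
    return indices[rank:total:num_replicas]


val_len, world_size = 10, 4  # assumed sizes, for illustration only

# Reported behavior: no sampler, so every GPU walks the whole validation set.
unsharded = [list(range(val_len)) for _ in range(world_size)]

# Expected behavior: each GPU evaluates only its own slice.
sharded = [shard_indices(val_len, world_size, r) for r in range(world_size)]

print(sum(len(s) for s in unsharded))  # total samples processed without sharding
print(sharded)                         # per-GPU index lists with sharding
```

In real code the equivalent would be passing a `DistributedSampler` to the validation and test `DataLoader`s via the `sampler` argument. Note that the padding step means a few samples can be evaluated twice when the dataset size is not a multiple of the number of GPUs, which can slightly skew aggregated metrics.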

Issue Analytics

State:
Created 4 years ago
Comments: 6 (6 by maintainers)

Top GitHub Comments

sneiman commented, Mar 18, 2020 (1 reaction)

will do on both pr, and hash ref

sneiman commented, Mar 17, 2020 (1 reaction)

Testing underway. Will make PR tomorrow.

Top Results From Across the Web

From PyTorch DDP to Accelerate to Trainer, mastery of ...

This tutorial assumes you have a basic understanding of PyTorch and how to train a simple model. It will showcase training on multiple...

GPU training (Intermediate) - PyTorch Lightning - Read the Docs

This Lightning implementation of DDP calls your script under the hood multiple times with the correct environment variables: # example for 3 GPUs...

How distributed training works in Pytorch - AI Summer

In this tutorial, we will learn how to use nn.parallel.DistributedDataParallel for training our models in multiple GPUs.

Dope report – Weights & Biases - Wandb

Research often involves editing the boiler plate code with new ... out the main parts of the training loop and the validation loop...

