
DistributedEvalSampler hangs at the end of the script when using DDP

See original GitHub issue

Dear author, thank you first of all for your great work!

I am trying to use your implementation of DistributedEvalSampler for evaluation purposes, jointly with DDP (with shuffle=False and no call to set_epoch(); after DistributedEvalSampler yields the test samples for evaluating the model, my program should finish).

At the end of the script, my program hangs at 100% GPU utilization on 2 of the 3 GPUs (the last device terminates cleanly with no errors). When DistributedEvalSampler is replaced with DistributedSampler, this does not occur.

I suspected it was because logging (e.g., Wandb) happens on the rank 0 device, but that is not the root cause, as the hang still occurs with the logging tool turned off.

I wonder if you could point out any conditions I might have missed? Thank you in advance.

Best, Adam
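
For context, here is a minimal sketch of the kind of evaluation setup being described. The model, dataset, and import path are placeholders, not the poster's actual code:

import torch
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader

# Hypothetical import path; DistributedEvalSampler comes from this repository.
from distributed_eval_sampler import DistributedEvalSampler


def evaluate(model, dataset, local_rank):
    device = torch.device("cuda", local_rank)
    ddp_model = DDP(model.to(device), device_ids=[local_rank])
    ddp_model.eval()

    # shuffle=False and no set_epoch() call, as described above.
    sampler = DistributedEvalSampler(dataset, shuffle=False)
    loader = DataLoader(dataset, batch_size=1, sampler=sampler)

    with torch.no_grad():
        for batch in loader:
            _ = ddp_model(batch.to(device))
    # After this loop the script is expected to exit; the reported symptom is
    # that two of the three ranks instead spin at 100% GPU utilization here.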

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

1 reaction
SeungjunNah commented, Apr 26, 2022

Hi @vaseline555,

  1. Is your dataset size divisible by the number of GPUs? If so, there should be no difference in the behavior of DistributedSampler and DistributedEvalSampler.

  2. Are you using any kind of communication between processes that requires synchronization, e.g., back-propagation? DistributedEvalSampler does not require any communication between processes, and I don’t think it is the source of the hanging. If you are using other synchronization-based operations, they may expect the same dataset length per process. For example, if your total dataset size is 5 and you are using 3 processes, GPUs 0 and 1 will be processing their 2nd item while GPU 2 is already done after its 1st iteration. Any synchronization-based operation will then leave GPUs 0 and 1 waiting for a response from GPU 2 that never arrives. When I need to do backpropagation at test time for each item, I turn off synchronization:

self.model.model.G.require_backward_grad_sync = False   # compute without DDP sync

Best, Seungjun
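
To make the dataset-size example above concrete, here is an illustrative sketch (not from the thread) of how a per-iteration collective call deadlocks when ranks receive different numbers of samples, and one way to avoid it; the function names are hypothetical:

import torch
import torch.distributed as dist

# With dataset size 5 and world size 3, DistributedEvalSampler gives ranks 0
# and 1 two samples each, while rank 2 gets only one.

def risky_eval_loop(ddp_model, loader, device):
    total = torch.zeros(1, device=device)
    for batch in loader:
        out = ddp_model(batch.to(device)).sum().detach()
        # Rank 2 exits its loop after one iteration and never issues a second
        # all_reduce, so ranks 0 and 1 block here forever -> the reported hang.
        dist.all_reduce(out)
        total += out
    return total

def safer_eval_loop(ddp_model, loader, device):
    total = torch.zeros(1, device=device)
    for batch in loader:
        out = ddp_model(batch.to(device)).sum().detach()
        total += out              # no collectives inside the loop
    dist.all_reduce(total)        # every rank calls this exactly once
    return total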

0 reactions
DaoD commented, Jul 7, 2022

@SeungjunNah Thanks for your reply! I will try to use all_gather outside the data loop.
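
A hedged sketch of what moving all_gather outside the data loop could look like. Because DistributedEvalSampler can leave ranks with different numbers of results, this example uses all_gather_object, which does not require equal tensor sizes; it is one possible approach, not necessarily what was implemented here:

import torch.distributed as dist

def gather_predictions(local_preds):
    """Collect each rank's list of predictions on every rank, after the
    evaluation loop has finished on all processes."""
    gathered = [None] * dist.get_world_size()
    # all_gather_object tolerates lists of different lengths per rank, which
    # is what DistributedEvalSampler produces when the dataset size is not
    # divisible by the number of processes.
    dist.all_gather_object(gathered, local_preds)
    # Flatten back into a single list ordered by rank.
    return [p for rank_preds in gathered for p in rank_preds]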

Read more comments on GitHub >

Top Results From Across the Web

Script freezes with no output when using DistributedDataParallel
However, when using DDP, the script gets frozen at a random point. The GPU usage is stuck at 100% and the process is...
Read more >
DDP script hangs forever, doesn't run - PyTorch Forums
I tried to use it in a single-node, 4-GPU EC2 with 2 different techniques, both hang forever (1min+) with CPU and GPU idle....
Read more >
Pytorch Lightning duplicates main script in ddp mode
I have since moved on to use the native "ddp" with multiprocessing in PyTorch. As far as I understand, PytorchLightning (PTL) is just ......
Read more >
Introducing Distributed Data Parallel support on PyTorch ...
This can lead to freezing of the DDP training process, because the script fails to initialize the FileStore. A workaround is to manually...
Read more >
Distribute your PyTorch model in less than 20 lines of code
WARNING: If you accidentally send your model to cuda:0 before the DDP wrapping (maybe due to old residual code), the script freezes with...
Read more >
