
DistributedEvalSampler hangs at the end of the script when using DDP

See original GitHub issue

Dear author, thank you first of all for your great work!

I am trying to use your implementation of DistributedEvalSampler for evaluation purposes, jointly with DDP (with shuffle=False and no call to set_epoch(); after DistributedEvalSampler yields the test samples for evaluating the model, my program should finish).

At the end of the script, my program hangs at 100% GPU utilization on 2 of the 3 GPUs (the last device terminates cleanly with no errors). When DistributedEvalSampler is replaced with DistributedSampler, this does not occur.

I suspected it was because logging (e.g., Wandb) happens on the rank 0 device, but that is not the root cause, as the hang still occurs with the logging tool turned off.

I wonder if you could point out any conditions I might have missed? Thank you in advance.

Best, Adam
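
For context, here is a minimal sketch of the kind of evaluation setup being described. The model, dataset, and import path are placeholders, not the poster's actual code:

import torch
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader

# Hypothetical import path; DistributedEvalSampler comes from this repository.
from distributed_eval_sampler import DistributedEvalSampler


def evaluate(model, dataset, local_rank):
    device = torch.device("cuda", local_rank)
    ddp_model = DDP(model.to(device), device_ids=[local_rank])
    ddp_model.eval()

    # shuffle=False and no set_epoch() call, as described above.
    sampler = DistributedEvalSampler(dataset, shuffle=False)
    loader = DataLoader(dataset, batch_size=1, sampler=sampler)

    with torch.no_grad():
        for batch in loader:
            _ = ddp_model(batch.to(device))
    # After this loop the script is expected to exit; the reported symptom is
    # that two of the three ranks instead spin at 100% GPU utilization here.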

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

1 reaction
SeungjunNah commented, Apr 26, 2022

Hi @vaseline555,

  1. Is your dataset size divisible by the number of GPUs? If so, there should be no difference in the behavior of DistributedSampler and DistributedEvalSampler.

  2. Are you using any kind of communication between processes that requires synchronization, e.g., back-propagation? DistributedEvalSampler does not require any communication between processes, and I don’t think it is the source of the hanging. If you are using other synchronization-based operations, they may expect the same dataset length per process. For example, if your total dataset size is 5 and you are using 3 processes, GPUs 0 and 1 will be processing their 2nd item while GPU 2 is already done after its 1st iteration. Any synchronization-based operation will then leave GPUs 0 and 1 waiting for a response from GPU 2 that never arrives. When I need to do backpropagation at test time for each item, I turn off synchronization:

self.model.model.G.require_backward_grad_sync = False   # compute without DDP sync

Best, Seungjun
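
To make the dataset-size example above concrete, here is an illustrative sketch (not from the thread) of how a per-iteration collective call deadlocks when ranks receive different numbers of samples, and one way to avoid it; the function names are hypothetical:

import torch
import torch.distributed as dist

# With dataset size 5 and world size 3, DistributedEvalSampler gives ranks 0
# and 1 two samples each, while rank 2 gets only one.

def risky_eval_loop(ddp_model, loader, device):
    total = torch.zeros(1, device=device)
    for batch in loader:
        out = ddp_model(batch.to(device)).sum().detach()
        # Rank 2 exits its loop after one iteration and never issues a second
        # all_reduce, so ranks 0 and 1 block here forever -> the reported hang.
        dist.all_reduce(out)
        total += out
    return total

def safer_eval_loop(ddp_model, loader, device):
    total = torch.zeros(1, device=device)
    for batch in loader:
        out = ddp_model(batch.to(device)).sum().detach()
        total += out              # no collectives inside the loop
    dist.all_reduce(total)        # every rank calls this exactly once
    return total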

0 reactions
DaoD commented, Jul 7, 2022

@SeungjunNah Thanks for your reply! I will try to use all_gather outside the data loop.
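
A hedged sketch of what moving all_gather outside the data loop could look like. Because DistributedEvalSampler can leave ranks with different numbers of results, this example uses all_gather_object, which does not require equal tensor sizes; it is one possible approach, not necessarily what was implemented here:

import torch.distributed as dist

def gather_predictions(local_preds):
    """Collect each rank's list of predictions on every rank, after the
    evaluation loop has finished on all processes."""
    gathered = [None] * dist.get_world_size()
    # all_gather_object tolerates lists of different lengths per rank, which
    # is what DistributedEvalSampler produces when the dataset size is not
    # divisible by the number of processes.
    dist.all_gather_object(gathered, local_preds)
    # Flatten back into a single list ordered by rank.
    return [p for rank_preds in gathered for p in rank_preds]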

Read more comments on GitHub >

Top Results From Across the Web

Script freezes with no output when using DistributedDataParallel
However, when using DDP, the script gets frozen at a random point. The GPU usage is stuck at 100% and the process is...
Read more >
DDP script hangs forever, doesn't run - PyTorch Forums
I tried to use it in a single-node, 4-GPU EC2 with 2 different techniques, both hang forever (1min+) with CPU and GPU idle....
Read more >
Pytorch Lightning duplicates main script in ddp mode
I have since moved on to use the native "ddp" with multiprocessing in PyTorch. As far as I understand, PytorchLightning (PTL) is just ......
Read more >
Introducing Distributed Data Parallel support on PyTorch ...
This can lead to freezing of the DDP training process, because the script fails to initialize the FileStore. A workaround is to manually...
Read more >
Distribute your PyTorch model in less than 20 lines of code
WARNING: If you accidentally send your model to cuda:0 before the DDP wrapping (maybe due to old residual code), the script freezes with...
Read more >
