DistributedEvalSampler hangs at the end of the script when using DDP
Dear author, first of all, thank you for your great work!
I am trying to use your implementation of DistributedEvalSampler for evaluation, jointly with DDP (with shuffle=False and without calling set_epoch(); after DistributedEvalSampler yields the test samples for evaluating the model, my program should simply finish).
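Roughly, my setup looks like the sketch below (simplified; the dataset, model, world size, and the import path for DistributedEvalSampler are placeholders, not my exact code):

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader

from distributed_eval_sampler import DistributedEvalSampler  # import path is an assumption


def evaluate(rank, world_size, model, test_set):
    # Each rank gets a disjoint, non-padded shard of the test set.
    sampler = DistributedEvalSampler(test_set, num_replicas=world_size,
                                     rank=rank, shuffle=False)
    loader = DataLoader(test_set, batch_size=1, sampler=sampler)

    ddp_model = DDP(model.to(rank), device_ids=[rank])
    ddp_model.eval()

    with torch.no_grad():
        for batch in loader:
            outputs = ddp_model(batch.to(rank))
            # ... accumulate per-rank metrics here ...
```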
At the end of the script, my program hangs at 100% GPU utilization on 2 of the 3 GPUs (the last device terminates on its own with no errors).
When I replace the sampler with DistributedSampler, this does not occur.
I suspected the logging (e.g., Wandb), which happens on the rank 0 device, but it is not the root cause: the hang still occurs when I turn the logging tool off.
I wonder if you could point out any conditions I might have missed? Thank you in advance.
Best, Adam
Hi @vaseline555,
Is your dataset size divisible by the number of GPUs? If so, there should be no difference in the behavior of DistributedSampler and DistributedEvalSampler.

Are you using any kind of communication between processes that requires synchronization, e.g., back-propagation? DistributedEvalSampler does not require any communication between processes, and I don't think it is the source of the hanging. If you are using other synchronization-based operations, they may expect the same dataset length per process. For example, if your total dataset size is 5 and you are using 3 processes, GPUs 0 and 1 will be processing the 2nd item while GPU 2 is already done after the 1st iteration. If you then call a synchronization-based operation, GPUs 0 and 1 will wait for a response from GPU 2 that will never come. When I need to do backpropagation at test time for each item, I turn off synchronization.

Best, Seungjun
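To illustrate the failure mode described above, here is a minimal, hypothetical script (not from this thread) with uneven per-rank shard lengths and a collective call inside the loop; it will hang by design when run, which is exactly the symptom reported:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def run(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Uneven shards, e.g. a dataset of 5 items split over 3 ranks:
    # rank 0 -> 2 items, rank 1 -> 2 items, rank 2 -> 1 item.
    num_items = [2, 2, 1][rank]

    for i in range(num_items):
        metric = torch.tensor([float(rank + i)])
        # Collective call inside the loop: every rank must reach it the
        # same number of times. On iteration 2, ranks 0 and 1 wait for
        # rank 2, which has already left the loop -> deadlock.
        dist.all_reduce(metric)

    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(run, args=(3,), nprocs=3)
```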
@SeungjunNah Thanks for your reply! I will try to use all_gather outside of the data loop.
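That idea might look roughly like the sketch below (an assumption on my part, not code confirmed in this thread; it relies on torch.distributed.all_gather_object, available in PyTorch >= 1.8, which tolerates different result-list lengths per rank, and the names are placeholders):

```python
import torch
import torch.distributed as dist


@torch.no_grad()
def evaluate(loader, ddp_model, rank, world_size):
    local_results = []
    for batch in loader:                      # per-rank shard, possibly uneven length
        outputs = ddp_model(batch.to(rank))
        local_results.append(outputs.cpu())   # no collective calls inside the loop

    # Single collective after every rank has finished its own shard.
    gathered = [None] * world_size
    dist.all_gather_object(gathered, local_results)

    if rank == 0:
        all_results = [r for per_rank in gathered for r in per_rank]
        # ... compute final metrics on the gathered results ...
```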