Wandb only terminates one process when using DDP.
Describe the bug
Each time I run a sweep (one or multiple runs per agent), some processes are left running (these hog 100% of the processor they’re assigned to and eat up RAM). The data on the GPUs is not cleared after the sweep finishes, leading to a CUDA out of memory error. Afterwards I need to kill all the processes individually. After stopping the sweep I get this error:
/home/jpohjone/miniconda3/envs/models/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 6 leaked semaphore objects to clean up at shutdown
There is no problem with a normal run (i.e., copying and running the same command that the sweep uses).
To Reproduce
I can create a small example script later if it seems the bug can’t be solved otherwise.
Expected behavior
The script should shut down all of its processes cleanly after the sweep is done.
Screenshots
Snippet from htop
Operating system and other versions
- OS: Ubuntu 20.04.1 LTS
- wandb: 10.12
- python: 3.8.5
- torch: 1.7.0
- miniconda3
Additional context
I use four GPUs with DDP to distribute jobs, but the problem persists even when I only use 1 GPU.
Issue Analytics
- Created 3 years ago
- Reactions: 6
- Comments: 11 (1 by maintainers)
Top GitHub Comments
Calling wandb.finish() in every process at the end solved the issue for me. Thanks!

I also have this issue. I’m running with wandb, version 0.11.0. Has anyone figured out how to fix it?
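
The wandb.finish() suggestion from the first comment can be sketched roughly as follows. This is only an illustration of the pattern, not code from the issue: run_worker, train, and the finish_fn parameter are hypothetical names, and the sketch assumes the worker function is what each DDP process executes. Wrapping the training call in try/finally means the finish call runs in every process, even when training raises.

```python
from typing import Callable, Optional


def run_worker(
    rank: int,
    world_size: int,
    train_fn: Callable[[int, int], None],
    finish_fn: Optional[Callable[[], None]] = None,
) -> None:
    """Run one worker process and always close the wandb run in that process.

    `finish_fn` is a hypothetical hook added for illustration; when it is
    omitted, wandb.finish is used.
    """
    if finish_fn is None:
        # Imported lazily so the helper can be exercised without wandb installed.
        import wandb

        finish_fn = wandb.finish
    try:
        train_fn(rank, world_size)
    finally:
        # Executed in every process, whether training succeeded or raised.
        finish_fn()


def train(rank: int, world_size: int) -> None:
    """Hypothetical placeholder for the real per-rank training loop."""
    pass
```

With torch.multiprocessing, a worker like this would typically be launched with mp.spawn(run_worker, args=(world_size, train), nprocs=world_size), which calls run_worker(i, world_size, train) once per process. Whether this also clears the leftover processes in a sweep depends on the setup; the comment above only reports that it worked in one case.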