Wandb only terminates one process when using DDP.
See original GitHub issueDescribe the bug
Each time I run a sweep (one or multiple runs per agent), some processes are left running (these hog 100% the processor they’re assigned to and eat up RAM). The data from GPUs is not cleared after finishing a sweep, leading to  CUDA out of memory error.  Afterwards I need to kill all processes individually. After stopping the sweep I get this error.
/home/jpohjone/miniconda3/envs/models/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 6 leaked semaphore objects to clean up at shutdown
There is no problem when running a normal run (ie. copying and running the same command that the sweep uses).
To Reproduce
I can create a small example script later I if seems the bug can’t be solved otherwise.
Expected behavior
Shut down the script correctly after sweep is done.
Screenshots
Snippet from htop.
Operating system and other versions
- OS: Ubuntu 20.04.1 LT
 - wand: 10.12
 - python: 3.8.5
 - torch: 1.7.0
 - miniconda3
 
Additional context
I use four GPUs with DDP to distribute jobs, but the problem persists even when I only use 1 GPU.
Issue Analytics
- State:
 - Created 3 years ago
 - Reactions:6
 - Comments:11 (1 by maintainers)
 

Top Related StackOverflow Question
Calling
wandb.finish()in every process at the end solved the issue for me. Thanks!I also have this issue, I’m running with wandb
wandb, version 0.11.0. Has anyone figured out how to fix it?