
Wandb only terminates one process when using DDP.


Describe the bug

Each time I run a sweep (one or multiple runs per agent), some processes are left running; each one hogs 100% of the processor it's assigned to and eats up RAM. GPU memory is not cleared after the sweep finishes, leading to a CUDA out-of-memory error, and afterwards I need to kill all the processes individually. After stopping the sweep I get this warning:

/home/jpohjone/miniconda3/envs/models/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 6 leaked semaphore objects to clean up at shutdown

There is no problem with a normal run (i.e., copying and running the same command that the sweep uses).

To Reproduce

I can create a small example script later if it seems the bug can't be solved otherwise.

Expected behavior

The script should shut down correctly after the sweep is done.

Screenshots

[Screenshot: snippet from htop showing the leftover processes.]

Operating system and other versions

  • OS: Ubuntu 20.04.1 LTS
  • wandb: 0.10.12
  • python: 3.8.5
  • torch: 1.7.0
  • miniconda3

Additional context

I use four GPUs with DDP to distribute jobs, but the problem persists even when I use only one GPU.
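
For context, here is a minimal sketch of the kind of setup described above. The script layout, the worker function, and the detail that every spawned worker calls wandb.init itself are assumptions for illustration only, not the reporter's actual reproduction script:

    # Illustrative sketch only -- assumed layout, not the actual script.
    import os
    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp
    import wandb

    def worker(rank, world_size):
        os.environ["MASTER_ADDR"] = "localhost"
        os.environ["MASTER_PORT"] = "29500"
        dist.init_process_group("nccl", rank=rank, world_size=world_size)
        wandb.init()  # in a sweep, the agent's environment supplies the run config
        # ... build the DDP model, train, log metrics ...
        dist.destroy_process_group()
        # Note: no wandb.finish() here -- the wandb background process in each
        # worker can be left running after the sweep ends (see the comments below).

    if __name__ == "__main__":
        world_size = torch.cuda.device_count()
        mp.spawn(worker, args=(world_size,), nprocs=world_size)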

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Reactions: 6
  • Comments: 11 (1 by maintainers)

Top GitHub Comments

1 reaction
shariqfarooq123 commented, Jan 11, 2022

Calling wandb.finish() in every process at the end solved the issue for me. Thanks!
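
In code, that fix might look like the sketch below; the worker function and its arguments are placeholders:

    import wandb

    def worker(rank, world_size):
        wandb.init()
        try:
            pass  # DDP setup, training loop, wandb.log(...) calls
        finally:
            wandb.finish()  # call this in *every* spawned process, not just
                            # rank 0, so each run's background process exits

wandb.init() can also be used as a context manager (with wandb.init() as run: ...), which calls finish() on exit even if the training code raises.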

1 reaction
viniciusdsmello commented, Jul 18, 2021

I also have this issue; I'm running wandb version 0.11.0. Has anyone figured out how to fix it?


Top Results From Across the Web

Wandb only terminates one process when using DDP. #1579
When I'm using DistributedDataParallel (DDP) to distribute jobs across 4 GPUs and kill a sweep with either ctrl+C or within my training script...

Log distributed training experiments - Weights & Biases - WandB
This is a common solution for logging distributed training experiments with the PyTorch Distributed Data Parallel (DDP) Class. In some cases, users funnel...

What happens if the code crashes in the middle and there was ...
A wandb run can be in any one of the stages: running, finished, crashed. In my experience, it's best to use a context...

Launch Experiments with wandb.init - Documentation
Call wandb.init from just one process and pass data to be logged over ... for more detail on these two approaches, including code...
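
The last two results describe the pattern of logging from a single process. A minimal sketch of that rank-0-only approach, where the project name and loss values below are placeholders:

    import wandb

    def worker(rank, world_size, num_steps=100):
        # ... torch.distributed setup and DDP model construction elided ...
        if rank == 0:
            wandb.init(project="ddp-example")  # hypothetical project name
        for step in range(num_steps):
            loss = 1.0 / (step + 1)            # stand-in for a real training step
            if rank == 0:
                wandb.log({"loss": loss})
        if rank == 0:
            wandb.finish()  # only rank 0 started a run, so only rank 0 finishes it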
