
Wandb only terminates one process when using DDP.


Describe the bug

Each time I run a sweep (one or multiple runs per agent), some processes are left running; each one hogs 100% of the processor it's assigned to and eats up RAM. GPU memory is not cleared after the sweep finishes, leading to a CUDA out-of-memory error, and afterwards I need to kill all the processes individually. After stopping the sweep I get this warning:

/home/jpohjone/miniconda3/envs/models/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 6 leaked semaphore objects to clean up at shutdown

There is no problem with a normal run (i.e., copying and running the same command that the sweep uses).

To Reproduce

I can create a small example script later if it seems the bug can't be solved otherwise.

Expected behavior

The script should shut down correctly after the sweep is done.

Screenshots

[Screenshot: snippet from htop showing the leftover processes.]

Operating system and other versions

  • OS: Ubuntu 20.04.1 LTS
  • wandb: 0.10.12
  • python: 3.8.5
  • torch: 1.7.0
  • miniconda3

Additional context

I use four GPUs with DDP to distribute jobs, but the problem persists even when I use only one GPU.
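
For context, here is a minimal sketch of the kind of setup described above. The script layout, the worker function, and the detail that every spawned worker calls wandb.init itself are assumptions for illustration only, not the reporter's actual reproduction script:

    # Illustrative sketch only -- assumed layout, not the actual script.
    import os
    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp
    import wandb

    def worker(rank, world_size):
        os.environ["MASTER_ADDR"] = "localhost"
        os.environ["MASTER_PORT"] = "29500"
        dist.init_process_group("nccl", rank=rank, world_size=world_size)
        wandb.init()  # in a sweep, the agent's environment supplies the run config
        # ... build the DDP model, train, log metrics ...
        dist.destroy_process_group()
        # Note: no wandb.finish() here -- the wandb background process in each
        # worker can be left running after the sweep ends (see the comments below).

    if __name__ == "__main__":
        world_size = torch.cuda.device_count()
        mp.spawn(worker, args=(world_size,), nprocs=world_size)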

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Reactions: 6
  • Comments: 11 (1 by maintainers)

Top GitHub Comments

1 reaction
shariqfarooq123 commented, Jan 11, 2022

Calling wandb.finish() in every process at the end solved the issue for me. Thanks!
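
In code, that fix might look like the sketch below; the worker function and its arguments are placeholders:

    import wandb

    def worker(rank, world_size):
        wandb.init()
        try:
            pass  # DDP setup, training loop, wandb.log(...) calls
        finally:
            wandb.finish()  # call this in *every* spawned process, not just
                            # rank 0, so each run's background process exits

wandb.init() can also be used as a context manager (with wandb.init() as run: ...), which calls finish() on exit even if the training code raises.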

1 reaction
viniciusdsmello commented, Jul 18, 2021

I also have this issue; I'm running wandb version 0.11.0. Has anyone figured out how to fix it?


Top Results From Across the Web

Wandb only terminates one process when using DDP. #1579
When I'm using DistributedDataParallel (DDP) to distribute jobs across 4 GPUs and kill a sweep with either ctrl+C or within my training script...

Log distributed training experiments - Weights & Biases - WandB
This is a common solution for logging distributed training experiments with the PyTorch Distributed Data Parallel (DDP) Class. In some cases, users funnel...

What happens if the code crashes in the middle and there was ...
A wandb run can be in any one of the stages: running, finished, crashed. In my experience, it's best to use a context...

Launch Experiments with wandb.init - Documentation
Call wandb.init from just one process and pass data to be logged over ... for more detail on these two approaches, including code...
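
The last two results describe the pattern of logging from a single process. A minimal sketch of that rank-0-only approach, where the project name and loss values below are placeholders:

    import wandb

    def worker(rank, world_size, num_steps=100):
        # ... torch.distributed setup and DDP model construction elided ...
        if rank == 0:
            wandb.init(project="ddp-example")  # hypothetical project name
        for step in range(num_steps):
            loss = 1.0 / (step + 1)            # stand-in for a real training step
            if rank == 0:
                wandb.log({"loss": loss})
        if rank == 0:
            wandb.finish()  # only rank 0 started a run, so only rank 0 finishes it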
