GPU memory leak while running sweeps

wandb --version && python --version && uname

  • Weights and Biases version: 0.9.7
  • Python version: 3.7.9
  • Operating System: Ubuntu 18.04 LTS

Description

I’m running sweeps, and I notice that every so often one of the GPUs doesn’t reclaim all its memory after a training job exits. It ends up in this horrible CUDA-bug state where the top half of the nvidia-smi output reports the memory as used, but the bottom half doesn’t list any process that owns that memory. I can only reclaim the memory by rebooting the machine. (I’ve read that sometimes nvidia-smi -r will fix this, but it has never let me reset the GPU that way, I think because X-windows is running on it.)
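The state described above (memory reported as used, no owning process listed) can be checked for with a script. The following is a minimal sketch, not something from the original report: the function names and the 100 MiB threshold are made up for illustration. It relies on the `--query-gpu` and `--query-compute-apps` options of nvidia-smi. Note that a GPU driving an X server will normally show some used memory with no compute process attached, so a hit on that GPU is not necessarily a leak.

```python
import subprocess


def parse_csv_rows(text):
    """Split nvidia-smi CSV output (no header) into lists of stripped fields."""
    return [
        [field.strip() for field in line.split(",")]
        for line in text.splitlines()
        if line.strip()
    ]


def find_unattributed_gpus(gpu_csv, apps_csv, min_used_mib=100):
    """Return UUIDs of GPUs that report used memory but list no compute process.

    gpu_csv:  rows of "uuid, memory.used" (MiB, without units)
    apps_csv: rows of "gpu_uuid, pid"
    min_used_mib is an arbitrary threshold chosen for this sketch.
    """
    busy = {row[0] for row in parse_csv_rows(apps_csv)}
    suspects = []
    for uuid, used in parse_csv_rows(gpu_csv):
        if not used.isdigit():
            continue  # e.g. "[N/A]" on devices that do not report memory
        if int(used) >= min_used_mib and uuid not in busy:
            suspects.append(uuid)
    return suspects


def query_nvidia_smi(query_flag):
    """Run nvidia-smi with one --query-* flag and return its CSV output."""
    result = subprocess.run(
        ["nvidia-smi", query_flag, "--format=csv,noheader,nounits"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def check_local_gpus():
    """Query the local machine and return the suspect GPU UUIDs."""
    gpus = query_nvidia_smi("--query-gpu=uuid,memory.used")
    apps = query_nvidia_smi("--query-compute-apps=gpu_uuid,pid")
    return find_unattributed_gpus(gpus, apps)
```

Calling `check_local_gpus()` between sweep runs, while no training job is expected to be active, would show which GPUs are still holding memory.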

What I Did

This is not a great bug report, because I don’t know how to repro it. I’m not even sure it has anything to do with wandb; it could just be some bug between CUDA and PyTorch. But I’ve seen it three or four times now, and only when running wandb sweeps. I’ve mostly been using Hyperband early termination with my sweeps, and I sometimes kill jobs manually from the wandb web UI. So I suspect it has something to do with the way the agent kills the Python process that’s using the GPU - maybe it’s not cleaning up properly.
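If the suspicion is right that the training process is terminated before it can clean up, one possible mitigation on the training-script side is to turn the termination signal into an ordinary Python exception so that `finally` blocks still run. The sketch below assumes the process receives SIGTERM; the issue does not say which signal the agent actually sends, and SIGKILL cannot be caught at all. `run_with_cleanup`, `train_fn` and `cleanup_fn` are hypothetical names used only for illustration.

```python
import signal


def _raise_system_exit(signum, frame):
    # Convert the signal into SystemExit so `finally` blocks get a chance to run.
    raise SystemExit(128 + signum)


def run_with_cleanup(train_fn, cleanup_fn):
    """Run train_fn, calling cleanup_fn even if SIGTERM arrives mid-run.

    Must be called from the main thread, because signal handlers can only
    be installed there.
    """
    previous = signal.signal(signal.SIGTERM, _raise_system_exit)
    try:
        return train_fn()
    finally:
        # Release resources here, e.g. stop worker processes and close files.
        cleanup_fn()
        signal.signal(signal.SIGTERM, previous)
```

In a real script, `cleanup_fn` would be the place to shut down data-loader workers or other child processes that might otherwise keep a CUDA context alive. Whether this avoids the leak described here is untested.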

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Reactions: 4
  • Comments: 30 (9 by maintainers)

Top GitHub Comments

2 reactions
tyomhak commented, Oct 21, 2020

I see, thanks for following up. We’re looking into fixing this issue.

2 reactions
zoeyuchao commented, Oct 15, 2020

Hello, I hit the same error while using the sweep module. The program cannot free the GPU memory by itself; I have to clean up the GPU memory afterwards. It also seems that I cannot kill the program with Ctrl+C: it prints a wandb log message, “ctrl+c pressed”, and keeps running as normal. If I press Ctrl+C twice, the program is killed but leaves leaked GPU memory behind. Are there any solutions to this issue? Thanks a lot.
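The comment does not say how the GPU memory was cleaned up afterwards. One common cause of this symptom is a leftover process that still holds the GPU device files open after the main program has been killed. The sketch below looks for such processes on Linux by scanning `/proc`; the function name is hypothetical, and the scan only sees processes the current user is allowed to inspect, so it may need to run as root. It would not help in the case from the original report, where no owning process exists.

```python
import os


def pids_with_open_path(path_prefix="/dev/nvidia", proc_root="/proc"):
    """Return sorted PIDs holding an open file descriptor under path_prefix."""
    found = set()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited or permission denied
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # descriptor closed while scanning
            if target.startswith(path_prefix):
                found.add(int(entry))
                break
    return sorted(found)
```

Any PID returned after a sweep run has ended is a candidate for manual inspection before deciding whether to terminate it.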

GPU memory leak while running sweeps · Issue #1247 - GitHub