GPU memory leak while running sweeps
wandb --version && python --version && uname
Weights and Biases version: 0.9.7
Python version: 3.7.9
Operating System: Ubuntu 18.04 LTS
Description
I’m running sweeps, and I notice that every so often one of the GPUs doesn’t reclaim all its memory after a training job goes away. It ends up in this horrible CUDA-bug state where the top half of nvidia-smi reports the memory as used, but the bottom half doesn’t list any process that owns it. I can only reclaim the memory by rebooting the machine. (I’ve read that nvidia-smi -r can sometimes fix this, but it has never let me reset the GPU that way, I think because X-windows is running on it.)
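For anyone hitting the same state, a quick sanity check is to ask nvidia-smi directly which processes it thinks are holding GPU memory. This is only a rough sketch; the query fields below are an assumption and may vary across driver versions:

```python
# Rough sketch: list the processes nvidia-smi reports as holding GPU memory.
# If the memory shows as "used" but nothing is listed here, the allocation is
# orphaned and (in my experience) only a GPU reset or reboot reclaims it.
import subprocess

result = subprocess.run(
    ["nvidia-smi",
     "--query-compute-apps=pid,process_name,used_memory",
     "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
print(result.stdout.strip() or "no compute processes reported")
```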
What I Did
This is not a great bug report, because I don’t know how to reproduce it. I’m not even sure it has anything to do with wandb; it could just be some bug between CUDA and PyTorch. But I’ve seen it three or four times now, and only when running wandb sweeps. I’ve mostly been using hyperband early termination with my sweeps, and I sometimes kill jobs manually from the wandb web UI. So I suspect it has something to do with the way the agent kills the Python process that’s using the GPU; maybe it’s not cleaning up properly.
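If it really is the agent killing the process without the CUDA context being torn down cleanly, one workaround I might try is trapping the termination signal in the training script and releasing GPU memory explicitly before exiting. This is only a sketch and assumes the agent stops runs with SIGTERM, which I haven’t verified:

```python
# Sketch of a defensive shutdown hook: free CUDA state explicitly when the
# process is told to stop, rather than relying on interpreter teardown order.
# Assumes the wandb agent stops runs via SIGTERM; adjust if it uses another signal.
import gc
import signal
import sys

import torch

def _graceful_shutdown(signum, frame):
    gc.collect()              # drop lingering Python references to tensors
    torch.cuda.empty_cache()  # hand cached blocks back to the CUDA driver
    sys.exit(0)               # exit so normal cleanup can still run

signal.signal(signal.SIGTERM, _graceful_shutdown)
```

Whether this actually prevents the orphaned allocation is an open question; it just makes the shutdown path deterministic instead of whatever happens when the process is killed mid-step.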
Top GitHub Comments
I see, thanks for following up. We’re looking into fixing this issue.
Hello, I’m hitting the same error while using the sweep module. The program cannot free the GPU memory by itself, so I have to clean it up manually afterwards. It also seems I can’t kill the program with Ctrl+C: it prints a wandb log message saying “ctrl+c pressed” and keeps running as normal. If I press Ctrl+C twice, the program is killed but the GPU memory stays leaked. Any solutions for this issue? Thanks a lot.