GPU memory leak while running sweeps
wandb --version && python --version && uname
Weights and Biases version: 0.9.7
Python version: 3.7.9
Operating System: Ubuntu 18.04 LTS
Description
I’m running sweeps, and I notice that every so often one of the GPUs doesn’t reclaim all its memory after a training job goes away. It ends up in this horrible CUDA-bug state where the top half of nvidia-smi reports the memory as used, but the bottom half doesn’t list any process that owns it. I can only reclaim the memory by rebooting the machine. (I’ve read that nvidia-smi -r can sometimes fix this, but it has never let me reset the GPU that way, I think because X-windows is running on it.)
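For anyone hitting the same state, a quick sanity check is to ask nvidia-smi directly which processes it thinks are holding GPU memory. This is only a rough sketch; the query fields below are an assumption and may vary across driver versions:

```python
# Rough sketch: list the processes nvidia-smi reports as holding GPU memory.
# If the memory shows as "used" but nothing is listed here, the allocation is
# orphaned and (in my experience) only a GPU reset or reboot reclaims it.
import subprocess

result = subprocess.run(
    ["nvidia-smi",
     "--query-compute-apps=pid,process_name,used_memory",
     "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
print(result.stdout.strip() or "no compute processes reported")
```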
What I Did
This is not a great bug report, because I don’t know how to reproduce it. I’m not even sure it has anything to do with wandb; it could just be some bug between CUDA and PyTorch. But I’ve seen it three or four times now, and only when running wandb sweeps. I’ve mostly been using hyperband early termination with my sweeps, and I sometimes kill jobs manually from the wandb web UI. So I suspect it has something to do with the way the agent kills the Python process that’s using the GPU; maybe it’s not cleaning up properly.
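If it really is the agent killing the process without the CUDA context being torn down cleanly, one workaround I might try is trapping the termination signal in the training script and releasing GPU memory explicitly before exiting. This is only a sketch and assumes the agent stops runs with SIGTERM, which I haven’t verified:

```python
# Sketch of a defensive shutdown hook: free CUDA state explicitly when the
# process is told to stop, rather than relying on interpreter teardown order.
# Assumes the wandb agent stops runs via SIGTERM; adjust if it uses another signal.
import gc
import signal
import sys

import torch

def _graceful_shutdown(signum, frame):
    gc.collect()              # drop lingering Python references to tensors
    torch.cuda.empty_cache()  # hand cached blocks back to the CUDA driver
    sys.exit(0)               # exit so normal cleanup can still run

signal.signal(signal.SIGTERM, _graceful_shutdown)
```

Whether this actually prevents the orphaned allocation is an open question; it just makes the shutdown path deterministic instead of whatever happens when the process is killed mid-step.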
Top GitHub Comments
I see, thanks for following up. We’re looking into fixing this issue.
Hello, I’m hitting the same error while using the sweep module. The program cannot free the GPU memory by itself, so I have to clean it up manually afterwards. It also seems I can’t kill the program with Ctrl+C: it prints a wandb log message saying “ctrl+c pressed” and keeps running as normal. If I press Ctrl+C twice, the program is killed but the GPU memory stays leaked. Any solutions for this issue? Thanks a lot.