[slurm] wandb hangs at the end of jobs in dryrun mode
See original GitHub issuewandb --version && python --version && uname
- Weights and Biases version: 0.8.21
- Python version: Python 3.6.8 :: Anaconda, Inc.
- Operating System: CentOS Linux release 7.7.1908 (Core)
Description
I’m using wandb on the GPU cluster with slurm to run jobs. After the script finishes, wandb prints the following:
wandb: Waiting for W&B process to finish, PID {some process id}
wandb: Program ended successfully.
The problem is that the slurm scheduler doesn’t quit this job and occupies the GPU node. Perhaps, for some reason some wandb processes are still running?
Not sure if the issue is with wandb or with the cluster I’m using. The cluster is actually one of the biggest in Canada, so I can imagine other people have this issue and it can result in a lot of nodes being idle for no reason. So would be great to solve this.
Other clusters I’ve used with Ubuntu and Internet access worked fine.
I use WANDB_MODE=dryrun, because the cluster doesn’t have access to external network.
Update My impression is that wandb tries to connect to the server after the script is finished, but because there is no connection, it raises some exception and the process gets stuck for some reason.
In one of my log files I found an additional line printed at the end regarding the connection:
wandb: Waiting for W&B process to finish, PID {some process id}
wandb: Program ended successfully.
wandb: ERROR Failed to connect to W&B. Retrying in the background.
What I Did
see above
Thanks.
Issue Analytics
- State:
- Created 4 years ago
- Reactions:5
- Comments:5 (3 by maintainers)
Top GitHub Comments
@lukekenworthy can you provide an example script? If you’re using multiprocessing in your scripts you may need to explicity call
wandb.finish()
in the process that called wandb.init once processing as completed.I am having this problem as well. Has anyone ever figured this out?