[slurm] wandb hangs at the end of jobs in dryrun mode

wandb --version && python --version && uname

Weights and Biases version: 0.8.21
Python version: Python 3.6.8 :: Anaconda, Inc.
Operating System: CentOS Linux release 7.7.1908 (Core)

Description

I’m using wandb on a GPU cluster that runs jobs through Slurm. After the script finishes, wandb prints the following:

wandb: Waiting for W&B process to finish, PID {some process id}
wandb: Program ended successfully.

The problem is that the Slurm scheduler doesn’t end the job, so it keeps occupying the GPU node. Perhaps some wandb processes are still running for some reason?
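One way to check that suspicion is to look for leftover processes on the node once the script has printed its final lines. The following is only a sketch: the helper names are hypothetical, it assumes a POSIX `ps` is available, and it simply matches the string "wandb" in each command line, so it may also match the training script itself.

```python
import subprocess


def parse_wandb_processes(ps_output):
    """Return (pid, command) pairs for `ps` lines whose command mentions wandb."""
    matches = []
    for line in ps_output.splitlines():
        parts = line.split(None, 1)  # expected shape: "<pid> <command line>"
        if len(parts) == 2 and parts[0].isdigit() and "wandb" in parts[1]:
            matches.append((int(parts[0]), parts[1]))
    return matches


def list_wandb_processes():
    """Run `ps` over all processes and keep the ones that mention wandb."""
    # "-o pid= -o args=" prints only the PID and command line, without headers.
    result = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "args="],
        stdout=subprocess.PIPE,
        universal_newlines=True,  # works on Python 3.6 as well
        check=True,
    )
    return parse_wandb_processes(result.stdout)
```

Calling `list_wandb_processes()` from a shell on the compute node while the job appears stuck would show whether anything wandb-related is still alive.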

I’m not sure whether the issue is with wandb or with the cluster I’m using. The cluster is one of the biggest in Canada, so I imagine other people hit this too, and it can leave a lot of nodes idle for no reason. It would be great to solve this.

Other clusters I’ve used with Ubuntu and Internet access worked fine.

I use WANDB_MODE=dryrun because the cluster has no access to external networks.
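For reference, a minimal sketch of setting that variable from Python instead of the job script. The `WANDB_MODE=dryrun` setting comes from the report above; the helper name `offline_env` is hypothetical.

```python
import os


def offline_env(base=None):
    """Return a copy of the environment with WANDB_MODE set to dryrun."""
    env = dict(os.environ if base is None else base)
    env["WANDB_MODE"] = "dryrun"  # value taken from the report above
    return env
```

The result can be passed as `env=offline_env()` to `subprocess.run` when launching the training script, which leaves the parent process environment untouched.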

Update: My impression is that wandb tries to connect to the server after the script finishes, but because there is no connection, it raises some exception and the process then gets stuck.

In one of my log files I found an additional line printed at the end regarding the connection:

wandb: Waiting for W&B process to finish, PID {some process id}
wandb: Program ended successfully.
wandb: ERROR Failed to connect to W&B. Retrying in the background.

What I Did

see above

Thanks.

Issue Analytics

State:
Created 4 years ago
Reactions: 5
Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction

vanpelt commented, Aug 16, 2021

@lukekenworthy can you provide an example script? If you’re using multiprocessing in your scripts, you may need to explicitly call wandb.finish() in the process that called wandb.init once processing has completed.
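A rough sketch of the pattern described in this comment, not a confirmed fix. It assumes a wandb version that provides `wandb.finish()`, and it takes the tracker as a parameter so the same function runs with any object exposing `init()`, `log()` and `finish()`. The names `run_job` and `square` are hypothetical.

```python
import multiprocessing as mp


def square(x):
    """Stand-in for the per-item work done in a child process."""
    return x * x


def run_job(tracker, inputs, workers=2):
    """Start a run, fan work out to child processes, then close the run.

    `tracker` is assumed to behave like the wandb module; in a real script
    you would pass `wandb` itself.
    """
    tracker.init()
    try:
        # "fork" is POSIX-only; assumption: the job runs on a Linux node.
        ctx = mp.get_context("fork")
        with ctx.Pool(workers) as pool:
            results = pool.map(square, inputs)
        tracker.log({"total": sum(results)})
        return results
    finally:
        # Runs in the same process that called init(), after the workers
        # have returned, even if the work above raised an exception.
        tracker.finish()
```

In a real script this would be called as `run_job(wandb, data)` after `import wandb`. The `try`/`finally` is a design choice so that the run is closed on error paths as well.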

1 reaction

lukekenworthy commented, Aug 15, 2021

I am having this problem as well. Has anyone ever figured this out?
