[slurm] wandb hangs at the end of jobs in dryrun mode

wandb --version && python --version && uname

  • Weights and Biases version: 0.8.21
  • Python version: Python 3.6.8 :: Anaconda, Inc.
  • Operating System: CentOS Linux release 7.7.1908 (Core)

Description

I’m using wandb on a GPU cluster with Slurm to run jobs. After the script finishes, wandb prints the following:

wandb: Waiting for W&B process to finish, PID {some process id}
wandb: Program ended successfully.

The problem is that the Slurm scheduler does not end the job, so it keeps occupying the GPU node. Perhaps some wandb processes are still running?

I’m not sure whether the issue is with wandb or with the cluster I’m using. The cluster is one of the biggest in Canada, so I imagine other people hit this issue too, and it can leave a lot of nodes idle for no reason. It would be great to solve this.

Other clusters I’ve used with Ubuntu and Internet access worked fine.

I use WANDB_MODE=dryrun because the cluster doesn’t have access to the external network.
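For reference, the same setting can also be applied from inside the training script rather than in the job submission file. This is a minimal sketch, assuming the variable is set before wandb is initialised; only the `WANDB_MODE=dryrun` value comes from the report above.

```python
import os

# Assumption: this runs before wandb.init() is called, so wandb
# sees the value when it reads its environment configuration.
os.environ["WANDB_MODE"] = "dryrun"

print("WANDB_MODE =", os.environ["WANDB_MODE"])
```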

Update: My impression is that wandb tries to connect to the server after the script finishes, but because there is no connection, it raises an exception and the process gets stuck.

In one of my log files I found an additional line printed at the end regarding the connection:

wandb: Waiting for W&B process to finish, PID {some process id}
wandb: Program ended successfully.
wandb: ERROR Failed to connect to W&B. Retrying in the background.

What I Did

see above

Thanks.

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Reactions: 5
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
vanpelt commented, Aug 16, 2021

@lukekenworthy can you provide an example script? If you’re using multiprocessing in your scripts, you may need to explicitly call wandb.finish() in the process that called wandb.init once processing has completed.
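The suggestion above could be sketched roughly as follows. This is an illustration, not a confirmed fix for the hang: the helper name `run_with_tracking` and its `tracker` parameter are hypothetical, and `tracker` is assumed to be the imported `wandb` module, passed in explicitly so the pattern can be tried without wandb installed. Only `wandb.init` and `wandb.finish()` come from the comment itself.

```python
def run_with_tracking(tracker, train_fn, **init_kwargs):
    """Run train_fn between tracker.init() and tracker.finish().

    `tracker` is assumed to be the `wandb` module (hypothetical parameter).
    Call this from the same process that should own the run, since the
    comment above says finish() belongs in the process that called init.
    """
    run = tracker.init(**init_kwargs)
    try:
        return train_fn(run)
    finally:
        # Executes whether train_fn returns normally or raises,
        # so the run is always closed explicitly.
        tracker.finish()


# Hypothetical usage (requires wandb; "my-project" is a placeholder name):
#   import wandb
#   run_with_tracking(wandb, lambda run: run.log({"loss": 0.1}), project="my-project")
```

Putting the call in a `finally` block means the run is closed even if training fails partway through. Whether this resolves the Slurm hang described in the issue is not confirmed in the thread.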

1 reaction
lukekenworthy commented, Aug 15, 2021

I am having this problem as well. Has anyone ever figured this out?
