MPI_INIT failed when calling wandb.init()

  • Weights and Biases version: 0.8.34
  • Python version: 3.8.1
  • Operating System: Arch Linux (kernel 5.5.7.arch-1)

Description

When wandb is initialized inside a Slurm job, MPI initialization fails. The error may be caused by wandb trying to initialize MPI again when wandb.init() is called.

Here is the related issue https://github.com/open-mpi/ompi/issues/7025#issuecomment-536826728
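
For anyone trying to reproduce or report this, it may help to record which launcher-related environment variables are visible to the Python process before wandb.init() runs. The sketch below is only a diagnostic aid, not a fix. The helper name mpi_related_env and the prefix list are assumptions chosen for illustration; which variables actually influence MPI start-up depends on the MPI build and the Slurm configuration.

```python
import os

# Assumed prefixes for variables set by Slurm or read by common MPI launchers.
# This list is illustrative, not exhaustive.
MPI_ENV_PREFIXES = ("SLURM_", "PMI_", "PMIX_", "OMPI_")


def mpi_related_env(environ=None):
    """Return the environment variables whose names start with a known prefix."""
    if environ is None:
        environ = os.environ
    return {
        name: value
        for name, value in sorted(environ.items())
        if name.startswith(MPI_ENV_PREFIXES)
    }


if __name__ == "__main__":
    # Run this inside the same srun session, before importing wandb.
    for name, value in mpi_related_env().items():
        print(f"{name}={value}")
```

Attaching this output to a bug report alongside the traceback gives maintainers a clearer picture of the session in which the abort happened.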

What I Did

srun: job 52496 queued and waiting for resources
srun: job 52496 has been allocated resources
Python 3.8.1 (default, Jan 22 2020, 06:38:00)
[GCC 9.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import wandb
>>> wandb.init()
wandb: WARNING Path /home/username/wandb/ wasn't writable, using system temp directory
wandb: WARNING Path /home/username/wandb/ wasn't writable, using system temp directory
wandb: WARNING Path /home/username/wandb/ wasn't writable, using system temp directory
wandb: WARNING Path /home/username/wandb/ wasn't writable, using system temp directory
wandb: WARNING Path /home/username/wandb/ wasn't writable, using system temp directory
wandb: WARNING Path /home/username/wandb/ wasn't writable, using system temp directory
wandb: WARNING Path /home/username/wandb/ wasn't writable, using system temp directory
wandb: WARNING Path /home/username/wandb/ wasn't writable, using system temp directory
wandb: W&B is a tool that helps track and visualize machine learning experiments
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose 'Don't visualize my results'
wandb: No credentials found.  Run "wandb login" to visualize your metrics
wandb: Tracking run with wandb version 0.8.35
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[s15.speech:244894] Local abort before MPI_INIT completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
srun: Job step aborted: Waiting up to 182 seconds for job step to finish.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 20 (9 by maintainers)

Top GitHub Comments

1 reaction
vanpelt commented, May 18, 2020

Hey @pohanchi @Liangtaiwan, we’re spinning up our own Slurm environment to reproduce this. In the meantime, can one or both of you try our new CLI, which should work better in these environments? You can find it here: https://github.com/wandb/client-ng

pip install wandb-ng
0 reactions
cvphelps commented, Dec 2, 2020

This works with the new CLI, which ships with wandb version 0.10+ (pip install wandb). To get your API key, go to wandb.ai/authorize.
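
Since the comment above ties the fix to version 0.10+, a script can check the installed wandb version before calling wandb.init(). This is a minimal sketch assuming Python 3.8 or later (for importlib.metadata); the helper names parse_version, is_at_least, and installed_wandb_version are hypothetical and not part of wandb.

```python
from importlib import metadata


def parse_version(text):
    """Return the leading numeric components of a version string as ints."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def is_at_least(installed, minimum=(0, 10)):
    """Compare only as many components as the minimum specifies."""
    return parse_version(installed)[: len(minimum)] >= minimum


def installed_wandb_version():
    """Return the installed wandb version string, or None if it is absent."""
    try:
        return metadata.version("wandb")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_wandb_version()
    if version is None:
        print("wandb is not installed")
    elif is_at_least(version):
        print(f"wandb {version} is 0.10 or later")
    else:
        print(f"wandb {version} predates 0.10; consider upgrading")
```

The comparison uses integer tuples rather than strings so that 0.10 sorts after 0.8 and 0.9.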

