MPI_INIT failed when calling wandb.init()

Weights and Biases version: 0.8.34
Python version: 3.8.1
Operating System: Arch Linux (kernel 5.5.7.arch-1)

Description

When wandb is initialized inside a Slurm job, MPI initialization appears to fail. The error may be caused by wandb trying to initialize MPI again when wandb.init() is called.

What I Did

srun: job 52496 queued and waiting for resources
srun: job 52496 has been allocated resources
Python 3.8.1 (default, Jan 22 2020, 06:38:00)
[GCC 9.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import wandb
>>> wandb.init()
wandb: WARNING Path /home/username/wandb/ wasn't writable, using system temp directory
wandb: WARNING Path /home/username/wandb/ wasn't writable, using system temp directory
wandb: WARNING Path /home/username/wandb/ wasn't writable, using system temp directory
wandb: WARNING Path /home/username/wandb/ wasn't writable, using system temp directory
wandb: WARNING Path /home/username/wandb/ wasn't writable, using system temp directory
wandb: WARNING Path /home/username/wandb/ wasn't writable, using system temp directory
wandb: WARNING Path /home/username/wandb/ wasn't writable, using system temp directory
wandb: WARNING Path /home/username/wandb/ wasn't writable, using system temp directory
wandb: W&B is a tool that helps track and visualize machine learning experiments
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose 'Don't visualize my results'
wandb: No credentials found.  Run "wandb login" to visualize your metrics
wandb: Tracking run with wandb version 0.8.35
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[s15.speech:244894] Local abort before MPI_INIT completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
srun: Job step aborted: Waiting up to 182 seconds for job step to finish.
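
When reporting or reproducing a failure like the one in the session above, it can help to record which launcher-related environment variables the Python process sees under srun. The following is a minimal diagnostic sketch, not a fix. The helper name `launcher_env` and the prefix list are assumptions made for illustration; the list covers variable prefixes commonly set by Slurm and by MPI/PMI launchers and is not exhaustive.

```python
import os
from typing import Dict, Mapping, Optional

# Prefixes of environment variables commonly set by Slurm and MPI/PMI
# launchers. Illustrative assumption only; not an exhaustive list.
LAUNCHER_PREFIXES = ("SLURM_", "OMPI_", "PMI_", "PMIX_")


def launcher_env(environ: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return the launcher-related variables found in `environ`.

    Falls back to the current process environment when `environ` is None.
    """
    env = os.environ if environ is None else environ
    return {
        key: value
        for key, value in env.items()
        if key.startswith(LAUNCHER_PREFIXES)
    }


if __name__ == "__main__":
    found = launcher_env()
    if not found:
        print("No Slurm or MPI launcher variables detected.")
    for key in sorted(found):
        print(f"{key}={found[key]}")
```

Running this script once directly and once through srun, then attaching both outputs to the issue, shows what differs between the two environments before wandb.init() is called.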

Issue Analytics

Created 3 years ago
Comments: 20 (9 by maintainers)

Top GitHub Comments

1 reaction

vanpelt commented, May 18, 2020

Hey @pohanchi @Liangtaiwan, we’re spinning up our own Slurm environment to reproduce this. In the meantime, can one or both of you try our new CLI, which should work better in these environments? You can find it here: https://github.com/wandb/client-ng