MPI_INIT failed when calling wandb.init()
See original GitHub issue- Weights and Biases version: 0.8.34
- Python version: 3.8.1
- Operating System: Arch Linux (kernel 5.5.7.arch-1)
Description
When using slurm to init wandb, it seems that it fail to init mpi. Maybe the error is due to wandb try to init mpi again when calling wandb.init().
Here is the related issue https://github.com/open-mpi/ompi/issues/7025#issuecomment-536826728
What I Did
srun: job 52496 queued and waiting for resources
srun: job 52496 has been allocated resources
Python 3.8.1 (default, Jan 22 2020, 06:38:00)
[GCC 9.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import wandb
>>> wandb.init()
wandb: WARNING Path /home/username/wandb/ wasn't writable, using system temp directory
wandb: WARNING Path /home/username/wandb/ wasn't writable, using system temp directory
wandb: WARNING Path /home/username/wandb/ wasn't writable, using system temp directory
wandb: WARNING Path /home/username/wandb/ wasn't writable, using system temp directory
wandb: WARNING Path /home/username/wandb/ wasn't writable, using system temp directory
wandb: WARNING Path /home/username/wandb/ wasn't writable, using system temp directory
wandb: WARNING Path /home/username/wandb/ wasn't writable, using system temp directory
wandb: WARNING Path /home/username/wandb/ wasn't writable, using system temp directory
wandb: W&B is a tool that helps track and visualize machine learning experiments
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose 'Don't visualize my results'
wandb: No credentials found. Run "wandb login" to visualize your metrics
wandb: Tracking run with wandb version 0.8.35
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[s15.speech:244894] Local abort before MPI_INIT completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
srun: Job step aborted: Waiting up to 182 seconds for job step to finish.
Issue Analytics
- State:
- Created 3 years ago
- Comments:20 (9 by maintainers)
Top Results From Across the Web
MPI_INIT failed when calling wandb.init() · Issue #1024 - GitHub
When using slurm to init wandb, it seems that it fail to init mpi. Maybe the error is due to wandb try to...
Read more >Every wandb command yields "An error occured in ...
init() in a python file and share the full error trace from there along with the debug.log and debug-internal.log files from the wandb...
Read more >Launch Experiments with wandb.init ... - Weights & Biases
Call wandb.init() once at the beginning of your script to initialize a new job. This creates a new run in W&B and launches...
Read more >wandb.init - Documentation - Weights & Biases
init() spawns a new background process to log data to a run, and it also syncs data to wandb.ai by default, so you...
Read more >Troubleshooting - Documentation - Weights & Biases - WandB
Calling wandb.log writes a line to a local file; it does not block any network calls. When you call wandb.init we launch a...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Hey @pohanchi @Liangtaiwan we’re spinning up our own Slurm environment to reproduce this. In the mean time, can one or both of you try our new CLI which should work better in these environments? You can find it here: https://github.com/wandb/client-ng
This works with the new cli,
pip install wandb
version 0.10+. To get your API key go to wandb.ai/authorize.