WandB doesn't work with DDP
🐛 Bug Report
I get the following error when using WandbLogger with DDP enabled:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/usr/local/lib/python3.6/dist-packages/catalyst/core/runner.py", line 829, in _run_stage
    self._run_event("on_stage_start")
  File "/usr/local/lib/python3.6/dist-packages/catalyst/core/runner.py", line 784, in _run_event
    getattr(self, event)(self)
  File "/usr/local/lib/python3.6/dist-packages/catalyst/core/runner.py", line 687, in on_stage_start
    self.log_hparams(hparams=self.hparams, scope="stage")
  File "/usr/local/lib/python3.6/dist-packages/catalyst/core/runner.py", line 575, in log_hparams
    stage_key=self.stage_key,
  File "/usr/local/lib/python3.6/dist-packages/catalyst/loggers/wandb.py", line 171, in log_hparams
    self.run.config.update(hparams)
  File "/usr/local/lib/python3.6/dist-packages/wandb/sdk/wandb_run.py", line 471, in config
    return self._config
AttributeError: 'Run' object has no attribute '_config'
Currently, wandb is initialized in __init__, before the fork. This might be the source of the problem: the spawned worker processes inherit a Run object that was never fully constructed in their own process.
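A common way to avoid this class of failure (not necessarily the fix Catalyst adopted) is to defer logger initialization until after the workers have been spawned, and to create the logger only on rank 0. Below is a minimal stdlib-only sketch of that pattern; init_logger is a hypothetical stand-in for wandb.init, so the example stays self-contained:

```python
import multiprocessing as mp


def init_logger(rank):
    # Hypothetical stand-in for wandb.init(). The key point is that it is
    # called *inside* the worker process, after the process has started,
    # so the returned object is fully constructed in that process rather
    # than inherited from a pre-fork parent.
    return {"rank": rank, "config": {}}


def make_logger(rank):
    # Only rank 0 creates a logger; the other ranks skip logging entirely.
    return init_logger(rank) if rank == 0 else None


def worker(rank, queue):
    logger = make_logger(rank)
    if logger is not None:
        # Safe: the logger's config belongs to this process.
        logger["config"].update({"lr": 1e-3})
    queue.put((rank, logger is not None))


def run(world_size=2):
    # "spawn" matches what torch.multiprocessing.spawn uses.
    ctx = mp.get_context("spawn")
    queue = ctx.Queue()
    procs = [ctx.Process(target=worker, args=(r, queue)) for r in range(world_size)]
    for p in procs:
        p.start()
    results = dict(queue.get() for _ in procs)
    for p in procs:
        p.join()
    return results


if __name__ == "__main__":
    print(run())
```

With the real library, the same guard would wrap the wandb.init call in each worker instead of the constructor of the logger object created before spawning.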
Environment
Catalyst version: 21.09
PyTorch version: 1.9.1+cu102
Is debug build: No
CUDA used to build PyTorch: 10.2
TensorFlow version: N/A
TensorBoard version: 2.6.0
OS: linux
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
CMake version: Could not collect
Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB
GPU 2: Tesla V100-SXM2-32GB
GPU 3: Tesla V100-SXM2-32GB
Nvidia driver version: 455.45.01
cuDNN version: Could not collect
Versions of relevant libraries:
[pip3] catalyst==21.9
[pip3] numpy==1.19.5
[pip3] tensorboard==2.6.0
[pip3] tensorboard-data-server==0.6.1
[pip3] tensorboard-plugin-wit==1.8.0
[pip3] tensorboardX==2.2
[pip3] torch==1.9.1
[pip3] torchvision==0.10.1
[conda] Could not collect
Issue Analytics
- Created: 2 years ago
- Reactions: 1
- Comments: 13 (11 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
The problem occurs only in the DDP setup. I didn’t try previous versions of wandb.
@AyushExel could we add **kwargs passing to wandb.init?
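The requested change could look roughly like the following sketch. This is not Catalyst's actual implementation; the class and the _init stub are hypothetical stand-ins (in the real logger, _init would be wandb.init), shown only to illustrate forwarding arbitrary keyword arguments through the logger:

```python
class WandbLoggerSketch:
    """Hypothetical sketch: forward extra keyword arguments to the
    underlying init call, so callers can pass options the logger does
    not know about (e.g. settings=..., reinit=..., mode=...)."""

    def __init__(self, project=None, name=None, **kwargs):
        # Merge the logger's own options with the pass-through kwargs.
        self.init_kwargs = {"project": project, "name": name, **kwargs}
        self.run = self._init(**self.init_kwargs)

    @staticmethod
    def _init(**kwargs):
        # Stand-in for wandb.init(**kwargs); returns the kwargs so the
        # sketch is runnable without wandb installed.
        return dict(kwargs)


logger = WandbLoggerSketch(project="demo", reinit=True, mode="offline")
print(logger.run)
```

The pass-through lets users work around init-time issues (such as choosing a start method or reinit behavior) without the logger having to grow a named parameter for every wandb.init option.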