
WandB doesn't work with DDP


🐛 Bug Report

I get the following error while using WandbLogger with DDP enabled:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/usr/local/lib/python3.6/dist-packages/catalyst/core/runner.py", line 829, in _run_stage
    self._run_event("on_stage_start")
  File "/usr/local/lib/python3.6/dist-packages/catalyst/core/runner.py", line 784, in _run_event
    getattr(self, event)(self)
  File "/usr/local/lib/python3.6/dist-packages/catalyst/core/runner.py", line 687, in on_stage_start
    self.log_hparams(hparams=self.hparams, scope="stage")
  File "/usr/local/lib/python3.6/dist-packages/catalyst/core/runner.py", line 575, in log_hparams
    stage_key=self.stage_key,
  File "/usr/local/lib/python3.6/dist-packages/catalyst/loggers/wandb.py", line 171, in log_hparams
    self.run.config.update(hparams)
  File "/usr/local/lib/python3.6/dist-packages/wandb/sdk/wandb_run.py", line 471, in config
    return self._config
AttributeError: 'Run' object has no attribute '_config'

Currently, wandb is initialized in __init__, before the fork. This might be the source of the problem.
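
A minimal sketch (not Catalyst's code) of the pattern that avoids the error above: create the wandb run inside each spawned worker (here, rank 0 only) rather than reusing a Run object built in the parent before the worker processes start. A Run copied across the fork is missing per-process internals such as _config, which is what the AttributeError points at. The project name and offline mode below are placeholder choices for this sketch.

import torch.multiprocessing as mp
import wandb


def worker(rank, world_size):
    # Initialize the run in-process so its internal state is fully set up here.
    if rank == 0:
        run = wandb.init(project="ddp-debug", mode="offline")  # placeholder project name
        run.config.update({"world_size": world_size})          # no AttributeError here
        run.log({"rank": rank})
        run.finish()


if __name__ == "__main__":
    world_size = 2
    mp.spawn(worker, args=(world_size,), nprocs=world_size)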

Environment

Catalyst version: 21.09
PyTorch version: 1.9.1+cu102
Is debug build: No
CUDA used to build PyTorch: 10.2
TensorFlow version: N/A
TensorBoard version: 2.6.0

OS: linux
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
CMake version: Could not collect

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration: 
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB
GPU 2: Tesla V100-SXM2-32GB
GPU 3: Tesla V100-SXM2-32GB

Nvidia driver version: 455.45.01
cuDNN version: Could not collect

Versions of relevant libraries:
[pip3] catalyst==21.9
[pip3] numpy==1.19.5
[pip3] tensorboard==2.6.0
[pip3] tensorboard-data-server==0.6.1
[pip3] tensorboard-plugin-wit==1.8.0
[pip3] tensorboardX==2.2
[pip3] torch==1.9.1
[pip3] torchvision==0.10.1
[conda] Could not collect

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 1
  • Comments: 13 (11 by maintainers)

Top GitHub Comments

2 reactions
ivan-chai commented, Oct 6, 2021

The problem only occurs with the DDP setup. I didn't try previous versions of wandb.

1 reaction
Scitator commented, Dec 18, 2021

@AyushExel, could we add **kwargs passing to wandb.init?
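
A hypothetical illustration of that request; this is not Catalyst's actual WandbLogger, and the class and argument names are made up for the sketch (only wandb.init itself is the real API). Extra keyword arguments accepted by the logger are forwarded untouched to wandb.init, so DDP-friendly options such as group or reinit can be supplied by the caller.

import wandb


class KwargsWandbLogger:
    """Hypothetical logger that forwards extra keyword arguments to wandb.init."""

    def __init__(self, project, name=None, **kwargs):
        # Anything extra (e.g. group=..., reinit=True, settings=...) passes straight through.
        self.run = wandb.init(project=project, name=name, **kwargs)

    def log_hparams(self, hparams):
        self.run.config.update(hparams)


# Example usage, forwarding a DDP-friendly option through the logger:
# logger = KwargsWandbLogger(project="ddp-debug", group="experiment-1", reinit=True)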
