
WandB doesn't work with DDP


🐛 Bug Report

I get the following error while using WandbLogger with DDP enabled:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/usr/local/lib/python3.6/dist-packages/catalyst/core/runner.py", line 829, in _run_stage
    self._run_event("on_stage_start")
  File "/usr/local/lib/python3.6/dist-packages/catalyst/core/runner.py", line 784, in _run_event
    getattr(self, event)(self)
  File "/usr/local/lib/python3.6/dist-packages/catalyst/core/runner.py", line 687, in on_stage_start
    self.log_hparams(hparams=self.hparams, scope="stage")
  File "/usr/local/lib/python3.6/dist-packages/catalyst/core/runner.py", line 575, in log_hparams
    stage_key=self.stage_key,
  File "/usr/local/lib/python3.6/dist-packages/catalyst/loggers/wandb.py", line 171, in log_hparams
    self.run.config.update(hparams)
  File "/usr/local/lib/python3.6/dist-packages/wandb/sdk/wandb_run.py", line 471, in config
    return self._config
AttributeError: 'Run' object has no attribute '_config'

Currently, wandb is initialized in __init__, before the fork. This might be the source of the problem.
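
A minimal sketch (not Catalyst's code) of the pattern that avoids the error above: create the wandb run inside each spawned worker (here, rank 0 only) rather than reusing a Run object built in the parent before the worker processes start. A Run copied across the fork is missing per-process internals such as _config, which is what the AttributeError points at. The project name and offline mode below are placeholder choices for this sketch.

import torch.multiprocessing as mp
import wandb


def worker(rank, world_size):
    # Initialize the run in-process so its internal state is fully set up here.
    if rank == 0:
        run = wandb.init(project="ddp-debug", mode="offline")  # placeholder project name
        run.config.update({"world_size": world_size})          # no AttributeError here
        run.log({"rank": rank})
        run.finish()


if __name__ == "__main__":
    world_size = 2
    mp.spawn(worker, args=(world_size,), nprocs=world_size)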

Environment

Catalyst version: 21.09
PyTorch version: 1.9.1+cu102
Is debug build: No
CUDA used to build PyTorch: 10.2
TensorFlow version: N/A
TensorBoard version: 2.6.0

OS: linux
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
CMake version: Could not collect

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration: 
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB
GPU 2: Tesla V100-SXM2-32GB
GPU 3: Tesla V100-SXM2-32GB

Nvidia driver version: 455.45.01
cuDNN version: Could not collect

Versions of relevant libraries:
[pip3] catalyst==21.9
[pip3] numpy==1.19.5
[pip3] tensorboard==2.6.0
[pip3] tensorboard-data-server==0.6.1
[pip3] tensorboard-plugin-wit==1.8.0
[pip3] tensorboardX==2.2
[pip3] torch==1.9.1
[pip3] torchvision==0.10.1
[conda] Could not collect

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 1
  • Comments: 13 (11 by maintainers)

Top GitHub Comments

2 reactions
ivan-chai commented, Oct 6, 2021

The problem only occurs with the DDP setup. I didn't try previous versions of wandb.

1 reaction
Scitator commented, Dec 18, 2021

@AyushExel, could we add **kwargs passing to wandb.init?
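
A hypothetical illustration of that request; this is not Catalyst's actual WandbLogger, and the class and argument names are made up for the sketch (only wandb.init itself is the real API). Extra keyword arguments accepted by the logger are forwarded untouched to wandb.init, so DDP-friendly options such as group or reinit can be supplied by the caller.

import wandb


class KwargsWandbLogger:
    """Hypothetical logger that forwards extra keyword arguments to wandb.init."""

    def __init__(self, project, name=None, **kwargs):
        # Anything extra (e.g. group=..., reinit=True, settings=...) passes straight through.
        self.run = wandb.init(project=project, name=name, **kwargs)

    def log_hparams(self, hparams):
        self.run.config.update(hparams)


# Example usage, forwarding a DDP-friendly option through the logger:
# logger = KwargsWandbLogger(project="ddp-debug", group="experiment-1", reinit=True)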
