wandb.watch(model) fails when using torch.nn.DataParallel(model)

  • Weights and Biases version: 0.8.25
  • Python version: 3.7.6
  • Operating System: Linux

Description

I am trying to call wandb.watch(model) to monitor the model, and also use torch.nn.DataParallel(model) for multi-GPU training. However, the two seem to be incompatible.

What I Did

I have tried the following orderings:

# 1. Watch the model, then wrap it
wandb.watch(model)
model = nn.DataParallel(model)

# 2. Wrap the model, then watch the wrapper
model = nn.DataParallel(model)
wandb.watch(model)

# 3. Wrap the model, then watch the underlying module
model = nn.DataParallel(model)
wandb.watch(model.module)

All of them raise the same error, shown in the traceback below.

Traceback (most recent call last):
  File "train.py", line 71, in <module>
    main()
  File "train.py", line 67, in main
    trainer.train(args.num_epochs)
  File "/home/timbrooks/code/adv_rec/training/trainer.py", line 78, in train
    self._step_decoder(image)
  File "/home/timbrooks/code/adv_rec/training/trainer.py", line 99, in _step_decoder
    loss.backward()
  File "/home/timbrooks/anaconda3/envs/adv_rec/lib/python3.7/site-packages/torch/tensor.py", line 166, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/timbrooks/anaconda3/envs/adv_rec/lib/python3.7/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/home/timbrooks/anaconda3/envs/adv_rec/lib/python3.7/site-packages/wandb/wandb_torch.py", line 359, in backward_hook
    wandb.run.summary["graph_%i" % graph_idx] = graph
  File "/home/timbrooks/anaconda3/envs/adv_rec/lib/python3.7/site-packages/wandb/summary.py", line 134, in __setitem__
    self._root._root_set(path, [(k, v)])
  File "/home/timbrooks/anaconda3/envs/adv_rec/lib/python3.7/site-packages/wandb/summary.py", line 254, in _root_set
    json_dict[new_key] = self._encode(new_value, path + (new_key,))
  File "/home/timbrooks/anaconda3/envs/adv_rec/lib/python3.7/site-packages/wandb/summary.py", line 323, in _encode
    friendly_value, converted = util.json_friendly(data_types.val_to_json(self._run, path, value))
  File "/home/timbrooks/anaconda3/envs/adv_rec/lib/python3.7/site-packages/wandb/data_types.py", line 1365, in val_to_json
    val.bind_to_run(run, key, step)
  File "/home/timbrooks/anaconda3/envs/adv_rec/lib/python3.7/site-packages/wandb/data_types.py", line 973, in bind_to_run
    super(Graph, self).bind_to_run(*args, **kwargs)
  File "/home/timbrooks/anaconda3/envs/adv_rec/lib/python3.7/site-packages/wandb/data_types.py", line 162, in bind_to_run
    raise RuntimeError('Value is already bound to a Run: {}'.format(self))
RuntimeError: Value is already bound to a Run: <wandb.wandb_torch.TorchGraph object at 0x7f3501e99950>
wandb: Waiting for W&B process to finish, PID 50739
wandb: Program failed with code 1. Press ctrl-c to abort syncing.
wandb: Process crashed early, not syncing files
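
For anyone trying to reproduce this outside the original trainer, a self-contained script along the following lines exercises the same wandb.watch + nn.DataParallel combination. This is only a minimal sketch with a hypothetical model, data, and project name (it is not the adv_rec code), and it assumes a machine with at least two GPUs so that DataParallel actually replicates the module:

import torch
import torch.nn as nn
import wandb

wandb.init(project="dataparallel-watch-repro")  # placeholder project name

model = nn.Linear(10, 1).cuda()    # hypothetical stand-in for the real model
wandb.watch(model)                 # ordering 1 from above: watch, then wrap
model = nn.DataParallel(model)

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()

for _ in range(3):
    x = torch.randn(8, 10, device="cuda")  # placeholder data
    y = torch.randn(8, 1, device="cuda")
    loss = criterion(model(x), y)
    optimizer.zero_grad()
    loss.backward()                # in the reported setup, this is where "Value is already bound to a Run" is raised
    optimizer.step()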

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Reactions: 2
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

1 reaction
raubitsj commented, Feb 8, 2020

An experimental pip package is available; install it with this command: pip install --upgrade git+git://github.com/wandb/client.git@fix/pytorch-watch-parallel#egg=wandb
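
If that branch behaves as intended, the natural usage would presumably be the plain ordering from the issue again. The snippet below is only a sketch of that assumption (placeholder model and project name); it is not a confirmed description of what the fix changes:

# Sketch only: assumes the experimental branch above has been installed with
#   pip install --upgrade git+git://github.com/wandb/client.git@fix/pytorch-watch-parallel#egg=wandb
import torch.nn as nn
import wandb

wandb.init(project="dataparallel-watch")  # placeholder project name
model = nn.Linear(10, 1).cuda()           # placeholder model
wandb.watch(model)                        # watch the plain module...
model = nn.DataParallel(model)            # ...then wrap it for multi-GPU training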

0 reactions
timothybrooks commented, Feb 10, 2020

Thanks for the fix! Just made a new issue for the duplicate logs: https://github.com/wandb/client/issues/856


Top Results From Across the Web

  • [CLI] It seems that wanbd make my parallel model fail ... - GitHub
    After a test, I found that it was caused by adding wandb.watch(model). import torch import wandb from torch.nn.parallel import DataParallel ...
  • Log distributed training experiments - Weights & Biases - Wandb
    This is a common solution for logging distributed training experiments with the PyTorch Distributed Data Parallel (DDP) Class. In some cases, users funnel... (a rank-zero logging sketch follows this list)
  • Problems about torch.nn.DataParallel - Stack Overflow
    This example shows how to use a model on a single GPU, setting the device using .to() instead of .cuda(). from torch...
  • Trainer — transformers 4.5.0.dev0 documentation
    You can still use your own models defined as torch.nn. ... This is also not the same under DataParallel where gpu0 may require...
  • Changelog — PyTorch Lightning 1.8.5.post0 documentation
    Fixed torchscript error with containers of LightningModules (#14904) ... Fixed an issue where the model wrapper in Lite converted non-floating point tensors ...
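
As an illustration of the rank-zero logging pattern mentioned in the W&B distributed-training result above, the following sketch initializes and watches W&B only from rank 0 under DistributedDataParallel. The model, data, and project name are placeholders, and it assumes the script is launched with torchrun on a single node with one GPU per process; it is not code taken from the linked guide:

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
import wandb

def main():
    # torchrun sets RANK/WORLD_SIZE/MASTER_ADDR, so no explicit init args are needed
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)  # assumes local rank == global rank (single node)

    model = nn.Linear(10, 1).cuda()        # placeholder model
    model = DDP(model, device_ids=[rank])

    if rank == 0:
        wandb.init(project="ddp-logging-sketch")  # placeholder project name
        wandb.watch(model.module)                 # watch the unwrapped module

    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    criterion = nn.MSELoss()
    for _ in range(10):
        x = torch.randn(8, 10, device="cuda")     # placeholder data
        y = torch.randn(8, 1, device="cuda")
        loss = criterion(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if rank == 0:
            wandb.log({"loss": loss.item()})

    dist.destroy_process_group()

if __name__ == "__main__":
    main()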
