wandb.watch(model) fails when using torch.nn.DataParallel(model)

  • Weights and Biases version: 0.8.25
  • Python version: 3.7.6
  • Operating System: Linux

Description

I am trying to call wandb.watch(model) to monitor the model, and also use torch.nn.DataParallel(model) for multi-GPU training. However, the two seem to be incompatible.

What I Did

I have tried the following orderings:

# 1. Watch the model, then wrap it
wandb.watch(model)
model = nn.DataParallel(model)

# 2. Wrap the model, then watch the wrapper
model = nn.DataParallel(model)
wandb.watch(model)

# 3. Wrap the model, then watch the underlying module
model = nn.DataParallel(model)
wandb.watch(model.module)

All of them raise the same error, shown in the traceback below.

Traceback (most recent call last):
  File "train.py", line 71, in <module>
    main()
  File "train.py", line 67, in main
    trainer.train(args.num_epochs)
  File "/home/timbrooks/code/adv_rec/training/trainer.py", line 78, in train
    self._step_decoder(image)
  File "/home/timbrooks/code/adv_rec/training/trainer.py", line 99, in _step_decoder
    loss.backward()
  File "/home/timbrooks/anaconda3/envs/adv_rec/lib/python3.7/site-packages/torch/tensor.py", line 166, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/timbrooks/anaconda3/envs/adv_rec/lib/python3.7/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/home/timbrooks/anaconda3/envs/adv_rec/lib/python3.7/site-packages/wandb/wandb_torch.py", line 359, in backward_hook
    wandb.run.summary["graph_%i" % graph_idx] = graph
  File "/home/timbrooks/anaconda3/envs/adv_rec/lib/python3.7/site-packages/wandb/summary.py", line 134, in __setitem__
    self._root._root_set(path, [(k, v)])
  File "/home/timbrooks/anaconda3/envs/adv_rec/lib/python3.7/site-packages/wandb/summary.py", line 254, in _root_set
    json_dict[new_key] = self._encode(new_value, path + (new_key,))
  File "/home/timbrooks/anaconda3/envs/adv_rec/lib/python3.7/site-packages/wandb/summary.py", line 323, in _encode
    friendly_value, converted = util.json_friendly(data_types.val_to_json(self._run, path, value))
  File "/home/timbrooks/anaconda3/envs/adv_rec/lib/python3.7/site-packages/wandb/data_types.py", line 1365, in val_to_json
    val.bind_to_run(run, key, step)
  File "/home/timbrooks/anaconda3/envs/adv_rec/lib/python3.7/site-packages/wandb/data_types.py", line 973, in bind_to_run
    super(Graph, self).bind_to_run(*args, **kwargs)
  File "/home/timbrooks/anaconda3/envs/adv_rec/lib/python3.7/site-packages/wandb/data_types.py", line 162, in bind_to_run
    raise RuntimeError('Value is already bound to a Run: {}'.format(self))
RuntimeError: Value is already bound to a Run: <wandb.wandb_torch.TorchGraph object at 0x7f3501e99950>
wandb: Waiting for W&B process to finish, PID 50739
wandb: Program failed with code 1. Press ctrl-c to abort syncing.
wandb: Process crashed early, not syncing files
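
For anyone trying to reproduce this outside the original trainer, a self-contained script along the following lines exercises the same wandb.watch + nn.DataParallel combination. This is only a minimal sketch with a hypothetical model, data, and project name (it is not the adv_rec code), and it assumes a machine with at least two GPUs so that DataParallel actually replicates the module:

import torch
import torch.nn as nn
import wandb

wandb.init(project="dataparallel-watch-repro")  # placeholder project name

model = nn.Linear(10, 1).cuda()    # hypothetical stand-in for the real model
wandb.watch(model)                 # ordering 1 from above: watch, then wrap
model = nn.DataParallel(model)

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()

for _ in range(3):
    x = torch.randn(8, 10, device="cuda")  # placeholder data
    y = torch.randn(8, 1, device="cuda")
    loss = criterion(model(x), y)
    optimizer.zero_grad()
    loss.backward()                # in the reported setup, this is where "Value is already bound to a Run" is raised
    optimizer.step()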

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Reactions: 2
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

1 reaction
raubitsj commented, Feb 8, 2020

An experimental pip package is available; install it with this command: pip install --upgrade git+git://github.com/wandb/client.git@fix/pytorch-watch-parallel#egg=wandb
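
If that branch behaves as intended, the natural usage would presumably be the plain ordering from the issue again. The snippet below is only a sketch of that assumption (placeholder model and project name); it is not a confirmed description of what the fix changes:

# Sketch only: assumes the experimental branch above has been installed with
#   pip install --upgrade git+git://github.com/wandb/client.git@fix/pytorch-watch-parallel#egg=wandb
import torch.nn as nn
import wandb

wandb.init(project="dataparallel-watch")  # placeholder project name
model = nn.Linear(10, 1).cuda()           # placeholder model
wandb.watch(model)                        # watch the plain module...
model = nn.DataParallel(model)            # ...then wrap it for multi-GPU training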

0 reactions
timothybrooks commented, Feb 10, 2020

Thanks for the fix! Just made a new issue for the duplicate logs: https://github.com/wandb/client/issues/856


Top Results From Across the Web

  • [CLI] It seems that wanbd make my parallel model fail ... - GitHub
    After a test, I found that it was caused by adding wandb.watch(model). import torch import wandb from torch.nn.parallel import DataParallel ...
  • Log distributed training experiments - Weights & Biases - Wandb
    This is a common solution for logging distributed training experiments with the PyTorch Distributed Data Parallel (DDP) Class. In some cases, users funnel... (a rank-zero logging sketch follows this list)
  • Problems about torch.nn.DataParallel - Stack Overflow
    This example shows how to use a model on a single GPU, setting the device using .to() instead of .cuda(). from torch...
  • Trainer — transformers 4.5.0.dev0 documentation
    You can still use your own models defined as torch.nn. ... This is also not the same under DataParallel where gpu0 may require...
  • Changelog — PyTorch Lightning 1.8.5.post0 documentation
    Fixed torchscript error with containers of LightningModules (#14904) ... Fixed an issue where the model wrapper in Lite converted non-floating point tensors ...
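
As an illustration of the rank-zero logging pattern mentioned in the W&B distributed-training result above, the following sketch initializes and watches W&B only from rank 0 under DistributedDataParallel. The model, data, and project name are placeholders, and it assumes the script is launched with torchrun on a single node with one GPU per process; it is not code taken from the linked guide:

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
import wandb

def main():
    # torchrun sets RANK/WORLD_SIZE/MASTER_ADDR, so no explicit init args are needed
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)  # assumes local rank == global rank (single node)

    model = nn.Linear(10, 1).cuda()        # placeholder model
    model = DDP(model, device_ids=[rank])

    if rank == 0:
        wandb.init(project="ddp-logging-sketch")  # placeholder project name
        wandb.watch(model.module)                 # watch the unwrapped module

    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    criterion = nn.MSELoss()
    for _ in range(10):
        x = torch.randn(8, 10, device="cuda")     # placeholder data
        y = torch.randn(8, 1, device="cuda")
        loss = criterion(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if rank == 0:
            wandb.log({"loss": loss.item()})

    dist.destroy_process_group()

if __name__ == "__main__":
    main()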
