wandb.watch(model) fails when using torch.nn.DataParallel(model)
See original GitHub issue
- Weights and Biases version: 0.8.25
- Python version: 3.7.6
- Operating System: Linux
Description
I am trying to call wandb.watch(model) to monitor the model and also use torch.nn.DataParallel(model) for multi-GPU training. However, the two seem to be incompatible.
What I Did
I have tried the following orderings:

# Attempt 1: watch, then wrap
wandb.watch(model)
model = nn.DataParallel(model)

# Attempt 2: wrap, then watch the wrapper
model = nn.DataParallel(model)
wandb.watch(model)

# Attempt 3: wrap, then watch the underlying module
model = nn.DataParallel(model)
wandb.watch(model.module)
All three orderings raise the same error; a minimal reproduction sketch and the resulting traceback follow.
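For reference, a minimal script along these lines should hit the crash (a sketch with a toy model and a hypothetical project name; any of the orderings above triggers it):

import torch
import torch.nn as nn
import wandb

wandb.init(project="dp-watch-repro")  # hypothetical project name

model = nn.Linear(10, 2).cuda()
wandb.watch(model)               # attaches wandb's gradient and graph hooks
model = nn.DataParallel(model)   # replicate the module across available GPUs

x = torch.randn(8, 10).cuda()
loss = model(x).sum()
loss.backward()                  # crashes inside wandb's backward hook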
Traceback (most recent call last):
  File "train.py", line 71, in <module>
    main()
  File "train.py", line 67, in main
    trainer.train(args.num_epochs)
  File "/home/timbrooks/code/adv_rec/training/trainer.py", line 78, in train
    self._step_decoder(image)
  File "/home/timbrooks/code/adv_rec/training/trainer.py", line 99, in _step_decoder
    loss.backward()
  File "/home/timbrooks/anaconda3/envs/adv_rec/lib/python3.7/site-packages/torch/tensor.py", line 166, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/timbrooks/anaconda3/envs/adv_rec/lib/python3.7/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/home/timbrooks/anaconda3/envs/adv_rec/lib/python3.7/site-packages/wandb/wandb_torch.py", line 359, in backward_hook
    wandb.run.summary["graph_%i" % graph_idx] = graph
  File "/home/timbrooks/anaconda3/envs/adv_rec/lib/python3.7/site-packages/wandb/summary.py", line 134, in __setitem__
    self._root._root_set(path, [(k, v)])
  File "/home/timbrooks/anaconda3/envs/adv_rec/lib/python3.7/site-packages/wandb/summary.py", line 254, in _root_set
    json_dict[new_key] = self._encode(new_value, path + (new_key,))
  File "/home/timbrooks/anaconda3/envs/adv_rec/lib/python3.7/site-packages/wandb/summary.py", line 323, in _encode
    friendly_value, converted = util.json_friendly(data_types.val_to_json(self._run, path, value))
  File "/home/timbrooks/anaconda3/envs/adv_rec/lib/python3.7/site-packages/wandb/data_types.py", line 1365, in val_to_json
    val.bind_to_run(run, key, step)
  File "/home/timbrooks/anaconda3/envs/adv_rec/lib/python3.7/site-packages/wandb/data_types.py", line 973, in bind_to_run
    super(Graph, self).bind_to_run(*args, **kwargs)
  File "/home/timbrooks/anaconda3/envs/adv_rec/lib/python3.7/site-packages/wandb/data_types.py", line 162, in bind_to_run
    raise RuntimeError('Value is already bound to a Run: {}'.format(self))
RuntimeError: Value is already bound to a Run: <wandb.wandb_torch.TorchGraph object at 0x7f3501e99950>
wandb: Waiting for W&B process to finish, PID 50739
wandb: Program failed with code 1. Press ctrl-c to abort syncing.
wandb: Process crashed early, not syncing files
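From the trace, the failure appears to be that wandb's backward_hook writes the traced model graph into run.summary, and under DataParallel the hook fires more than once for the same graph index, so the second write tries to re-bind a TorchGraph that is already bound to the run. One way to sidestep the crash is to skip graph capture entirely; the following sketch assumes a newer wandb client that exposes the log_graph argument to wandb.watch (0.8.25 does not):

import torch.nn as nn
import wandb

wandb.init(project="dp-watch-workaround")  # hypothetical project name

model = nn.Linear(10, 2).cuda()
# Log gradients and parameters, but skip the graph capture that triggers
# the double bind_to_run. log_graph here is assumed from newer releases.
wandb.watch(model, log="all", log_freq=100, log_graph=False)
model = nn.DataParallel(model)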
An experimental pip package with the fix is available via:
pip install --upgrade git+git://github.com/wandb/client.git@fix/pytorch-watch-parallel#egg=wandb
Thanks for the fix! Just made a new issue for the duplicate logs: https://github.com/wandb/client/issues/856