Data is not moved to the correct device
🐛 Bug
Hi, I have a model which requires data-dependent initialization (DDI). It only needs one example to configure the correct starting weights for a certain layer. For that I have set up the following hook in my LightningModule:
```python
def on_fit_start(self) -> None:
    if self.ddi:
        for f in self.decoder.flows:
            if getattr(f, "set_ddi", False):
                f.set_ddi(True)
        _LOGGER.info("Doing data-dependent model initialization")
        for batch in self.trainer.datamodule.train_dataloader():
            x, x_lengths, y, y_lengths, speaker_ids, _ = batch
            with torch.no_grad():
                self.forward(x, x_lengths, y, y_lengths, g=speaker_ids)
            break
```
This was working fine in version 1.5, but it is broken in version 1.6. I get:

```
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper__index_select)
```
While debugging, I found that the data is on device -1, which, if I understand correctly, actually means all GPUs. This is the case in both versions.
Also, although it works in version 1.5, `self.device` is invalid inside the forward method:

```
x = self.emb(x) * torch.sqrt(torch.tensor([self.hidden_channels], device=self.device))  # [b, t, h]
  File "/home/aalvarez/.virtualenvs/tts-train-XZ1ykfT_-py3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1185, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'TextEncoder' object has no attribute 'device'
```
To Reproduce
In any model, simply define the following hook:

```python
def on_fit_start(self) -> None:
    for batch in self.trainer.datamodule.train_dataloader():
        with torch.no_grad():
            self.forward(*batch)
        break
```
Expected behavior
The data should be moved to the correct device.
Environment
* CUDA:
- GPU:
- NVIDIA GeForce RTX 2080 Ti
- NVIDIA GeForce RTX 2080 Ti
- NVIDIA GeForce RTX 2080 Ti
- NVIDIA GeForce RTX 2080 Ti
- NVIDIA GeForce RTX 2080 Ti
- NVIDIA GeForce RTX 2080 Ti
- NVIDIA GeForce RTX 2080 Ti
- NVIDIA GeForce RTX 2080 Ti
- NVIDIA GeForce RTX 2080 Ti
- NVIDIA GeForce RTX 2080 Ti
- available: True
- version: 10.2
* Packages:
- numpy: 1.23.0
- pyTorch_debug: False
- pyTorch_version: 1.11.0+cu102
- pytorch-lightning: 1.5.10 or 1.6.4
- tqdm: 4.64.0
* System:
- OS: Linux
- architecture:
- 64bit
- ELF
- processor: x86_64
- python: 3.8.10
- version: #113-Ubuntu SMP Thu Feb 3 18:43:29 UTC 2022
Issue Analytics
- Created a year ago
- Comments: 6 (4 by maintainers)
For training they are not. But for validation they are: if you want to validate on X different datasets, you can do so by passing a sequence of X dataloaders as the validation data. Batches from different loaders sometimes need to be treated differently, which is why you always receive the index of the respective loader (0 if only a single loader is given).
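A torch-only sketch of the idea behind that loader index (names here are illustrative, not from the issue): when several loaders are given, each batch can be tagged with the index of the loader it came from, so you can treat batches differently per source dataset.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Two small validation sets; in Lightning you would return such a list
# from val_dataloader() and receive dataloader_idx in validation_step().
loaders = [
    DataLoader(TensorDataset(torch.zeros(4, 2)), batch_size=2),
    DataLoader(TensorDataset(torch.ones(4, 2)), batch_size=2),
]

seen = []
for dataloader_idx, loader in enumerate(loaders):
    for batch_idx, (x,) in enumerate(loader):
        # dataloader_idx tells us which dataset this batch came from
        seen.append((dataloader_idx, batch_idx, float(x.mean())))
```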
Since you are accessing the raw dataloader manually yourself, you need to move the data to the right device when you iterate over it.
Yes, on a regular `nn.Module` this is not defined; you can only access `self.device` on a `LightningModule`.
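For a plain `nn.Module` like the `TextEncoder` in the traceback, one common workaround is to infer the device from the module's own parameters. A sketch (the class name is illustrative; this assumes the module has at least one parameter):

```python
import torch
import torch.nn as nn

class TextEncoderLike(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(10, 4)

    @property
    def device(self) -> torch.device:
        # Infer the device from the first parameter; on a LightningModule
        # this property already exists and tracks the Trainer's device.
        return next(self.parameters()).device

enc = TextEncoderLike()
```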