
Data is not moved to the correct device


🐛 Bug

Hi, I have a model that requires data-dependent initialization (DDI). It only needs one batch to configure the correct starting weights for a certain layer. For that I have set up the following hook in my LightningModule:

    def on_fit_start(self) -> None:
        if self.ddi:
            # Enable DDI mode on every flow layer that supports it
            for f in self.decoder.flows:
                if getattr(f, "set_ddi", False):
                    f.set_ddi(True)

            _LOGGER.info("Doing data-dependent model initialization")
            # Run a single batch through the model to set the starting weights
            for batch in self.trainer.datamodule.train_dataloader():
                x, x_lengths, y, y_lengths, speaker_ids, _ = batch
                with torch.no_grad():
                    self.forward(x, x_lengths, y, y_lengths, g=speaker_ids)
                break

This was working fine in version 1.5, but it is now broken in version 1.6. I get:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper__index_select)

While debugging, I found that the data is on device -1, which, if I understand correctly, means all GPUs. This is the case for both versions.

Also, although the hook works on version 1.5, self.device is invalid inside the forward method:

    x = self.emb(x) * torch.sqrt(torch.tensor([self.hidden_channels], device=self.device))  # [b, t, h]
  File "/home/aalvarez/.virtualenvs/tts-train-XZ1ykfT_-py3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1185, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'TextEncoder' object has no attribute 'device'

To Reproduce

In any model, just set the following hook:

    def on_fit_start(self) -> None:
        for batch in self.trainer.datamodule.train_dataloader():
            with torch.no_grad():
                self.forward(*batch)
            break

Expected behavior

Data should be on the correct device

Environment

* CUDA:
        - GPU:
                - NVIDIA GeForce RTX 2080 Ti
                - NVIDIA GeForce RTX 2080 Ti
                - NVIDIA GeForce RTX 2080 Ti
                - NVIDIA GeForce RTX 2080 Ti
                - NVIDIA GeForce RTX 2080 Ti
                - NVIDIA GeForce RTX 2080 Ti
                - NVIDIA GeForce RTX 2080 Ti
                - NVIDIA GeForce RTX 2080 Ti
                - NVIDIA GeForce RTX 2080 Ti
                - NVIDIA GeForce RTX 2080 Ti
        - available:         True
        - version:           10.2
* Packages:
        - numpy:             1.23.0
        - pyTorch_debug:     False
        - pyTorch_version:   1.11.0+cu102
        - pytorch-lightning: 1.5.10  or 1.6.4
        - tqdm:              4.64.0
* System:
        - OS:                Linux
        - architecture:
                - 64bit
                - ELF
        - processor:         x86_64
        - python:            3.8.10
        - version:           #113-Ubuntu SMP Thu Feb 3 18:43:29 UTC 2022

cc @justusschock @awaelchli @ninginthecloud @rohitgr7 @otaj

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

1 reaction
justusschock commented, Jun 30, 2022

For training they are not, but for validation they are. If you want to validate on X different datasets, you can do so by passing a Sequence of X dataloaders as the validation data. Sometimes batches from different loaders need to be treated differently, which is why you always get the index of the respective loader (0 if only a single loader is given).
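
As a rough illustration (a sketch with made-up data, not code from this thread), a LightningModule can return several validation dataloaders and branch on dataloader_idx in validation_step:

    import torch
    import torch.nn.functional as F
    from torch.utils.data import DataLoader, TensorDataset
    import pytorch_lightning as pl

    class MultiValModel(pl.LightningModule):
        # training_step and configure_optimizers omitted for brevity

        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(8, 1)

        def val_dataloader(self):
            # Two hypothetical validation sets: Lightning iterates over both
            ds_a = TensorDataset(torch.randn(32, 8), torch.randn(32, 1))
            ds_b = TensorDataset(torch.randn(32, 8), torch.randn(32, 1))
            return [DataLoader(ds_a, batch_size=8), DataLoader(ds_b, batch_size=8)]

        def validation_step(self, batch, batch_idx, dataloader_idx=0):
            # dataloader_idx identifies which validation loader this batch
            # came from (it is 0 when only a single loader is passed)
            x, y = batch
            loss = F.mse_loss(self.layer(x), y)
            self.log(f"val_loss/loader_{dataloader_idx}", loss)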

1 reaction
awaelchli commented, Jun 29, 2022

Since you are accessing the raw dataloader yourself, you need to move the data to the right device when you iterate over it. So, in your code example:

    def on_fit_start(self) -> None:
        for batch in self.trainer.datamodule.train_dataloader():
            with torch.no_grad():
                # Add this (assuming batch is a tensor); .to() is not in-place,
                # so the result has to be reassigned
                batch = batch.to(self.device)
                # self.forward(batch)  # don't do this
                self(batch)  # do this
            break
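
One possible adaptation for the DDI hook from this issue, where the batch is a tuple of tensors rather than a single tensor, is sketched below. It assumes pytorch_lightning.utilities.move_data_to_device, which moves tensors nested inside tuples, lists, and dicts; this is a sketch, not code from the thread.

    from pytorch_lightning.utilities import move_data_to_device

    def on_fit_start(self) -> None:
        if self.ddi:
            for f in self.decoder.flows:
                if getattr(f, "set_ddi", False):
                    f.set_ddi(True)

            for batch in self.trainer.datamodule.train_dataloader():
                # Recursively move every tensor in the (tuple) batch onto the
                # LightningModule's device before running the forward pass
                batch = move_data_to_device(batch, self.device)
                x, x_lengths, y, y_lengths, speaker_ids, _ = batch
                with torch.no_grad():
                    self(x, x_lengths, y, y_lengths, g=speaker_ids)
                break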

“Also, although it works on version 1.5, self.device is invalid inside the forward method.”

Yes, on a regular nn.Module, self.device is not defined; you can only access it on a LightningModule.
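
A common workaround inside a plain nn.Module (a sketch only; the TextEncoder layout below is a guess based on the traceback, not the reporter's actual code) is to take the device from the input tensor, or from one of the module's own parameters, instead of self.device:

    import torch
    import torch.nn as nn

    class TextEncoder(nn.Module):
        def __init__(self, n_vocab: int, hidden_channels: int):
            super().__init__()
            self.hidden_channels = hidden_channels
            self.emb = nn.Embedding(n_vocab, hidden_channels)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            x = self.emb(x)
            # x.device (or self.emb.weight.device) is always defined, whereas
            # self.device only exists on a LightningModule
            scale = torch.sqrt(
                torch.tensor([self.hidden_channels], device=x.device, dtype=x.dtype)
            )
            return x * scale  # [b, t, h]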
