Error when training with multiple targets/losses on multiple GPUs

See original GitHub issue
  • PyTorch-Forecasting version: 0.8.4
  • PyTorch version: I couldn’t find this in my poetry.lock?? (a quick way to check it is sketched below)
  • Python version: 3.8
  • Operating System: Linux
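
If the installed PyTorch version is hard to find in poetry.lock, it can be read from the package itself; a quick check, assuming a standard install:

import torch

# Prints the version of the torch package actually installed in this environment
print(torch.__version__)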

Expected behavior

Training runs to completion, as it does on a single GPU.

Actual behavior

Error is: RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

Traceback
-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/plugins/training_type/ddp_spawn.py", line 161, in new_process
    results = trainer.train_or_test_or_predict()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 556, in train_or_test_or_predict
    results = self.run_train()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 637, in run_train
    self.train_loop.run_training_epoch()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/training_loop.py", line 492, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/training_loop.py", line 654, in run_training_batch
    self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/training_loop.py", line 425, in optimizer_step
    model_ref.optimizer_step(
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/core/lightning.py", line 1390, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/core/optimizer.py", line 214, in step
    self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/core/optimizer.py", line 134, in __optimizer_step
    trainer.accelerator.optimizer_step(optimizer, self._optimizer_idx, lambda_closure=closure, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/accelerators/accelerator.py", line 277, in optimizer_step
    self.run_optimizer_step(optimizer, opt_idx, lambda_closure, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/accelerators/accelerator.py", line 282, in run_optimizer_step
    self.training_type_plugin.optimizer_step(optimizer, lambda_closure=lambda_closure, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 163, in optimizer_step
    optimizer.step(closure=lambda_closure, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/optim/optimizer.py", line 89, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_forecasting/optim.py", line 131, in step
    _ = closure()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/training_loop.py", line 648, in train_step_and_backward_closure
    result = self.training_step_and_backward(
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/training_loop.py", line 755, in training_step_and_backward
    self.backward(result, optimizer, opt_idx)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/training_loop.py", line 784, in backward
    result.closure_loss = self.trainer.accelerator.backward(
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/accelerators/accelerator.py", line 256, in backward
    output = self.precision_plugin.backward(
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 71, in backward
    model.backward(closure_loss, optimizer, opt_idx)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/core/lightning.py", line 1251, in backward
    loss.backward(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
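
For context, torch.autograd raises this RuntimeError whenever backward() is called on a tensor that is not attached to the autograd graph. A minimal sketch (unrelated to the model above) that produces the same message:

import torch

# A tensor built like this has requires_grad=False and no grad_fn,
# so backward() has nothing to trace and raises the error above.
loss = torch.tensor(1.0)
loss.backward()  # RuntimeError: element 0 of tensors does not require grad ...

Which suggests that, somewhere between MultiLoss and the ddp_spawn plugin, the loss that reaches backward() has already been detached from the graph.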

Code to reproduce the problem

import warnings
import os

import numpy as np
import pandas as pd
import torch
from pandas.core.common import SettingWithCopyWarning

import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping, LearningRateMonitor
from pytorch_forecasting import (
    EncoderNormalizer,  # normalizers used for the explicit target_normalizer below
    MultiNormalizer,
    TemporalFusionTransformer,
    TimeSeriesDataSet,
)
from pytorch_forecasting.metrics import MultiLoss, QuantileLoss

warnings.simplefilter("error", category=SettingWithCopyWarning)
torch.set_printoptions(sci_mode=False)

CPUS = os.cpu_count()
TRAINING_SIZE = 1000
VALIDATION_SIZE = 100
MAX_PREDICTION_LENGTH = 5


def train(function):
    data = create_dataframes(0, TRAINING_SIZE, function)

    batch_size = 16
    max_encoder_length = MAX_PREDICTION_LENGTH * 10
    training_cutoff = data["time_idx"].max() - MAX_PREDICTION_LENGTH

    training = TimeSeriesDataSet(
        data[lambda x: x.time_idx <= training_cutoff],
        time_idx="time_idx",
        target=["target1", "target2"],
        group_ids=["group"],
        min_encoder_length=max_encoder_length // 2,
        max_encoder_length=max_encoder_length,
        max_prediction_length=MAX_PREDICTION_LENGTH,
        time_varying_known_reals=["base"],
        time_varying_unknown_reals=["target1", "target2"],
        # Explicitly defined normalizers
        target_normalizer=MultiNormalizer(
            normalizers=[
                EncoderNormalizer(transformation="relu"),
                EncoderNormalizer(transformation="relu"),
            ]
        ),
    )
    train_dataloader = training.to_dataloader(
        train=True, batch_size=batch_size, num_workers=CPUS
    )

    validation = TimeSeriesDataSet.from_dataset(
        training, data, min_prediction_idx=training_cutoff + 1, stop_randomization=True
    )
    val_dataloader = validation.to_dataloader(
        train=False, batch_size=batch_size * 10, num_workers=CPUS
    )

    early_stop_callback = EarlyStopping(
        monitor="val_loss", min_delta=1e-4, patience=3, verbose=False, mode="min"
    )

    lr_logger = LearningRateMonitor()

    trainer = pl.Trainer(
        max_epochs=10,
        gpus=3,
        weights_summary="top",
        gradient_clip_val=0.1,
        callbacks=[lr_logger, early_stop_callback],
    )

    tft = TemporalFusionTransformer.from_dataset(
        training,
        learning_rate=0.01,
        hidden_size=5,
        attention_head_size=1,
        dropout=0.1,
        hidden_continuous_size=8,
        output_size=[7, 7],
        loss=MultiLoss([QuantileLoss(), QuantileLoss()]),
        log_interval=10,
        log_val_interval=1,
        reduce_on_plateau_patience=3,
    )

    trainer.fit(
        tft,
        train_dataloader=train_dataloader,
        val_dataloaders=val_dataloader,
    )


def create_dataframes(start, size, functions):
    base = np.arange(start, start + size, 1.0)
    target1 = np.vectorize(functions[0])(base)
    target2 = np.vectorize(functions[1])(base)
    return pd.DataFrame(
        dict(
            base=base,
            target1=target1,
            target2=target2,
            group=np.full(size, 0),
            time_idx=np.arange(0, size, 1),
        )
    )


def quadratic1(x):
    return (x * x) + (7 * x) + 3


def quadratic2(x):
    return 2 * (x * x) + (8 * x) + 4


if __name__ == "__main__":
    train([quadratic1, quadratic2])
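
With gpus=3 and no explicit backend, this PyTorch Lightning version falls back to the ddp_spawn plugin (visible in the traceback above). To check whether the failure is specific to the spawn variant, the backend can be pinned explicitly; a sketch, assuming the accelerator argument of this Trainer version:

# Hypothetical variation of the Trainer above: pin the distributed backend
# instead of relying on the default chosen for gpus=3.
trainer = pl.Trainer(
    max_epochs=10,
    gpus=3,
    accelerator="ddp",  # or "ddp_spawn" to match the traceback
    weights_summary="top",
    gradient_clip_val=0.1,
    callbacks=[lr_logger, early_stop_callback],
)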

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 9 (4 by maintainers)

Top GitHub Comments

2 reactions
tombh commented, Apr 5, 2021

Yes, and it’s the same error.

BTW, I forgot to say thank you for this project 😃

1 reaction
LumingSun commented, Nov 11, 2021

I get the same error as @tombh when using multiple targets with the TFT model and the “ddp” accelerator (2 GPUs), but if I change to a single target it works. Any progress here?
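
Since the single-target case reportedly trains fine, one way to narrow the bug down is to run the reproduction script with only target1. A sketch of that single-target variant, reusing the names from the script above:

# Single-target variant: same data and hyperparameters, but only target1,
# a single EncoderNormalizer, a scalar output_size, and a plain QuantileLoss.
training = TimeSeriesDataSet(
    data[lambda x: x.time_idx <= training_cutoff],
    time_idx="time_idx",
    target="target1",
    group_ids=["group"],
    min_encoder_length=max_encoder_length // 2,
    max_encoder_length=max_encoder_length,
    max_prediction_length=MAX_PREDICTION_LENGTH,
    time_varying_known_reals=["base"],
    time_varying_unknown_reals=["target1"],
    target_normalizer=EncoderNormalizer(transformation="relu"),
)

tft = TemporalFusionTransformer.from_dataset(
    training,
    learning_rate=0.01,
    hidden_size=5,
    attention_head_size=1,
    dropout=0.1,
    hidden_continuous_size=8,
    output_size=7,        # one quantile head instead of a list of two
    loss=QuantileLoss(),  # single loss instead of MultiLoss
    log_interval=10,
    log_val_interval=1,
    reduce_on_plateau_patience=3,
)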

