Error when training with multiple target/losses on multiple GPUs
See original GitHub issue.
- PyTorch-Forecasting version: 0.8.4
- PyTorch version: I couldn’t find this in my poetry.lock??
- Python version: 3.8
- Operating System: Linux
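Regarding the missing PyTorch version above: one way to read it directly from the installed package, rather than from poetry.lock, is:

import torch

print(torch.__version__)   # installed PyTorch version
print(torch.version.cuda)  # CUDA build it was compiled against, if any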
Expected behavior
For training to just work as normal.
Actual behavior
Error is: RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
Traceback
-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/plugins/training_type/ddp_spawn.py", line 161, in new_process
    results = trainer.train_or_test_or_predict()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 556, in train_or_test_or_predict
    results = self.run_train()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 637, in run_train
    self.train_loop.run_training_epoch()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/training_loop.py", line 492, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/training_loop.py", line 654, in run_training_batch
    self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/training_loop.py", line 425, in optimizer_step
    model_ref.optimizer_step(
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/core/lightning.py", line 1390, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/core/optimizer.py", line 214, in step
    self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/core/optimizer.py", line 134, in __optimizer_step
    trainer.accelerator.optimizer_step(optimizer, self._optimizer_idx, lambda_closure=closure, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/accelerators/accelerator.py", line 277, in optimizer_step
    self.run_optimizer_step(optimizer, opt_idx, lambda_closure, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/accelerators/accelerator.py", line 282, in run_optimizer_step
    self.training_type_plugin.optimizer_step(optimizer, lambda_closure=lambda_closure, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 163, in optimizer_step
    optimizer.step(closure=lambda_closure, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/optim/optimizer.py", line 89, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_forecasting/optim.py", line 131, in step
    _ = closure()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/training_loop.py", line 648, in train_step_and_backward_closure
    result = self.training_step_and_backward(
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/training_loop.py", line 755, in training_step_and_backward
    self.backward(result, optimizer, opt_idx)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/training_loop.py", line 784, in backward
    result.closure_loss = self.trainer.accelerator.backward(
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/accelerators/accelerator.py", line 256, in backward
    output = self.precision_plugin.backward(
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 71, in backward
    model.backward(closure_loss, optimizer, opt_idx)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/core/lightning.py", line 1251, in backward
    loss.backward(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
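For context (not part of the original report): this is PyTorch's generic error when .backward() is called on a tensor that is not attached to the autograd graph, i.e. the loss arriving at Lightning's backward call has no grad_fn. A minimal standalone snippet reproduces the same message:

import torch

x = torch.randn(3)  # requires_grad is False by default
loss = x.sum()      # no grad_fn, because no input requires grad
loss.backward()     # RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

In this issue the loss is presumably detached somewhere along the multi-target/multi-GPU path, since the error is only reported when both are used together.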
Code to reproduce the problem
import warnings
import os

import numpy as np
import pandas as pd
import torch
from pandas.core.common import SettingWithCopyWarning

import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping, LearningRateMonitor

from pytorch_forecasting import (
    TemporalFusionTransformer,
    TimeSeriesDataSet,
)
from pytorch_forecasting.data import EncoderNormalizer, MultiNormalizer
from pytorch_forecasting.metrics import MultiLoss, QuantileLoss

warnings.simplefilter("error", category=SettingWithCopyWarning)
torch.set_printoptions(sci_mode=False)

CPUS = os.cpu_count()
TRAINING_SIZE = 1000
VALIDATION_SIZE = 100
MAX_PREDICTION_LENGTH = 5


def train(function):
    data = create_dataframes(0, TRAINING_SIZE, function)

    batch_size = 16
    max_encoder_length = MAX_PREDICTION_LENGTH * 10
    training_cutoff = data["time_idx"].max() - MAX_PREDICTION_LENGTH

    training = TimeSeriesDataSet(
        data[lambda x: x.time_idx <= training_cutoff],
        time_idx="time_idx",
        target=["target1", "target2"],  # two targets -> multi-target training
        group_ids=["group"],
        min_encoder_length=max_encoder_length // 2,
        max_encoder_length=max_encoder_length,
        max_prediction_length=MAX_PREDICTION_LENGTH,
        time_varying_known_reals=["base"],
        time_varying_unknown_reals=["target1", "target2"],
        # Explicitly defined normalizers, one per target
        target_normalizer=MultiNormalizer(
            normalizers=[
                EncoderNormalizer(transformation="relu"),
                EncoderNormalizer(transformation="relu"),
            ]
        ),
    )
    train_dataloader = training.to_dataloader(
        train=True, batch_size=batch_size, num_workers=CPUS
    )

    validation = TimeSeriesDataSet.from_dataset(
        training, data, min_prediction_idx=training_cutoff + 1, stop_randomization=True
    )
    val_dataloader = validation.to_dataloader(
        train=False, batch_size=batch_size * 10, num_workers=CPUS
    )

    early_stop_callback = EarlyStopping(
        monitor="val_loss", min_delta=1e-4, patience=3, verbose=False, mode="min"
    )
    lr_logger = LearningRateMonitor()

    trainer = pl.Trainer(
        max_epochs=10,
        gpus=3,  # multi-GPU run; the traceback above shows the ddp_spawn plugin
        weights_summary="top",
        gradient_clip_val=0.1,
        callbacks=[lr_logger, early_stop_callback],
    )

    tft = TemporalFusionTransformer.from_dataset(
        training,
        learning_rate=0.01,
        hidden_size=5,
        attention_head_size=1,
        dropout=0.1,
        hidden_continuous_size=8,
        output_size=[7, 7],  # one output size per target
        loss=MultiLoss([QuantileLoss(), QuantileLoss()]),  # one loss per target
        log_interval=10,
        log_val_interval=1,
        reduce_on_plateau_patience=3,
    )

    trainer.fit(
        tft,
        train_dataloader=train_dataloader,
        val_dataloaders=val_dataloader,
    )


def create_dataframes(start, size, functions):
    base = np.arange(start, start + size, 1.0)
    target1 = np.vectorize(functions[0])(base)
    target2 = np.vectorize(functions[1])(base)
    return pd.DataFrame(
        dict(
            base=base,
            target1=target1,
            target2=target2,
            group=np.full(size, 0),
            time_idx=np.arange(0, size, 1),
        )
    )


def quadratic1(x):
    return (x * x) + (7 * x) + 3


def quadratic2(x):
    return 2 * (x * x) + (8 * x) + 4


if __name__ == "__main__":
    train([quadratic1, quadratic2])
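One detail of the script worth noting: output_size=[7, 7] corresponds to one output per quantile per target, because QuantileLoss defaults to seven quantiles. A quick check (assuming the 0.8.x metrics API, where the quantiles are stored on the loss object):

from pytorch_forecasting.metrics import QuantileLoss

print(QuantileLoss().quantiles)       # [0.02, 0.1, 0.25, 0.5, 0.75, 0.9, 0.98]
print(len(QuantileLoss().quantiles))  # 7, matching each entry of output_size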
Issue Analytics
- Created: 2 years ago
- Comments: 9 (4 by maintainers)
Yes, and it’s the same error.
BTW, I forgot to say thank you for this project 😃
Same error as @tombh when using multiple targets with the TFT model and the “ddp” accelerator (2 GPUs), but if I just change to a single target, it is OK. Any progress here?
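Based on the reports in this thread (the failure only shows up with multiple targets on multiple GPUs), a possible stopgap while the ddp/ddp_spawn path is investigated is to keep the multi-target setup but train on a single GPU. A sketch of the changed Trainer construction, otherwise identical to the reproduction script above:

trainer = pl.Trainer(
    max_epochs=10,
    gpus=1,  # single GPU: the error has only been reported with 2 or more GPUs
    weights_summary="top",
    gradient_clip_val=0.1,
    callbacks=[lr_logger, early_stop_callback],
)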