Difference between training_step outputs and the input of training_epoch_end
🐛 Bug
The list of values returned by training_step is different from the outputs received by training_epoch_end.
This is my training_step code:
def training_step(self, batch, batch_idx):
    cfg = self.cfg
    x = batch[cfg.keys[0]]
    y = batch[cfg.keys[1]]
    y_hat = self.net(x)
    loss = self.loss_func(y_hat, y)
    print(loss)  # 1
    return {'loss': loss}

def training_epoch_end(self, outputs):
    print(outputs)  # 2
When I train the model on GPU, I have found that the two printed values are very different: the latter is nearly half of the former.
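For reference, here is a minimal reproduction sketch (hypothetical code, not the issue author's actual module; it assumes a Lightning version before 2.0, where the training_epoch_end hook still exists). With accumulate_grad_batches=2, the losses printed at #2 come out at roughly half of those printed at #1, matching the behaviour described above:

import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class ToyModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(4, 1)
        self.loss_func = torch.nn.MSELoss()

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.net(x)
        loss = self.loss_func(y_hat, y)
        print("training_step loss:", loss.item())  # 1
        return {'loss': loss}

    def training_epoch_end(self, outputs):
        # With accumulate_grad_batches=2, these stored losses are ~half of the values printed above.
        print("epoch_end losses:", [o['loss'].item() for o in outputs])  # 2

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)


if __name__ == '__main__':
    data = TensorDataset(torch.randn(32, 4), torch.randn(32, 1))
    trainer = pl.Trainer(max_epochs=1, accumulate_grad_batches=2)
    trainer.fit(ToyModule(), DataLoader(data, batch_size=8))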
cc @justusschock @awaelchli @akihironitta @rohitgr7 @carmocca @borda @ananthsub @ninginthecloud @jjenniferdai
It’s likely caused by this:
https://github.com/PyTorchLightning/pytorch-lightning/blob/184518c2fab188a9679a5b9d73ba95e3a8097280/pytorch_lightning/loops/optimization/optimizer_loop.py#L89
where normalize is accumulate_grad_batches:
https://github.com/PyTorchLightning/pytorch-lightning/blob/184518c2fab188a9679a5b9d73ba95e3a8097280/pytorch_lightning/loops/optimization/optimizer_loop.py#L436-L438

Oh okay… I thought it was the total length of outputs that gets halved with accumulation, not the loss values.

We normalize the loss so that the accumulated gradients end up with the correct effective scale.
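If the goal is to aggregate the unscaled per-batch losses at epoch end, one possible workaround is to return a detached copy of the loss under a separate key. This is only a sketch, not an official API; it assumes that extra keys in the dict returned from training_step are passed through to outputs without the accumulate_grad_batches rescaling that is applied to 'loss':

import torch
import pytorch_lightning as pl


class MyModule(pl.LightningModule):
    # ... __init__, configure_optimizers, self.net, self.loss_func, self.cfg as in the issue ...

    def training_step(self, batch, batch_idx):
        cfg = self.cfg
        x = batch[cfg.keys[0]]
        y = batch[cfg.keys[1]]
        y_hat = self.net(x)
        loss = self.loss_func(y_hat, y)
        # 'loss' is used for backward and may be rescaled by Lightning before it
        # reaches outputs; 'raw_loss' is a detached copy kept for epoch-end aggregation.
        return {'loss': loss, 'raw_loss': loss.detach()}

    def training_epoch_end(self, outputs):
        # Average the unscaled per-batch losses.
        mean_raw = torch.stack([o['raw_loss'] for o in outputs]).mean()
        print('mean raw loss:', mean_raw.item())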