Difference between training_step outputs and the input of training_epoch_end
🐛 Bug
The list of values returned by training_step is different from the outputs received by training_epoch_end.
This is my training_step code:
def training_step(self, batch, batch_idx):
    cfg = self.cfg
    x = batch[cfg.keys[0]]
    y = batch[cfg.keys[1]]
    y_hat = self.net(x)
    loss = self.loss_func(y_hat, y)
    print(loss)  # 1
    return {'loss': loss}

def training_epoch_end(self, outputs):
    print(outputs)  # 2
When I train the model on GPU, I have found that the two printed values are very different: the latter is nearly half of the former.
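For reference, here is a minimal reproduction sketch (hypothetical code, not the issue author's actual module; it assumes a Lightning version before 2.0, where the training_epoch_end hook still exists). With accumulate_grad_batches=2, the losses printed at #2 come out at roughly half of those printed at #1, matching the behaviour described above:

import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class ToyModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(4, 1)
        self.loss_func = torch.nn.MSELoss()

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.net(x)
        loss = self.loss_func(y_hat, y)
        print("training_step loss:", loss.item())  # 1
        return {'loss': loss}

    def training_epoch_end(self, outputs):
        # With accumulate_grad_batches=2, these stored losses are ~half of the values printed above.
        print("epoch_end losses:", [o['loss'].item() for o in outputs])  # 2

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)


if __name__ == '__main__':
    data = TensorDataset(torch.randn(32, 4), torch.randn(32, 1))
    trainer = pl.Trainer(max_epochs=1, accumulate_grad_batches=2)
    trainer.fit(ToyModule(), DataLoader(data, batch_size=8))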
cc @justusschock @awaelchli @akihironitta @rohitgr7 @carmocca @borda @ananthsub @ninginthecloud @jjenniferdai
It’s likely caused by this:
https://github.com/PyTorchLightning/pytorch-lightning/blob/184518c2fab188a9679a5b9d73ba95e3a8097280/pytorch_lightning/loops/optimization/optimizer_loop.py#L89
where normalize is accumulate_grad_batches:
https://github.com/PyTorchLightning/pytorch-lightning/blob/184518c2fab188a9679a5b9d73ba95e3a8097280/pytorch_lightning/loops/optimization/optimizer_loop.py#L436-L438

Oh okay… I thought it was the total length of outputs that gets halved with accumulation, not the loss values.

We normalize the loss so that the accumulated gradients end up with the correct effective scale.
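If the goal is to aggregate the unscaled per-batch losses at epoch end, one possible workaround is to return a detached copy of the loss under a separate key. This is only a sketch, not an official API; it assumes that extra keys in the dict returned from training_step are passed through to outputs without the accumulate_grad_batches rescaling that is applied to 'loss':

import torch
import pytorch_lightning as pl


class MyModule(pl.LightningModule):
    # ... __init__, configure_optimizers, self.net, self.loss_func, self.cfg as in the issue ...

    def training_step(self, batch, batch_idx):
        cfg = self.cfg
        x = batch[cfg.keys[0]]
        y = batch[cfg.keys[1]]
        y_hat = self.net(x)
        loss = self.loss_func(y_hat, y)
        # 'loss' is used for backward and may be rescaled by Lightning before it
        # reaches outputs; 'raw_loss' is a detached copy kept for epoch-end aggregation.
        return {'loss': loss, 'raw_loss': loss.detach()}

    def training_epoch_end(self, outputs):
        # Average the unscaled per-batch losses.
        mean_raw = torch.stack([o['raw_loss'] for o in outputs]).mean()
        print('mean raw loss:', mean_raw.item())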