
Difference between training_step outputs and the input of training_epoch_end


🐛 Bug

The list of outputs returned by training_step differs from the outputs passed to training_epoch_end.

This is my training_step code:

def training_step(self, batch, batch_idx):
    cfg = self.cfg
    x = batch[cfg.keys[0]]  # model inputs
    y = batch[cfg.keys[1]]  # targets
    y_hat = self.net(x)
    loss = self.loss_func(y_hat, y)
    print(loss)  #1 loss as computed in training_step
    return {'loss': loss}

def training_epoch_end(self, outputs):
    print(outputs)  #2 per-step outputs collected over the epoch

When I train the model on a GPU, the two printed values differ considerably: the loss printed at #2 is roughly half of the loss printed at #1.

cc @justusschock @awaelchli @akihironitta @rohitgr7 @carmocca @borda @ananthsub @ninginthecloud @jjenniferdai

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 11 (6 by maintainers)

Top GitHub Comments

1 reaction
rohitgr7 commented, Apr 5, 2022

Oh okay… I thought the total length of outputs was halved with accumulation. We normalize the loss so that the effective (accumulated) gradients are computed accordingly.
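The maintainer's explanation can be illustrated with a minimal pure-Python sketch. It assumes gradient accumulation over N = 2 micro-batches (Lightning's `accumulate_grad_batches` setting, which the comment implies but the original report does not show) and uses a hypothetical one-parameter model; the helper names are illustrative only and are not Lightning internals. Scaling each loss by 1/N keeps the accumulated gradient equal to the mean gradient, and if the scaled loss is what gets collected for the epoch, each collected value is half of the value printed in training_step.

```python
# Hypothetical sketch of loss normalization under gradient accumulation.
# Plain Python floats stand in for tensors; this is NOT Lightning's code.


def squared_error(w, x, y):
    """Loss of a one-parameter linear model on a single example."""
    return (w * x - y) ** 2


def squared_error_grad(w, x, y):
    """Derivative of (w*x - y)**2 with respect to w."""
    return 2 * (w * x - y) * x


def accumulate(w, micro_batches, accumulate_grad_batches):
    """Accumulate gradients over micro-batches, scaling each loss by 1/N.

    Returns the accumulated gradient and the scaled losses, i.e. the
    values that would be seen if the normalized loss were the one stored.
    """
    total_grad = 0.0
    scaled_losses = []
    for x, y in micro_batches:
        # Dividing the loss by N divides its gradient by N, so summing over
        # N micro-batches yields the mean gradient rather than the sum.
        scaled_losses.append(squared_error(w, x, y) / accumulate_grad_batches)
        total_grad += squared_error_grad(w, x, y) / accumulate_grad_batches
    return total_grad, scaled_losses


w = 0.5
micro_batches = [(1.0, 2.0), (2.0, 0.0)]
n = 2  # assumed accumulate_grad_batches

grad, scaled = accumulate(w, micro_batches, n)
raw = [squared_error(w, x, y) for x, y in micro_batches]
mean_grad = sum(squared_error_grad(w, x, y) for x, y in micro_batches) / n

print(raw)              # [2.25, 1.0]   analogous to the values printed at #1
print(scaled)           # [1.125, 0.5]  half of #1, consistent with the report
print(grad, mean_grad)  # 0.5 0.5       accumulation reproduces the mean gradient
print([s * n for s in scaled])  # [2.25, 1.0]  multiplying by N recovers #1
```

Under that assumption, multiplying the collected values by N recovers the unscaled losses, while the gradient used for the optimizer step is unaffected by the normalization.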
