
Trainer reported loss is wrong when using DeepSpeed and gradient_accumulation_steps > 1

See original GitHub issue

Environment info

  • transformers version: 4.7.0.dev0
  • Platform: Windows-10-10.0.19041-SP0
  • Python version: 3.8.0
  • PyTorch version (GPU?): 1.8.1 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: <fill in>
  • Using distributed or parallel set-up in script?: no but using DeepSpeed on a single node

Who can help

@stas00, @sgugger

Model I am using: Roberta

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQUaD task: (give the name)
  • my own task or dataset: (give details below)
  • pretraining a Language Model (wikipedia and bookcorpus datasets)

To reproduce

Steps to reproduce the behavior:

  1. run scripts to pretrain a model with DeepSpeed on a single node with 1 GPU for N steps (gradient_accum_steps=1)
  2. run scripts to pretrain a model with DeepSpeed on a single node with 1 GPU for N steps (gradient_accum_steps=8)
  3. note the vast difference in the loss reported on the console by the Trainer between the two runs
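
The discrepancy in step 3 can be sketched as follows. This is an illustrative toy calculation with made-up loss values, not the actual Trainer or DeepSpeed code: with gradient accumulation, the engine divides each micro-batch loss by gradient_accumulation_steps (GAS) before backward() so that accumulated gradients average correctly, and if the already-scaled value is what gets logged, the console loss appears GAS times smaller.

```python
# Toy illustration (hypothetical numbers) of GAS-scaled loss reporting.
gas = 8
micro_batch_losses = [2.4, 2.6, 2.5, 2.3, 2.7, 2.5, 2.4, 2.6]  # made-up values

true_mean = sum(micro_batch_losses) / len(micro_batch_losses)
print(true_mean)  # 2.5

# What a GAS-aware engine backpropagates per micro-batch:
scaled = [loss / gas for loss in micro_batch_losses]

# Summing the scaled values over one accumulation window recovers the
# true mean loss:
print(sum(scaled))  # 2.5

# But averaging the scaled values as if they were full losses reports a
# number GAS times too small:
print(sum(scaled) / gas)  # 0.3125
```

This matches the symptom reported here: with gradient_accum_steps=8 the console loss is roughly 8x smaller than the gradient_accum_steps=1 run, even though training is otherwise identical.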

Expected behavior

The reported loss should, for any number of gradient_accum_steps, nodes, or GPUs, be the mean of all losses; that is, the same order of magnitude as when training with gradient_accum_steps=1 on a single node with a single GPU.

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments:6 (5 by maintainers)

Top GitHub Comments

rfernand2 commented, Jun 3, 2021

Yes, sounds good to me.

tjruwase commented, Jun 3, 2021

@stas00, you are right: my suggestion here is not correct. I initially thought the problem was the DeepSpeed code scaling the loss by GAS and exposing the scaled value to the client (HF). But based on your and @sgugger's findings, it seems there is nothing to do if HF is fine with deepspeed.backward() returning the GAS-scaled loss.

Sounds like this issue can be closed, once @rfernand2 agrees.
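
If the engine's backward() returns a GAS-scaled loss, the client can compensate before logging. A minimal sketch of that idea (illustrative only; the function name and placement are assumptions, not the actual HF Trainer code):

```python
# Illustrative sketch: undo GAS scaling for reporting so that the logged
# loss is comparable across gradient_accumulation_steps settings.

def reportable_loss(scaled_loss: float, gradient_accumulation_steps: int) -> float:
    """Recover a GAS-independent loss value from an engine-scaled one."""
    return scaled_loss * gradient_accumulation_steps

# A scaled loss of 0.3125 with GAS=8 reports as the true mean, 2.5:
print(reportable_loss(0.3125, 8))  # 2.5
```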

