Trainer reported loss is wrong when using DeepSpeed and gradient_accumulation_steps > 1
Environment info
- transformers version: 4.7.0.dev0
- Platform: Windows-10-10.0.19041-SP0
- Python version: 3.8.0
- PyTorch version (GPU?): 1.8.1 (True)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: no, but using DeepSpeed on a single node
Who can help
@stas00, @sgugger (trainer.py)
See Also
https://github.com/microsoft/DeepSpeed/issues/1107
Information
Model I am using: Roberta
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQUaD task: (give the name)
- my own task or dataset: (give details below)
- pretraining a Language Model (wikipedia and bookcorpus datasets)
To reproduce
Steps to reproduce the behavior:
- run the scripts to pretrain a model with DeepSpeed on a single node with 1 GPU for N steps (gradient_accum_steps=1)
- run the scripts to pretrain a model with DeepSpeed on a single node with 1 GPU for N steps (gradient_accum_steps=8)
- note the vast difference in the loss reported on the console by trainer.py
Expected behavior
The loss reported for any number of gradient_accum_steps, nodes, or GPUs should be the mean of all losses, i.e. the same order of magnitude as shown when training with gradient_accum_steps=1 on a single node with a single GPU.
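To make the magnitude gap concrete, here is a minimal numeric sketch. It assumes (as the discussion below suggests) that each micro-batch loss is pre-divided by gradient_accumulation_steps before being handed back to the caller; averaging those returned values then understates the true mean loss by a factor of GAS. This is an illustration of the arithmetic, not the actual Trainer code.

```python
# Losses for 8 micro-batches of one accumulation window (illustrative values).
micro_batch_losses = [2.0, 2.5, 1.5, 2.0, 2.25, 1.75, 2.0, 2.0]
gas = 8  # gradient_accumulation_steps

# What the user expects to see logged: the mean micro-batch loss.
true_mean_loss = sum(micro_batch_losses) / len(micro_batch_losses)

# Assumed mechanism: each loss is divided by GAS before being returned.
returned = [loss / gas for loss in micro_batch_losses]

# Naively averaging the returned values divides by GAS a second time.
logged_loss = sum(returned) / len(returned)

print(true_mean_loss)  # 2.0
print(logged_loss)     # 0.25, i.e. true_mean_loss / gas
```

With gradient_accum_steps=1 the two quantities coincide, which is why the console numbers only diverge once GAS > 1.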
Issue Analytics
- Created: 2 years ago
- Comments: 6 (5 by maintainers)
Yes, sounds good to me.
@stas00, you are right, my suggestion here is not correct. I initially thought the problem was that the DeepSpeed code scales the loss by GAS and exposes the scaled value to the client (HF). But based on your and @sgugger's findings, it seems there is nothing to do if HF is fine with deepspeed.backward() returning the GAS-scaled loss. Sounds like this issue can be closed, once @rfernand2 agrees.
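If the caller does accept a GAS-scaled loss from backward(), the console value can still be made comparable to a GAS=1 run by multiplying it back before logging. The sketch below shows that idea with a toy stand-in engine; `FakeEngine` and `training_step` are hypothetical names for illustration, not the transformers or DeepSpeed API.

```python
def training_step(engine, batch, gradient_accumulation_steps):
    """Run one micro-step and return a loss suitable for console logging."""
    loss = engine(batch)  # stand-in for the forward pass
    # Assumed behavior: the engine's backward path works with a loss that has
    # been divided by gradient_accumulation_steps.
    scaled = loss / gradient_accumulation_steps
    # Re-scale before logging so the reported value matches a GAS=1 run.
    return scaled * gradient_accumulation_steps

class FakeEngine:
    """Toy engine that always returns the same loss, for illustration only."""
    def __call__(self, batch):
        return 2.0

reported = training_step(FakeEngine(), batch=None, gradient_accumulation_steps=8)
print(reported)  # 2.0 regardless of gradient_accumulation_steps
```

Whether the re-scaling belongs in the client or in DeepSpeed itself is exactly the question the thread settles: HF keeps accepting the scaled value, so no change was needed.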