Trainer reported loss is wrong when using DeepSpeed and gradient_accumulation_steps > 1
Environment info
- transformers version: 4.7.0.dev0
- Platform: Windows-10-10.0.19041-SP0
- Python version: 3.8.0
- PyTorch version (GPU?): 1.8.1 (True)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: no, but using DeepSpeed on a single node
Who can help
@stas00, @sgugger (trainer.py)
See Also
https://github.com/microsoft/DeepSpeed/issues/1107
Information
Model I am using: Roberta
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQUaD task: (give the name)
- my own task or dataset: (give details below)
- pretraining a Language Model (wikipedia and bookcorpus datasets)
To reproduce
Steps to reproduce the behavior:
- run the scripts to pretrain a model with DeepSpeed on a single node with 1 GPU for N steps (gradient_accum_steps=1)
- run the scripts to pretrain a model with DeepSpeed on a single node with 1 GPU for N steps (gradient_accum_steps=8)
- note the vast difference in the loss reported on the console by trainer.py
Expected behavior
The loss reported for any number of gradient_accum_steps, nodes, or GPUs should be the mean of all losses, i.e. the same order of magnitude as shown when training with gradient_accum_steps=1 on a single node with a single GPU.
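To make the magnitude gap concrete, here is a minimal numeric sketch. It assumes (as the discussion below suggests) that each micro-batch loss is pre-divided by gradient_accumulation_steps before being handed back to the caller; averaging those returned values then understates the true mean loss by a factor of GAS. This is an illustration of the arithmetic, not the actual Trainer code.

```python
# Losses for 8 micro-batches of one accumulation window (illustrative values).
micro_batch_losses = [2.0, 2.5, 1.5, 2.0, 2.25, 1.75, 2.0, 2.0]
gas = 8  # gradient_accumulation_steps

# What the user expects to see logged: the mean micro-batch loss.
true_mean_loss = sum(micro_batch_losses) / len(micro_batch_losses)

# Assumed mechanism: each loss is divided by GAS before being returned.
returned = [loss / gas for loss in micro_batch_losses]

# Naively averaging the returned values divides by GAS a second time.
logged_loss = sum(returned) / len(returned)

print(true_mean_loss)  # 2.0
print(logged_loss)     # 0.25, i.e. true_mean_loss / gas
```

With gradient_accum_steps=1 the two quantities coincide, which is why the console numbers only diverge once GAS > 1.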
Issue Analytics
- Created: 2 years ago
- Comments: 6 (5 by maintainers)
Yes, sounds good to me.
@stas00, you are right, my suggestion here is not correct. I initially thought the problem was that the DeepSpeed code scales the loss by GAS and exposes the scaled value to the client (HF). But based on your and @sgugger's findings, it seems there is nothing to do if HF is fine with deepspeed.backward() returning the GAS-scaled loss. Sounds like this issue can be closed, once @rfernand2 agrees.
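If the caller does accept a GAS-scaled loss from backward(), the console value can still be made comparable to a GAS=1 run by multiplying it back before logging. The sketch below shows that idea with a toy stand-in engine; `FakeEngine` and `training_step` are hypothetical names for illustration, not the transformers or DeepSpeed API.

```python
def training_step(engine, batch, gradient_accumulation_steps):
    """Run one micro-step and return a loss suitable for console logging."""
    loss = engine(batch)  # stand-in for the forward pass
    # Assumed behavior: the engine's backward path works with a loss that has
    # been divided by gradient_accumulation_steps.
    scaled = loss / gradient_accumulation_steps
    # Re-scale before logging so the reported value matches a GAS=1 run.
    return scaled * gradient_accumulation_steps

class FakeEngine:
    """Toy engine that always returns the same loss, for illustration only."""
    def __call__(self, batch):
        return 2.0

reported = training_step(FakeEngine(), batch=None, gradient_accumulation_steps=8)
print(reported)  # 2.0 regardless of gradient_accumulation_steps
```

Whether the re-scaling belongs in the client or in DeepSpeed itself is exactly the question the thread settles: HF keeps accepting the scaled value, so no change was needed.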