
T5 generate with do_sample doesn't work on DeepSpeed Stage 3


System Info

transformers == 4.20.1
python == 3.8.13
OS == Ubuntu 20.04
DeepSpeed == 0.6.7

Who can help?

@patrickvonplaten

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

https://github.com/microsoft/DeepSpeed/issues/2022#issuecomment-1158389764

Expected behavior

All processes run and finish.

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

2 reactions
stas00 commented, Jul 26, 2022

Oops, apologies for the typo - fixed! Glad you figured it out, @lkm2835.

This has nothing to do with T5 specifically; it's just how ZeRO stage 3 works. ZeRO distributes the weight shards across all GPUs, so every GPU has to keep running in sync: if one GPU finishes generating and stops running forward passes, the other GPUs can no longer fetch the shards it holds.

So it really depends on the situation: sometimes all GPUs happen to generate outputs of the same length and it works without syncing, but that's just an accident and can easily break down the road.

For more details please see: https://huggingface.co/docs/transformers/main/perf_train_gpu_many#zero-data-parallelism
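To make that failure mode concrete, here is a standalone toy sketch using plain torch.distributed rather than DeepSpeed itself. The script name, step counts, and backend choice are illustrative assumptions, and the script deliberately deadlocks when launched on two or more processes, mimicking what happens when one rank stops joining the collectives the others still expect:

```python
# Toy illustration (not DeepSpeed code): under ZeRO stage 3 every forward
# pass triggers a collective (an all-gather of parameter shards), and a
# collective only completes if *every* rank joins it. If one rank exits
# its generation loop early, the remaining ranks block forever.
# Run with e.g.: torchrun --nproc_per_node=2 toy_hang.py
import torch
import torch.distributed as dist

def main():
    dist.init_process_group("gloo")  # CPU backend keeps the sketch portable
    rank = dist.get_rank()

    # Pretend rank 0 "finishes generating" after 3 steps, the rest after 5.
    steps = 3 if rank == 0 else 5

    for _ in range(steps):
        shard = torch.ones(1)
        dist.all_reduce(shard)  # stand-in for ZeRO-3's per-forward all-gather

    # Rank 0 returns here after step 3; the other ranks hang at step 4,
    # waiting on a collective that rank 0 will never join. synced_gpus=True
    # avoids this by keeping finished ranks running dummy forward passes.
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```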

1 reaction
stas00 commented, Jul 26, 2022

@lkm2835, looking at the code you linked to: you must use generate(..., synced_gpus=True) when using ZeRO stage 3.
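For reference, here is a minimal sketch of what that call might look like end to end, assuming a ZeRO stage-3 config in ds_config.json and a launch via the deepspeed launcher; the model name, prompt, and file names are illustrative assumptions, not taken from the issue:

```python
# Minimal sketch: sampling with T5 under DeepSpeed ZeRO stage 3.
# Assumes a ZeRO stage-3 config in ds_config.json and a launch such as
# `deepspeed --num_gpus=2 sample_t5.py`; model name and prompt are
# illustrative, not from the original issue.
import deepspeed
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# deepspeed.initialize shards the weights across the participating GPUs.
engine, _, _, _ = deepspeed.initialize(model=model, config="ds_config.json")
engine.module.eval()

inputs = tokenizer("translate English to German: Hello", return_tensors="pt")
inputs = {k: v.to(engine.device) for k, v in inputs.items()}

with torch.no_grad():
    # synced_gpus=True keeps every rank running forward passes until all
    # ranks have finished generating, so no rank starves the others of
    # the weight shards it holds.
    outputs = engine.module.generate(
        **inputs,
        do_sample=True,
        max_length=64,
        synced_gpus=True,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```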


