T5 generate with do_sample doesn't work on DeepSpeed Stage 3
System Info
- transformers == 4.20.1
- python == 3.8.13
- OS == Ubuntu 20.04
- DeepSpeed == 0.6.7
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
https://github.com/microsoft/DeepSpeed/issues/2022#issuecomment-1158389764
Expected behavior
All processes run and finish.
Issue Analytics
- State:
- Created a year ago
- Comments: 5 (5 by maintainers)
Top GitHub Comments
oops, apologies for the typo - fixed! glad you figured it out, @lkm2835
This has nothing to do with T5 specifically; it is just how ZeRO stage 3 works. All GPUs need to work in sync, so if one GPU finishes generating, it still has to keep running
forward
because ZeRO distributes the weight shards across all GPUs, and if one GPU stops, the others can no longer fetch the shards they are missing. So it really depends on the situation: sometimes all GPUs happen to generate the same output length and it works without syncing, but that is just an accident and can easily break down the road.
For more details please see: https://huggingface.co/docs/transformers/main/perf_train_gpu_many#zero-data-parallelism
@lkm2835, looking at the code you linked to, you must use
generate(..., synced_gpus=True)
when using ZeRO stage-3.
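The scheduling behavior that `synced_gpus=True` enforces can be sketched in plain Python. This is a toy simulation, not DeepSpeed or transformers code; the rank count and target lengths below are invented for illustration. The point is that every "rank" keeps joining the forward step until the longest sequence is done, so no rank ever drops out of the collective:

```python
# Toy sketch of why ZeRO stage-3 needs synced_gpus during generate().
# Each "rank" wants a different output length; under ZeRO stage-3 every
# forward pass is a collective op, so finished ranks must keep stepping
# (as no-ops) until the slowest rank is done.

def synced_generate(target_lengths):
    """Simulate a synced generation loop across len(target_lengths) ranks.

    Returns the per-rank outputs and the number of forward steps taken,
    which equals max(target_lengths): every rank runs that many steps.
    """
    outputs = [[] for _ in target_lengths]
    num_steps = 0
    while any(len(out) < tgt for out, tgt in zip(outputs, target_lengths)):
        # Collective forward: all ranks participate, finished or not.
        for rank, tgt in enumerate(target_lengths):
            if len(outputs[rank]) < tgt:
                outputs[rank].append(f"tok{num_steps}")
            # else: this rank already finished, but it still joins the
            # collective so the other ranks can fetch its weight shards.
        num_steps += 1
    return outputs, num_steps

outs, steps = synced_generate([2, 5, 3])
# Every rank ran 5 forward steps, the length of the longest sequence.
```

Without the sync, the rank that finished after 2 tokens would stop calling forward, and the remaining ranks would hang waiting for the weight shards it owns, which matches the hang reported in this issue.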