[BUG] generate() with do_sample does not finish on multi-GPU ZeRO Stage 3 with T5ForConditionalGeneration
See original GitHub issue
Describe the bug
On multiple GPUs, only one GPU finishes generate(); the rest never do.
When I use a single GPU, it works well.
This bug happens with T5ForConditionalGeneration but does not happen with GPT2LMHeadModel.
nvidia-smi output during the hang:
+-------------------------------+----------------------+----------------------+
| 7 NVIDIA A100-SXM... On | 00000000:00:0B.0 Off | 0 |
| N/A 40C P0 92W / 400W | 24222MiB / 40536MiB | 100% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 8 NVIDIA A100-SXM... On | 00000000:80:00.0 Off | 0 |
| N/A 38C P0 87W / 400W | 20572MiB / 40536MiB | 100% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
The rank that did finish generate() then blocks at dist.barrier() forever.
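For context, a minimal sketch of the suspected failure mode, assuming the hang comes from mismatched collectives: ZeRO Stage 3 shards the parameters, so every decoding step must gather them across all ranks; a rank whose sampling loop ends early stops joining those collectives, the remaining ranks block inside them, and the barrier is never passed. The script below is purely illustrative (hypothetical rank count and step counts, plain torch.distributed instead of DeepSpeed) and deadlocks on purpose; stop it with Ctrl+C.

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int):
    # Illustrative only: two ranks that issue a different number of collectives.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29511"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    steps = 3 if rank == 0 else 5      # rank 0 "finishes generating" after 3 steps
    for _ in range(steps):
        t = torch.ones(1)
        dist.all_reduce(t)             # stands in for the per-step ZeRO-3 parameter gather

    dist.barrier()                     # rank 0 waits here while rank 1 is still stuck
    dist.destroy_process_group()       # in its 4th all_reduce -> deadlock

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)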
To Reproduce
# `dist` refers to torch.distributed
def generation_t5_from(self, encoded_example, max_tokens):
    #--------------------------- Tokenizing text of prompts -------------------------------#
    self.tokenizer.padding_side = 'left'
    prompts = self.tokenizer(
        encoded_example['inputs'],
        padding='longest',
        truncation=True,
        max_length=max_tokens,
        return_tensors="pt",
    ).to(self.device['lm'])
    self.tokenizer.padding_side = 'right'

    #---------------------------- Generation from the prompts ------------------------------#
    generated_token_length = self.num_steps
    generations = self.lm.generate(
        prompts.input_ids,
        max_length=generated_token_length,
        do_sample=True,
    )
    dist.barrier()
Expected behavior
Every GPU finishes generate() and passes dist.barrier().
ds_report output
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
async_io ............... [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/kyungmin.lee/anaconda3/envs/lib/python3.8/site-packages/torch']
torch version .................... 1.11.0+cu113
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 10.1
deepspeed install path ........... ['/home/kyungmin.lee/DeepSpeed/deepspeed']
deepspeed info ................... 0.6.6+ae198e20, ae198e20, master
deepspeed wheel compiled w. ...... torch 1.11, cuda 11.3
System info (please complete the following information):
- OS: Ubuntu 20.04
- GPU count and types: one machine with 2x A100s
- Python version: 3.8.12
- Transformers: 4.19.4
Issue Analytics
- Created: a year ago
- Comments: 6 (1 by maintainers)
Top GitHub Comments
Solution: you must use generate(..., synced_gpus=True) when using ZeRO Stage 3.
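A minimal sketch of the corrected call, reusing the names from the reproduction code above (self.lm, prompts, generated_token_length, and dist come from the report; the attention_mask argument is an extra suggestion for the left-padded prompts, not part of the report):

generations = self.lm.generate(
    prompts.input_ids,
    attention_mask=prompts.attention_mask,  # suggested addition for left-padded prompts
    max_length=generated_token_length,
    do_sample=True,
    synced_gpus=True,                       # required under ZeRO Stage 3
)
dist.barrier()                              # now reached on every rank

With synced_gpus=True, ranks whose sequences finish early keep running dummy forward passes until every rank is done, so the per-step parameter gathers that ZeRO Stage 3 performs stay matched across ranks and the barrier is reached everywhere.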