[BUG] generate() with do_sample does not finish on multi-GPU ZeRO Stage 3 with T5ForConditionalGeneration
See original GitHub issue
Describe the bug
On multiple GPUs, only one GPU finishes generate(); the rest never do.
When I use a single GPU, it works well.
This bug happens with T5ForConditionalGeneration but does not happen with GPT2LMHeadModel.
nvidia-smi output during the hang:
+-------------------------------+----------------------+----------------------+
| 7 NVIDIA A100-SXM... On | 00000000:00:0B.0 Off | 0 |
| N/A 40C P0 92W / 400W | 24222MiB / 40536MiB | 100% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 8 NVIDIA A100-SXM... On | 00000000:80:00.0 Off | 0 |
| N/A 38C P0 87W / 400W | 20572MiB / 40536MiB | 100% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
The rank that did finish generate() then blocks at dist.barrier() forever.
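For context, a minimal sketch of the suspected failure mode, assuming the hang comes from mismatched collectives: ZeRO Stage 3 shards the parameters, so every decoding step must gather them across all ranks; a rank whose sampling loop ends early stops joining those collectives, the remaining ranks block inside them, and the barrier is never passed. The script below is purely illustrative (hypothetical rank count and step counts, plain torch.distributed instead of DeepSpeed) and deadlocks on purpose; stop it with Ctrl+C.

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int):
    # Illustrative only: two ranks that issue a different number of collectives.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29511"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    steps = 3 if rank == 0 else 5      # rank 0 "finishes generating" after 3 steps
    for _ in range(steps):
        t = torch.ones(1)
        dist.all_reduce(t)             # stands in for the per-step ZeRO-3 parameter gather

    dist.barrier()                     # rank 0 waits here while rank 1 is still stuck
    dist.destroy_process_group()       # in its 4th all_reduce -> deadlock

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)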
To Reproduce
# `dist` refers to torch.distributed
def generation_t5_from(self, encoded_example, max_tokens):
    #--------------------------- Tokenizing text of prompts -------------------------------#
    self.tokenizer.padding_side = 'left'
    prompts = self.tokenizer(
        encoded_example['inputs'],
        padding='longest',
        truncation=True,
        max_length=max_tokens,
        return_tensors="pt",
    ).to(self.device['lm'])
    self.tokenizer.padding_side = 'right'

    #---------------------------- Generation from the prompts ------------------------------#
    generated_token_length = self.num_steps
    generations = self.lm.generate(
        prompts.input_ids,
        max_length=generated_token_length,
        do_sample=True,
    )
    dist.barrier()
Expected behavior
Every GPU finishes generate() and passes dist.barrier().
ds_report output
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
async_io ............... [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/kyungmin.lee/anaconda3/envs/lib/python3.8/site-packages/torch']
torch version .................... 1.11.0+cu113
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 10.1
deepspeed install path ........... ['/home/kyungmin.lee/DeepSpeed/deepspeed']
deepspeed info ................... 0.6.6+ae198e20, ae198e20, master
deepspeed wheel compiled w. ...... torch 1.11, cuda 11.3
System info (please complete the following information):
- OS: Ubuntu 20.04
- GPU count and types: one machine with 2x A100s
- Python version: 3.8.12
- Transformers: 4.19.4
Issue Analytics
- Created: a year ago
- Comments: 6 (1 by maintainers)
Top GitHub Comments
Solution: you must use generate(..., synced_gpus=True) when using ZeRO Stage 3.
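A minimal sketch of the corrected call, reusing the names from the reproduction code above (self.lm, prompts, generated_token_length, and dist come from the report; the attention_mask argument is an extra suggestion for the left-padded prompts, not part of the report):

generations = self.lm.generate(
    prompts.input_ids,
    attention_mask=prompts.attention_mask,  # suggested addition for left-padded prompts
    max_length=generated_token_length,
    do_sample=True,
    synced_gpus=True,                       # required under ZeRO Stage 3
)
dist.barrier()                              # now reached on every rank

With synced_gpus=True, ranks whose sequences finish early keep running dummy forward passes until every rank is done, so the per-step parameter gathers that ZeRO Stage 3 performs stay matched across ranks and the barrier is reached everywhere.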