Blocking issue when using DeepSpeed inference (possibly a mutex or NCCL issue)
Description
Dear DeepSpeed team,
I have an issue when using model parallelism (the inference engine): sometimes GPU utilization gets stuck at 100% and the code hangs. So I wrote a test script to exercise the DeepSpeed engine. Here is my test code.
TestCode
import os
import deepspeed
import torch
import transformers
from transformers import pipeline, AutoTokenizer

def init():
    local_rank = int(os.getenv('LOCAL_RANK', '0'))
    world_size = int(os.getenv('WORLD_SIZE', '1'))
    generator = pipeline(
        'text-generation', model='EleutherAI/gpt-neo-2.7B', device=local_rank)
    generator.model = deepspeed.init_inference(generator.model,
                                               mp_size=world_size,
                                               dtype=torch.float,
                                               replace_method='auto')
    return generator

def predict(text, max_len):
    top_k = 50
    temperature = 1.0
    top_p = 1.0
    return_seq = 1
    string = generator(text, do_sample=True, min_length=50, max_length=max_len,
                       top_k=top_k, temperature=temperature, top_p=top_p,
                       num_return_sequences=return_seq, pad_token_id=3)
    if torch.distributed.get_rank() == 0:
        print(string)

if __name__ == '__main__':
    generator = init()
    text = 'a'
    seq = 2023
    for i in range(2, seq):
        print(f'##### max_len: {i}')
        predict(text, i)
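For reference, a script like this is normally started with the DeepSpeed launcher, which sets the LOCAL_RANK and WORLD_SIZE environment variables for each process; the script filename below is just a placeholder:

deepspeed --num_gpus 2 test_inference.py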
DS_Report
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
async_io ............... [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/opt/conda/lib/python3.8/site-packages/torch']
torch version .................... 1.9.0a0+c3d40fd
torch cuda version ............... 11.3
nvcc version ..................... 11.3
deepspeed install path ........... ['/opt/conda/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.4.3, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.9, cuda 11.3
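(For reference, the report above is the output of the ds_report command that ships with DeepSpeed; it can be re-run at any time to check op compatibility and environment versions.)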
ENV
- NGC 21.06 Docker image (PyTorch, NCCL, CUDA)
- pip install deepspeed
The issue occurs when the input length reaches about 90 tokens (though the exact point may be randomly determined).
Thank you.
Top GitHub Comments
Hi @RezaYazdaniAminabadi
I tried to reproduce it 10 times (increasing the token count from 145 to 2022, repeated 10 times) with your fixed DeepSpeed repo (0.4.6+5038b07, commit 5038b07, branch reyazda/mp_inference).
The issue no longer reproduces.
There is a slight difference in inference speed between when barrier() is added to the code and when it is not, but the difference is at most xxx ms for long sequence generation, so it seems to be a minor point (a rough sketch of the barrier placement follows this comment).
Thanks
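For context, a minimal sketch of what adding barrier() around each generation call might look like, using the test script above (this is only an illustration, not the exact change that was tested):

import torch.distributed as dist

def predict(text, max_len):
    # Illustration only: synchronize all ranks before and after each
    # generation call so that no rank runs ahead of the others.
    dist.barrier()
    output = generator(text, do_sample=True, min_length=50, max_length=max_len,
                       top_k=50, temperature=1.0, top_p=1.0,
                       num_return_sequences=1, pad_token_id=3)
    dist.barrier()
    if dist.get_rank() == 0:
        print(output)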
Hi @RezaYazdaniAminabadi
Of course. I will let you know the results after testing.
Thanks