Blocking issue when using DeepSpeed inference (possibly a mutex or NCCL issue)
Description
Dear DeepSpeed team,
I have an issue when using model parallelism (the inference engine): sometimes GPU utilization gets stuck at 100% and the code hangs. So I wrote a test script to exercise the DeepSpeed engine. Here is my test code.
TestCode
import os
import deepspeed
import torch
import transformers
from transformers import pipeline, AutoTokenizer

def init():
    local_rank = int(os.getenv('LOCAL_RANK', '0'))
    world_size = int(os.getenv('WORLD_SIZE', '1'))
    generator = pipeline(
        'text-generation', model='EleutherAI/gpt-neo-2.7B', device=local_rank)
    generator.model = deepspeed.init_inference(generator.model,
                                               mp_size=world_size,
                                               dtype=torch.float,
                                               replace_method='auto')
    return generator

def predict(text, max_len):
    top_k = 50
    temperature = 1.0
    top_p = 1.0
    return_seq = 1
    string = generator(text, do_sample=True, min_length=50, max_length=max_len,
                       top_k=top_k, temperature=temperature, top_p=top_p,
                       num_return_sequences=return_seq, pad_token_id=3)
    if torch.distributed.get_rank() == 0:
        print(string)

if __name__ == '__main__':
    generator = init()
    text = 'a'
    seq = 2023
    for i in range(2, seq):
        print(f'##### max_len: {i}')
        predict(text, i)
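For reference, a script like this is normally started with the DeepSpeed launcher, which sets the LOCAL_RANK and WORLD_SIZE environment variables for each process; the script filename below is just a placeholder:

deepspeed --num_gpus 2 test_inference.py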
DS_Report
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
async_io ............... [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/opt/conda/lib/python3.8/site-packages/torch']
torch version .................... 1.9.0a0+c3d40fd
torch cuda version ............... 11.3
nvcc version ..................... 11.3
deepspeed install path ........... ['/opt/conda/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.4.3, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.9, cuda 11.3
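(For reference, the report above is the output of the ds_report command that ships with DeepSpeed; it can be re-run at any time to check op compatibility and environment versions.)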
ENV
- NGC 21.06 Docker image (PyTorch, NCCL, CUDA)
- pip install deepspeed
The issue occurs when the input length reaches about 90 tokens (though the exact point may be randomly determined).
Thank you.
Top GitHub Comments
Hi @RezaYazdaniAminabadi
I tried to reproduce it 10 times (increasing the token count from 145 to 2022, repeated 10 times) with your fixed DeepSpeed repo (0.4.6+5038b07, commit 5038b07, branch reyazda/mp_inference).
The issue no longer reproduces.
There is a slight difference in inference speed between when barrier() is added to the code and when it is not, but the difference is at most xxx ms for long sequence generation, so it seems to be a minor point (a rough sketch of the barrier placement follows this comment).
Thanks
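For context, a minimal sketch of what adding barrier() around each generation call might look like, using the test script above (this is only an illustration, not the exact change that was tested):

import torch.distributed as dist

def predict(text, max_len):
    # Illustration only: synchronize all ranks before and after each
    # generation call so that no rank runs ahead of the others.
    dist.barrier()
    output = generator(text, do_sample=True, min_length=50, max_length=max_len,
                       top_k=50, temperature=1.0, top_p=1.0,
                       num_return_sequences=1, pad_token_id=3)
    dist.barrier()
    if dist.get_rank() == 0:
        print(output)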
Hi @RezaYazdaniAminabadi
Of course. I will let you know the results after testing.
Thanks