
[BUG] DeepSpeed Inference with GPT-J using batches with padding gives wrong outputs


Describe the bug

Using DeepSpeed Inference (via deepspeed.init_inference) produces incorrect outputs when using batch size > 1 and padding the inputs.

I’ll first state the problem in more detail and then explain what I tried in order to narrow it down.

The problem: I’m trying to run inference with GPT-J (EleutherAI/gpt-j-6B) on a very large dataset and therefore want to achieve the highest throughput possible for my setup. I’m using a p3.16xlarge instance with 8 V100 GPUs, so in theory I can fit a batch size larger than 1, since DeepSpeed shards the model tensors across the GPUs. Since the inputs are of different lengths, I have to use padding. This is how I pad (let’s assume batch_size=4, so len(input_texts) = 4):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
tokenizer.padding_side = "left"            # left-pad so generation continues from the real text
tokenizer.pad_token = tokenizer.eos_token  # GPT-J has no pad token, so reuse EOS
tokenized_inputs = tokenizer(
    list(input_texts),
    return_tensors='pt',
    padding=True,
    max_length=tokenizer.model_max_length - args.max_new_tokens,
    truncation=True,
)
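
For illustration (this toy example is mine, not from the original report), here is roughly what this left-padding setup produces for a tiny batch:

from transformers import AutoTokenizer

# Toy batch, just to show the effect of left padding (the strings are made up).
tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
tok.padding_side = "left"
tok.pad_token = tok.eos_token

batch = tok(["Hello world", "Hi"], return_tensors="pt", padding=True)
print(batch["attention_mask"])
# The shorter input is padded on the left and its pad positions get
# attention_mask == 0, e.g. tensor([[1, 1], [0, 1]])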

Now to the problem. Assume these are the sequence lengths of each input (in number of tokens):

  • idx0: 1452
  • idx1: 1588
  • idx2: 1055
  • idx3: 650

The outputs I get from the model are exactly what I expect for idx1 (since it’s the longest and has no padding), very close to what I expect for idx0, but terrible for idx3. What I “expect” is what I get when I run the exact same code with DeepSpeed with batch_size=1, or when I run the same code without DeepSpeed on CPU with batch_size=4. In both of these cases (DeepSpeed bsz=1 and CPU bsz=4) the outputs are identical, and they also make sense (it’s an extraction task, so I can tell whether the output makes sense or not).

I tried to figure out what exactly causes this problem, and based on the evidence I’ve gathered I think that the sequences with long padding on the left side somehow accumulate a huge attention weight that is not correctly masked by the attention mask. My evidence:

  1. If I run with DeepSpeed, bsz=4 and torch.float16, the outputs I get are “!!!” no matter the prompt. But if I run with torch.float32 I get “normal” outputs, which still differ from what I expect (defined above). This makes me think some tensor overflows in fp16 but not in fp32. I should also mention that running DeepSpeed with fp16 and bsz=1 works perfectly.
  2. The longest input in the batch (which has no padding at all) gives the expected result. Those that are close to it in length have only a slightly off output (small number of padding tokens). Those that are much shorter (many padding tokens) have highly unrelated output.
  3. The way the GPT-J attention mechanism works (at least the HuggingFace implementation) is that you add -10,000 to the attention weight wherever the attention mask is 0 (see the sketch after this list). This might not be enough if the many padding tokens accumulate a large attention weight. Although when I run it on CPU with the HuggingFace implementation everything is ok, so it might not be the reason.
  4. I’m pretty sure the culprit is this function: https://github.com/microsoft/DeepSpeed/blob/a10e4811fe78b707289132c9695bade4715fe59b/csrc/transformer/inference/csrc/softmax.cu#L203 But unfortunately I don’t speak CUDA, so it’s very hard for me to follow and pinpoint exactly what the problem is. As far as I can tell, the HuggingFace implementation of attention works (https://github.com/huggingface/transformers/blob/2c2a31ffbcfe03339b1721348781aac4fc05bc5e/src/transformers/models/gptj/modeling_gptj.py#L72).

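To make point 3 above concrete, here is a minimal, simplified sketch of the additive-mask trick (my own code, not the actual HuggingFace or DeepSpeed implementation; the names scores and pad_mask are made up), plus a note on why fp16 makes it fragile:

import torch

torch.manual_seed(0)
seq_len = 6
scores = torch.randn(seq_len, seq_len)       # raw attention scores for one head
pad_mask = torch.tensor([0, 0, 0, 1, 1, 1])  # first 3 key positions are left padding

# Turn the 0/1 attention mask into a large negative additive bias, then softmax.
bias = (1.0 - pad_mask.float()) * -10000.0
probs = (scores + bias).softmax(dim=-1)
print(probs[:, :3].max())  # padded keys end up with (essentially) zero probability

# In fp16 this trick is fragile: -10000 itself fits (fp16 range is roughly +/-65504),
# but a kernel that adds such biases to already-large scores, or applies them more
# than once, can overflow to -inf/NaN -- one plausible reading of the "!!!" outputs.
print(torch.tensor([-60000.0], dtype=torch.float16) * 2)  # -> -inf in fp16
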
To Reproduce

import os

import torch
import deepspeed
from transformers import AutoTokenizer, GPTJForCausalLM

# `args` (providing max_new_tokens) and `input_texts` are defined elsewhere in the full script.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token
tokenizer.model_max_length = 2048

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))
device = torch.device(f'cuda:{local_rank}')

model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")
model.config.pad_token_id = model.config.eos_token_id

model = deepspeed.init_inference(
    model,
    mp_size=world_size,
    dtype=torch.float32,
    replace_method='auto',
    replace_with_kernel_inject=True,
)
model.device = device

tokenized_inputs = tokenizer(
    list(input_texts), 
    return_tensors='pt', 
    padding=True,
    max_length=tokenizer.model_max_length - args.max_new_tokens, 
    truncation=True,
).to(device)

with torch.inference_mode():
    batch_output_tokens = model.generate(
        input_ids=tokenized_inputs['input_ids'],
        attention_mask=tokenized_inputs['attention_mask'],
        do_sample=False,
        max_new_tokens=args.max_new_tokens,
        min_length=tokenized_inputs.input_ids.shape[1]+args.max_new_tokens,
        repetition_penalty=1.1,
        pad_token_id=tokenizer.eos_token_id,
    )

batch_output_text = tokenizer.batch_decode(batch_output_tokens, skip_special_tokens=True)

Expected behavior

Running DeepSpeed with batch_size=1 or batch_size=4 (or larger) should give the same outputs. Running DeepSpeed with fp16 and batch size > 1 should work and not output “!!!”.
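
One way to check that expectation (my own sketch, not from the issue; it assumes model, tokenizer, device, args and input_texts are set up exactly as in the reproduction above, and it skips min_length/repetition_penalty for brevity):

def generate_texts(texts, batch_size):
    # Greedy-decode `texts` in chunks of `batch_size` and return the decoded strings.
    outputs = []
    for i in range(0, len(texts), batch_size):
        enc = tokenizer(list(texts[i:i + batch_size]), return_tensors="pt",
                        padding=True, truncation=True,
                        max_length=tokenizer.model_max_length - args.max_new_tokens).to(device)
        with torch.inference_mode():
            out = model.generate(input_ids=enc["input_ids"],
                                 attention_mask=enc["attention_mask"],
                                 do_sample=False,
                                 max_new_tokens=args.max_new_tokens,
                                 pad_token_id=tokenizer.eos_token_id)
        outputs.extend(tokenizer.batch_decode(out, skip_special_tokens=True))
    return outputs

# With greedy decoding these should be identical; with the bug, the padded
# (shorter) inputs in the batch_size=4 run come back different.
assert generate_texts(input_texts, 1) == generate_texts(input_texts, 4)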

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
async_io ............... [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/opt/conda/lib/python3.6/site-packages/torch']
torch version .................... 1.10.2+cu111
torch cuda version ............... 11.1
nvcc version ..................... 11.1
deepspeed install path ........... ['/opt/conda/lib/python3.6/site-packages/deepspeed']
deepspeed info ................... 0.6.0+2151c78, 2151c78, master
deepspeed wheel compiled w. ...... torch 1.10, cuda 11.1

System info: SageMaker instance p3.16xlarge with SageMaker container 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.8.1-gpu-py36-cu111-ubuntu18.04

Launcher context: Launching with deepspeed --num_gpus 8 run_inference.py

Docker context: See above.


Top GitHub Comments

RezaYazdaniAminabadi commented on Aug 19, 2022 (4 reactions)

Hi guys,

Sorry for my delay here! @codertimo Yes, you are right that the padding is not handled correctly for this model in the softmax kernel. This was fixed very recently for the BLOOM model, and I am going to work on fixing it for the rest of the models too. I will focus on this and send a PR with a fix soon.

Thanks,
Reza

trianxy commented on Sep 12, 2022 (2 reactions)

Thanks @RezaYazdaniAminabadi for fixing this!

Commit 4abd455521965930d0e921de8afc0073ea7df9d1 from the PR you mentioned fixes the problem when I tested it with a HuggingFace gpt2 model. By the way, the commit immediately before it, aafba00c81eaf29c0c2b209a94bc31f4de942936, still had the bug.

I wasn’t able to test the PR on longer input sequences, though. The model seems to produce wrong/non-deterministic outputs there due to https://github.com/microsoft/DeepSpeed/issues/2243. You mentioned that you might have a fix for that issue too. Once the fix for the latter issue is merged, I will go ahead and test it on the longer input sequences as well.
