[BUG] Inference kernels don't handle Huggingface attention_mask correctly
Describe the bug
When I use DeepSpeed’s inference kernels with Huggingface transformers and pass in an attention_mask that masks out some tokens, the mask affects the output in strange ways. In my repro code this shows up as very different logits; when sampling, it produces garbage samples.
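For reference, a minimal sketch of what the tokenizer produces for right-padded input (GPT-2 has no pad token, so EOS is reused for padding, and the mask is 0 at the padded positions); this is the kind of input the repro below feeds to the model:

import torch
from transformers import AutoTokenizer

# Illustrative only: inspect the ids/mask for a right-padded batch.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
enc = tokenizer(["This is a test sentence."], padding="max_length", max_length=8)
print(enc["input_ids"])       # real token ids followed by eos-id padding
print(enc["attention_mask"])  # 1 for real tokens, 0 for the padded positions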
To Reproduce
Run the following code:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import deepspeed

DEEPSPEED = True

device = torch.device("cuda")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.to(device)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

if DEEPSPEED:
    ds_engine = deepspeed.init_inference(
        model, mp_size=1, dtype=torch.half, checkpoint=None, replace_method="auto"
    )
    model = ds_engine.module

text = ["This is a test sentence."]
no_padding = tokenizer(text)
no_padding_logits = model(torch.tensor(no_padding["input_ids"], device=device)).logits

with_padding = tokenizer(text, padding="max_length", max_length=32)
with_padding_logits = model(
    torch.tensor(with_padding["input_ids"], device=device),
    attention_mask=torch.tensor(with_padding["attention_mask"], device=device),
).logits

difference = torch.max(
    torch.abs(no_padding_logits - with_padding_logits[:, : no_padding_logits.shape[1]])
).item()
print(f"Max difference: {difference:.2g}")
assert difference <= 2e-4
I get:
Max difference: 0.25
Traceback (most recent call last):
File "/home/ubuntu/unity/adversarial/test_deepspeed_inference.py", line 38, in <module>
assert difference <= 2e-4
AssertionError
Expected behavior
The assertion should pass. Adding masked-out padding tokens after the tokens in question should not dramatically shift the output. Indeed, when I run with DEEPSPEED = False, the max difference is only 0.00011.
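For context, this expectation matches how transformers consumes the mask: the binary attention_mask becomes an additive bias on the pre-softmax attention scores, so padded key positions receive essentially zero attention weight. A minimal sketch of that additive-bias trick (the bias value here is illustrative; transformers derives it from the dtype):

import torch

# Toy illustration: masked-out key positions get a large negative score added
# before the softmax, so they end up with ~zero attention weight.
attention_mask = torch.tensor([1, 1, 1, 1, 0, 0], dtype=torch.float)  # 0 = padding
scores = torch.randn(6)                # pre-softmax scores for one query position
bias = (1.0 - attention_mask) * -1e4   # illustrative bias value
weights = torch.softmax(scores + bias, dim=-1)
print(weights)                         # last two entries are ~0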
ds_report output
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/ubuntu/.asdf/installs/python/3.9.6/lib/python3.9/site-packages/torch']
torch version .................... 1.9.0+cu111
torch cuda version ............... 11.1
nvcc version ..................... 11.1
deepspeed install path ........... ['/home/ubuntu/.asdf/installs/python/3.9.6/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.5.4, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.9, cuda 11.1
System info (please complete the following information):
- OS: Ubuntu 18.04
- GPU: one V100-16GB
- Interconnects: n/a
- Python version: 3.9.6
Launcher context
Just running directly: one process on one GPU.
Top GitHub Comments
Thanks @daniel-ziegler for testing this and happy to see the issue is solved 😃
Oh, and with .half() for the baseline, as you pointed out. So DeepSpeed looks pretty good after this fix.
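A minimal sketch of the fp16 baseline that comment refers to: the plain Huggingface model cast to half precision, so the comparison matches DeepSpeed’s dtype=torch.half setting (no DeepSpeed involved here):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Same comparison as the repro above, but against a fp16 Huggingface baseline.
device = torch.device("cuda")
model = AutoModelForCausalLM.from_pretrained("gpt2").half().to(device)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

text = ["This is a test sentence."]
no_pad = tokenizer(text)
padded = tokenizer(text, padding="max_length", max_length=32)

no_pad_logits = model(torch.tensor(no_pad["input_ids"], device=device)).logits
padded_logits = model(
    torch.tensor(padded["input_ids"], device=device),
    attention_mask=torch.tensor(padded["attention_mask"], device=device),
).logits

diff = (no_pad_logits - padded_logits[:, : no_pad_logits.shape[1]]).abs().max().item()
print(f"Max difference (fp16 Huggingface baseline): {diff:.2g}")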