[BUG] Inference kernels don't handle Huggingface attention_mask correctly
Describe the bug
When I use DeepSpeed’s inference kernels with Huggingface transformers and pass in an attention_mask that masks out some tokens, the mask affects the output in strange ways. In my repro code this shows up as very different logits; when sampling, it produces garbage samples.
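For reference, a minimal sketch of what the tokenizer produces for right-padded input (GPT-2 has no pad token, so EOS is reused for padding, and the mask is 0 at the padded positions); this is the kind of input the repro below feeds to the model:

import torch
from transformers import AutoTokenizer

# Illustrative only: inspect the ids/mask for a right-padded batch.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
enc = tokenizer(["This is a test sentence."], padding="max_length", max_length=8)
print(enc["input_ids"])       # real token ids followed by eos-id padding
print(enc["attention_mask"])  # 1 for real tokens, 0 for the padded positions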
To Reproduce
Run the following code:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import deepspeed

DEEPSPEED = True

device = torch.device("cuda")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.to(device)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

if DEEPSPEED:
    ds_engine = deepspeed.init_inference(
        model, mp_size=1, dtype=torch.half, checkpoint=None, replace_method="auto"
    )
    model = ds_engine.module

text = ["This is a test sentence."]
no_padding = tokenizer(text)
no_padding_logits = model(torch.tensor(no_padding["input_ids"], device=device)).logits

with_padding = tokenizer(text, padding="max_length", max_length=32)
with_padding_logits = model(
    torch.tensor(with_padding["input_ids"], device=device),
    attention_mask=torch.tensor(with_padding["attention_mask"], device=device),
).logits

difference = torch.max(
    torch.abs(no_padding_logits - with_padding_logits[:, : no_padding_logits.shape[1]])
).item()
print(f"Max difference: {difference:.2g}")
assert difference <= 2e-4
I get:
Max difference: 0.25
Traceback (most recent call last):
File "/home/ubuntu/unity/adversarial/test_deepspeed_inference.py", line 38, in <module>
assert difference <= 2e-4
AssertionError
Expected behavior
The assertion should pass. Adding masked-out padding tokens after the tokens in question should not dramatically shift the output. Indeed, when I run with DEEPSPEED = False, the max difference is only 0.00011.
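For context, this expectation matches how transformers consumes the mask: the binary attention_mask becomes an additive bias on the pre-softmax attention scores, so padded key positions receive essentially zero attention weight. A minimal sketch of that additive-bias trick (the bias value here is illustrative; transformers derives it from the dtype):

import torch

# Toy illustration: masked-out key positions get a large negative score added
# before the softmax, so they end up with ~zero attention weight.
attention_mask = torch.tensor([1, 1, 1, 1, 0, 0], dtype=torch.float)  # 0 = padding
scores = torch.randn(6)                # pre-softmax scores for one query position
bias = (1.0 - attention_mask) * -1e4   # illustrative bias value
weights = torch.softmax(scores + bias, dim=-1)
print(weights)                         # last two entries are ~0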
ds_report output
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/ubuntu/.asdf/installs/python/3.9.6/lib/python3.9/site-packages/torch']
torch version .................... 1.9.0+cu111
torch cuda version ............... 11.1
nvcc version ..................... 11.1
deepspeed install path ........... ['/home/ubuntu/.asdf/installs/python/3.9.6/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.5.4, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.9, cuda 11.1
System info (please complete the following information):
- OS: Ubuntu 18.04
- GPU: one V100-16GB
- Interconnects: n/a
- Python version: 3.9.6
Launcher context
Just running directly: one process on one GPU.
Top GitHub Comments
Thanks @daniel-ziegler for testing this and happy to see the issue is solved 😃
Oh, and with .half() for the baseline, as you pointed out. So DeepSpeed looks pretty good after this fix.
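A minimal sketch of the fp16 baseline that comment refers to: the plain Huggingface model cast to half precision, so the comparison matches DeepSpeed’s dtype=torch.half setting (no DeepSpeed involved here):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Same comparison as the repro above, but against a fp16 Huggingface baseline.
device = torch.device("cuda")
model = AutoModelForCausalLM.from_pretrained("gpt2").half().to(device)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

text = ["This is a test sentence."]
no_pad = tokenizer(text)
padded = tokenizer(text, padding="max_length", max_length=32)

no_pad_logits = model(torch.tensor(no_pad["input_ids"], device=device)).logits
padded_logits = model(
    torch.tensor(padded["input_ids"], device=device),
    attention_mask=torch.tensor(padded["attention_mask"], device=device),
).logits

diff = (no_pad_logits - padded_logits[:, : no_pad_logits.shape[1]]).abs().max().item()
print(f"Max difference (fp16 Huggingface baseline): {diff:.2g}")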