[BUG] DeepSpeed Inference with GPT-J using batches with padding gives wrong outputs
Describe the bug
Using DeepSpeed Inference (via `deepspeed.init_inference`) gives weird outputs when using batch size > 1 and padding the inputs.
I’ll first state the problem in more detail and then explain what I tried in order to narrow it down.
The problem:
I’m trying to run inference with GPT-J (`EleutherAI/gpt-j-6B`) on a very large dataset and therefore want to achieve the highest throughput possible for my setup. I’m using a `p3.16xlarge` instance with 8 V100 GPUs, so in theory I can fit a batch size of more than 1, since DeepSpeed shards the model tensors across the GPUs.
Since the inputs are of different lengths, I have to use padding. This is how I pad (let’s assume `batch_size=4`, so `len(input_texts) == 4`):
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
tokenizer.padding_side = "left"            # pad on the left so generation continues from real tokens
tokenizer.pad_token = tokenizer.eos_token  # GPT-J has no dedicated pad token, reuse eos

# input_texts and args.max_new_tokens are defined elsewhere in the script
tokenized_inputs = tokenizer(
    list(input_texts),
    return_tensors='pt',
    padding=True,
    max_length=tokenizer.model_max_length - args.max_new_tokens,
    truncation=True,
)
```
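To make the padding concrete, here is a toy illustration (not part of my actual run; the inputs are made up): with left padding, the shorter row gets filled with the eos/pad id on the left and its `attention_mask` starts with zeros.

```python
# Toy batch (hypothetical inputs), using the tokenizer configured above.
demo = tokenizer(["a noticeably longer example input text", "short"],
                 return_tensors='pt', padding=True)
print(demo['input_ids'])       # row 1 starts with repeated eos (pad) ids
print(demo['attention_mask'])  # row 0 is all ones, row 1 looks like [0, 0, ..., 0, 1]
```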
Now to the problem. Assume these are the sequence lengths of each input (in number of tokens, of course):
- idx0: 1452
- idx1: 1588
- idx2: 1055
- idx3: 650
The outputs I get from the model are exactly what I expect for idx1 (since it’s the longest and has no padding), very close to what I expect for idx0, but terrible for idx3.
What I “expect” is what I get either when I run the exact same code with DeepSpeed with `batch_size=1`, or when I run the same code without DeepSpeed on CPU with `batch_size=4`.
In both of these cases (DeepSpeed bsz=1 and CPU bsz=4) the outputs are identical, and they also make sense (it’s an extraction task, so I can tell whether the output makes sense or not).
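In case it helps, this is roughly how I compare the two settings (a sketch only; `generate_batch` is a hypothetical helper standing in for the tokenize + `model.generate` + `batch_decode` steps shown under “To Reproduce” below):

```python
# Sketch of the comparison. `generate_batch` is a hypothetical helper wrapping the
# tokenize -> model.generate -> batch_decode steps from "To Reproduce".
# Greedy decoding (do_sample=False) makes exact string comparison meaningful.
reference = [generate_batch([text])[0] for text in input_texts]  # bsz=1, no padding
batched = generate_batch(list(input_texts))                      # bsz=4, left padding
for i, (ref, out) in enumerate(zip(reference, batched)):
    print(f"idx{i}: {'MATCH' if ref == out else 'MISMATCH'}")
```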
I tried to figure out what exactly causes this problem, and based on the evidence I’ve gathered I think that sequences with long padding on the left side somehow accumulate a huge attention weight that is not correctly masked by the attention mask. My evidence:
- If I run with DeepSpeed, bsz=4 and `torch.float16`, the outputs I get are “!!!” (no matter the prompt). But if I run it with `torch.float32` I get “normal” outputs, though, as I said, they differ from what I expect (defined above). So this makes me think some tensor overflows with fp16 but not with fp32. I should also mention that running with DeepSpeed with fp16 and bsz=1 works perfectly.
- The longest input in the batch (which has no padding at all) gives the expected result. Inputs close to it in length have only a slightly weird output (small amount of padding tokens). Inputs that are much shorter (many padding tokens) have highly unrelated output.
- The way the GPT-J attention mechanism works (at least the HuggingFace implementation) is that you add -10,000 to the attention weight wherever the attention mask is 0 (see the sketch after this list). This might not be enough if the many padding tokens accumulate a large attention weight. Although when I run it on CPU with the HuggingFace implementation everything is OK, so it might not be the reason.
- I’m pretty sure that the culprit is this function: https://github.com/microsoft/DeepSpeed/blob/a10e4811fe78b707289132c9695bade4715fe59b/csrc/transformer/inference/csrc/softmax.cu#L203. Unfortunately I don’t speak CUDA, so it’s very hard for me to follow and point out exactly what the problem is. For all I know, the HuggingFace implementation of attention works (https://github.com/huggingface/transformers/blob/2c2a31ffbcfe03339b1721348781aac4fc05bc5e/src/transformers/models/gptj/modeling_gptj.py#L72).
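Here is a standalone sketch of the additive-masking point from the list above (plain PyTorch, not DeepSpeed’s kernel): when a large negative bias is added to the scores of padded positions before the softmax, those positions end up with roughly zero weight; if the mask is not applied (or applied too late), the padded positions keep real probability mass, which matches what the heavily padded sequences in my batch seem to suffer from.

```python
import torch

# Standalone sketch of HuggingFace-style additive attention masking (not DeepSpeed's kernel).
# scores: raw attention logits of one query over a sequence whose first 3 positions are padding.
scores = torch.randn(1, 8)
attention_mask = torch.tensor([[0, 0, 0, 1, 1, 1, 1, 1]])

# Additive mask: a large negative bias (-10000 here) wherever attention_mask == 0.
bias = (1.0 - attention_mask.float()) * -10000.0
masked = torch.softmax(scores + bias, dim=-1)
print(masked[0, :3])    # ~0 for the padded positions when masking is applied correctly

# Without the bias, the padded positions keep non-trivial probability mass.
unmasked = torch.softmax(scores, dim=-1)
print(unmasked[0, :3])
```

In float16 this kind of issue can also surface as overflows/NaNs, which would be consistent with the “!!!” outputs.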
To Reproduce
```python
import os

import torch
import deepspeed
from transformers import AutoTokenizer, GPTJForCausalLM

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token
tokenizer.model_max_length = 2048

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))
device = torch.device(f'cuda:{local_rank}')

model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")
model.config.pad_token_id = model.config.eos_token_id

model = deepspeed.init_inference(
    model,
    mp_size=world_size,
    dtype=torch.float32,        # torch.float16 gives the "!!!" outputs described above
    replace_method='auto',
    replace_with_kernel_inject=True,
)
model.device = device

# input_texts and args come from the surrounding script
tokenized_inputs = tokenizer(
    list(input_texts),
    return_tensors='pt',
    padding=True,
    max_length=tokenizer.model_max_length - args.max_new_tokens,
    truncation=True,
).to(device)

with torch.inference_mode():
    batch_output_tokens = model.generate(
        input_ids=tokenized_inputs['input_ids'],
        attention_mask=tokenized_inputs['attention_mask'],
        do_sample=False,        # greedy decoding, so outputs should be deterministic
        max_new_tokens=args.max_new_tokens,
        min_length=tokenized_inputs.input_ids.shape[1] + args.max_new_tokens,
        repetition_penalty=1.1,
        pad_token_id=tokenizer.eos_token_id,
    )

batch_output_text = tokenizer.batch_decode(batch_output_tokens, skip_special_tokens=True)
```
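One small note on the decode step (not part of the bug itself): `generate` returns the prompt tokens followed by the new tokens, and because the padding is on the left, every prompt ends at the same column, so a single slice isolates just the generated text.

```python
# Optional: decode only the newly generated tokens. With left padding, all prompts
# occupy the first `prompt_len` columns of the output, so one slice works for the whole batch.
prompt_len = tokenized_inputs['input_ids'].shape[1]
batch_new_text = tokenizer.batch_decode(batch_output_tokens[:, prompt_len:],
                                        skip_special_tokens=True)
```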
Expected behavior
Running DeepSpeed with `batch_size=1` or `batch_size=4` (or larger) should give the same outputs.
Running DeepSpeed with fp16 and batch size > 1 should work and not give “!!!”.
ds_report output
```
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
async_io ............... [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/opt/conda/lib/python3.6/site-packages/torch']
torch version .................... 1.10.2+cu111
torch cuda version ............... 11.1
nvcc version ..................... 11.1
deepspeed install path ........... ['/opt/conda/lib/python3.6/site-packages/deepspeed']
deepspeed info ................... 0.6.0+2151c78, 2151c78, master
deepspeed wheel compiled w. ...... torch 1.10, cuda 11.1
```
System info (please complete the following information):
SageMaker instance `p3.16xlarge` with SageMaker container `763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.8.1-gpu-py36-cu111-ubuntu18.04`
Launcher context
Launching with `deepspeed --num_gpus 8 run_inference.py`
Docker context
See above.
Top GitHub Comments
Hi guys,
Sorry for my delay here! @codertimo Yes, you are right that the padding is not handled correctly for this model at the softmax kernel. This has been fixed very recently for the BLOOM model and I am going to work on fixing it for the rest of the models too. I am going to focus on this more and send a PR with a fix soon.
Thanks, Reza
Thanks @RezaYazdaniAminabadi for fixing this!
Commit 4abd455521965930d0e921de8afc0073ea7df9d1 from the PR you mentioned fixes the problem when I tested it using a HuggingFace `gpt2` model. By the way: the commit aafba00c81eaf29c0c2b209a94bc31f4de942936 before it still had the bug.
I wasn’t able to test the PR on longer input sequences, though. The model seems to produce wrong/non-deterministic outputs there due to https://github.com/microsoft/DeepSpeed/issues/2243. You mentioned that you might have a fix for that issue, too. Once you merge the fix for the latter issue, I will go ahead and also test it on the longer input sequences.