[BUG] Inference with batch size > 1 and long inputs
Describe the bug
Responses from transformers models are incorrect with long inputs and batch size > 1. This report concerns GPT-like models; a separate issue covers RoBERTa. The example code uses a small gpt2 model so everything is easy to test, but the problem also occurs with large models such as GPT-J, and a similar issue exists with OPT.
Please contact me if more details are needed. I would also be glad to help resolve this issue in any way I can.
To Reproduce
Code to reproduce:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import deepspeed
model_id = "gpt2"
VERBOSE = True
BATCH_SIZE = 4
EXAMPLE = "DeepSpeed-Training\n" \
"DeepSpeed offers a confluence of system innovations, that has made large " \
"scale DL training effective, and efficient, greatly improved ease of use, " \
"and redefined the DL training landscape in terms of scale that is possible. " \
"These innovations such as ZeRO, 3D-Parallelism, DeepSpeed-MoE, ZeRO-Infinity, " \
"etc. fall under the training pillar. Learn more: DeepSpeed-Training\n" \
"DeepSpeed-Inference\n" \
"DeepSpeed brings together innovations in parallelism technology such as tensor, " \
"pipeline, expert and ZeRO-parallelism, and combines them with high performance " \
"custom inference kernels, communication optimizations and heterogeneous memory " \
"technologies to enable inference at an unprecedented scale, while achieving " \
"unparalleled latency, throughput and cost reduction. This systematic composition " \
"of system technologies for inference falls under the inference pillar. " \
"Learn more: DeepSpeed-Inference\n" \
"Model Implementations for Inference (MII)\n" \
"Model Implementations for Inference (MII) is an open-sourced repository " \
"for making low-latency and high-throughput inference accessible to all " \
"data scientists by alleviating the need to apply complex system optimization " \
"techniques themselves. Out-of-box, MII offers support for thousands of " \
"widely used DL models, optimized using DeepSpeed-Inference, that can be " \
"deployed with a few lines of code, while achieving significant latency " \
"reduction compared to their vanilla open-sourced versions.\n" \
"DeepSpeed on Azure" \
"\nDeepSpeed users are diverse and have access to different environments. " \
"We recommend to try DeepSpeed on Azure as it is the simplest and easiest " \
"method. The recommended method to try DeepSpeed on Azure is through AzureML " \
"recipes. The job submission and data preparation scripts have been made " \
"available here. For more details on how to use DeepSpeed on Azure, please " \
"follow the Azure tutorial."
GENERATION_KWARGS = {
    "max_new_tokens": 4,
    "do_sample": False,
}
torch_model = AutoModelForCausalLM.from_pretrained(model_id).half().eval().to(0)
tokenizer = AutoTokenizer.from_pretrained(model_id)
def call_model(model, input_text, batch_size, desc="", verbose=False):
    assert batch_size > 0
    # Repeat the same prompt batch_size times, so every row of the batch is identical.
    inputs = tokenizer([input_text] * batch_size, return_tensors="pt").to(0)
    if verbose:
        print(desc)
        print(f"Batch size: {batch_size}")
        print(f"Input size: {inputs.input_ids.size()}")
    outputs = model.generate(**inputs, **GENERATION_KWARGS)
    outputs_set = []
    for i, output_ids in enumerate(outputs):
        text_output = tokenizer.decode(output_ids)
        # Keep only the newly generated continuation.
        output = text_output[len(input_text):]
        outputs_set.append(output)
        if verbose:
            print(f"#{i}: {output}")
    # assert len(set(outputs_set)) == 1  # raises with DeepSpeed inference
    return output
call_model(
    model=torch_model,
    input_text=EXAMPLE,
    batch_size=BATCH_SIZE,
    desc="Torch",
    verbose=VERBOSE,
)
ds_model = deepspeed.init_inference(
    model=torch_model,
    mp_size=1,
    dtype=torch.float16,
    replace_method="auto",
    replace_with_kernel_inject=True,
)
call_model(
    model=ds_model,
    input_text=EXAMPLE,
    batch_size=BATCH_SIZE,
    desc="Deepspeed",
    verbose=VERBOSE,
)
Output:
Torch
Batch size: 4
Input size: torch.Size([4, 370])
#0:
DeepSpeed on
#1:
DeepSpeed on
#2:
DeepSpeed on
#3:
DeepSpeed on
Deepspeed
Batch size: 4
Input size: torch.Size([4, 370])
#0:
DeepSpeed on
#1:
-
#2:
DeepSpeed-
#3:
---
Expected behavior
With do_sample=False (greedy decoding) and identical inputs in the batch, all outputs are expected to be identical. Under DeepSpeed, only the first output is correct and matches the torch output; see the output section under the example code.
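The expectation can also be turned into an executable check. Below is a minimal sketch, assuming the repro script above has already been run so that tokenizer, EXAMPLE, GENERATION_KWARGS, BATCH_SIZE and ds_model are in scope; check_batch_consistency is a hypothetical helper, not part of any library. It compares batched greedy generation against the batch-size-1 result, which must agree exactly:

def check_batch_consistency(model, input_text, batch_size):
    # Identical prompts, batched vs. a single example.
    batched = tokenizer([input_text] * batch_size, return_tensors="pt").to(0)
    single = tokenizer([input_text], return_tensors="pt").to(0)

    batched_out = model.generate(**batched, **GENERATION_KWARGS)
    single_out = model.generate(**single, **GENERATION_KWARGS)

    reference = tokenizer.decode(single_out[0])
    for i, output_ids in enumerate(batched_out):
        decoded = tokenizer.decode(output_ids)
        # With do_sample=False every row must match the batch-size-1 output.
        assert decoded == reference, f"row {i} diverged:\n{decoded!r}"

check_batch_consistency(ds_model, EXAMPLE, BATCH_SIZE)  # fails with kernel injection

The same call with torch_model passes, which points at the injected inference kernels rather than at generate() itself.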
ds_report output
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.8/dist-packages/torch']
torch version .................... 1.11.0+cu113
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.3
deepspeed install path ........... ['/usr/local/lib/python3.8/dist-packages/deepspeed']
deepspeed info ................... 0.7.4+99326438, 99326438, master
deepspeed wheel compiled w. ...... torch 1.11, cuda 11.3
System info (please complete the following information):
- OS: Ubuntu 20.04.2 LTS
- GPU count and types: 1x A100
- Python version: 3.8.10
- Used main branch of DeepSpeed repository
Launcher context
deepspeed main.py
Docker context
FROM nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04
...
RUN pip3 install --no-cache-dir git+https://github.com/microsoft/DeepSpeed.git
Top GitHub Comments
Can you please try this PR to see if the issue is resolved? Thanks, Reza
I might need to direct this issue to the appropriate contributors for a fast reply: @RezaYazdaniAminabadi, @cmikeh2.