
[BUG] Inference with batch size > 1 and long inputs

See original GitHub issue

Describe the bug

Responses from transformers models are not correct with long inputs and batch size > 1. This issue concerns GPT-like models, while a separate, linked issue covers RoBERTa. In the example code I use a small GPT-2 model to keep the reproduction simple, but the same problem occurs with large models such as GPT-J, and a similar issue exists with OPT.

Please contact me if more details are needed. And if I can help resolve this issue in any way, it would be my pleasure.

To Reproduce

Code to reproduce:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import deepspeed

model_id = "gpt2"

VERBOSE = True
BATCH_SIZE = 4

EXAMPLE = "DeepSpeed-Training\n" \
          "DeepSpeed offers a confluence of system innovations, that has made large " \
          "scale DL training effective, and efficient, greatly improved ease of use, " \
          "and redefined the DL training landscape in terms of scale that is possible. " \
          "These innovations such as ZeRO, 3D-Parallelism, DeepSpeed-MoE, ZeRO-Infinity, " \
          "etc. fall under the training pillar. Learn more: DeepSpeed-Training\n" \
          "DeepSpeed-Inference\n" \
          "DeepSpeed brings together innovations in parallelism technology such as tensor, " \
          "pipeline, expert and ZeRO-parallelism, and combines them with high performance " \
          "custom inference kernels, communication optimizations and heterogeneous memory " \
          "technologies to enable inference at an unprecedented scale, while achieving " \
          "unparalleled latency, throughput and cost reduction. This systematic composition " \
          "of system technologies for inference falls under the inference pillar. " \
          "Learn more: DeepSpeed-Inference\n" \
          "Model Implementations for Inference (MII)\n" \
          "Model Implementations for Inference (MII) is an open-sourced repository " \
          "for making low-latency and high-throughput inference accessible to all " \
          "data scientists by alleviating the need to apply complex system optimization " \
          "techniques themselves. Out-of-box, MII offers support for thousands of " \
          "widely used DL models, optimized using DeepSpeed-Inference, that can be " \
          "deployed with a few lines of code, while achieving significant latency " \
          "reduction compared to their vanilla open-sourced versions.\n" \
          "DeepSpeed on Azure" \
          "\nDeepSpeed users are diverse and have access to different environments. " \
          "We recommend to try DeepSpeed on Azure as it is the simplest and easiest " \
          "method. The recommended method to try DeepSpeed on Azure is through AzureML " \
          "recipes. The job submission and data preparation scripts have been made " \
          "available here. For more details on how to use DeepSpeed on Azure, please " \
          "follow the Azure tutorial."

GENERATION_KWARGS = {
    "max_new_tokens": 4,
    'do_sample': False,
}

torch_model = AutoModelForCausalLM.from_pretrained(model_id).half().eval().to(0)

tokenizer = AutoTokenizer.from_pretrained(model_id)


def call_model(model, input_text, batch_size, desc="", verbose=False):
    assert batch_size > 0

    # The same prompt is repeated batch_size times, so all rows have equal length
    # and no padding is needed.
    inputs = tokenizer([input_text] * batch_size, return_tensors='pt').to(0)

    if verbose:
        print(desc)
        print(f"Batch size: {batch_size}")
        print(f"Input size: {inputs.input_ids.size()}")

    outputs = model.generate(**inputs, **GENERATION_KWARGS)
    outputs_set = list()
    output = None
    for i, output in enumerate(outputs):
        text_output = tokenizer.decode(output)
        # Keep only the newly generated continuation by stripping the prompt text.
        output = text_output[len(input_text):]
        outputs_set.append(output)
        if verbose:
            print(f"#{i}: {output}")
    # With greedy decoding and identical prompts, every batch element should
    # produce the same continuation.
    # assert len(set(outputs_set)) == 1  # raises with DeepSpeed inference
    return output


call_model(
    model=torch_model,
    input_text=EXAMPLE,
    batch_size=BATCH_SIZE,
    desc="Torch",
    verbose=VERBOSE
)

ds_model = deepspeed.init_inference(
    model=torch_model,
    mp_size=1,
    dtype=torch.float16,
    replace_method="auto",
    replace_with_kernel_inject=True,
)

call_model(
    model=ds_model,
    input_text=EXAMPLE,
    batch_size=BATCH_SIZE,
    desc="Deepspeed",
    verbose=VERBOSE
)

Output:

Torch
Batch size: 4
Input size: torch.Size([4, 370])
#0: 
DeepSpeed on
#1: 
DeepSpeed on
#2: 
DeepSpeed on
#3: 
DeepSpeed on

Deepspeed
Batch size: 4
Input size: torch.Size([4, 370])
#0: 
DeepSpeed on
#1: 


-
#2: 
DeepSpeed-
#3: 
---

Expected behavior

With do_sample=False, all outputs should be identical when the inputs in the batch are identical. With DeepSpeed inference, only the first output is correct and matches the torch output; see the output section under the example code.
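
As a minimal sketch (an illustration, not part of the original report), the consistency property described above can be expressed as a standalone check on the decoded continuations; it is the same condition as the commented-out assert in the reproduction script:

# Hypothetical helper: under greedy decoding, identical prompts in a batch must
# yield identical continuations, so any divergence indicates a batching bug.
def outputs_are_consistent(decoded_outputs):
    # decoded_outputs: list of generated strings, one per batch element
    return len(set(decoded_outputs)) == 1

With the torch model the four continuations above are identical, so this check passes; with the DeepSpeed-injected model they differ, so it fails.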

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
 [WARNING]  please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.8/dist-packages/torch']
torch version .................... 1.11.0+cu113
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.3
deepspeed install path ........... ['/usr/local/lib/python3.8/dist-packages/deepspeed']
deepspeed info ................... 0.7.4+99326438, 99326438, master
deepspeed wheel compiled w. ...... torch 1.11, cuda 11.3
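
As a quick cross-check (an assumption about the environment, not part of the original report), the versions reported above can be confirmed from Python:

# Minimal version check; the printed values should match the
# "DeepSpeed general environment info" section of ds_report above.
import torch
import deepspeed

print(torch.__version__)      # expected: 1.11.0+cu113
print(torch.version.cuda)     # expected: 11.3
print(deepspeed.__version__)  # expected: 0.7.4+99326438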

System info (please complete the following information):

  • OS: Ubuntu 20.04.2 LTS
  • GPU count and types: 1x A100
  • Python version: 3.8.10
  • Used main branch of DeepSpeed repository

Launcher context

deepspeed main.py
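
A minimal sanity check (assumed, not taken from the report): with this single-GPU launch the process should see exactly one device, matching the 1x A100 listed above.

# Hypothetical sanity check for the single-process, single-GPU launch above.
import os
import torch

print(os.environ.get("LOCAL_RANK", "0"))  # set by the deepspeed launcher
assert torch.cuda.device_count() == 1     # 1x A100, per the system info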

Docker context

FROM nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04
...
RUN pip3 install --no-cache-dir git+https://github.com/microsoft/DeepSpeed.git

Issue Analytics

  • State: closed
  • Created a year ago
  • Reactions: 1
  • Comments: 10 (3 by maintainers)

Top GitHub Comments

1 reaction
RezaYazdaniAminabadi commented, Oct 20, 2022

Can you please try this PR to see if the issue is resolved? Thanks, Reza

1 reaction
AlekseyKorshuk commented, Sep 26, 2022

I might need to direct this issue to the appropriate contributors for a fast reply: @RezaYazdaniAminabadi, @cmikeh2.


