
[BUG] GPT models fail for long inputs and/or outputs during inference

See original GitHub issue

Describe the bug

When using GPT-J or GPT-Neo 2.7B with DeepSpeed inference, if you give it the short, simple prompt “DeepSpeed is” from the tutorial and generate only 50 tokens or so, everything works.

However, when you give the model a long input (around 1000 tokens), and/or when you give a short input and ask it to generate many tokens, the system breaks.

Over many attempts to fix the issue, I have gotten errors similar to #2062, where illegal memory is accessed, as well as nan/inf errors. Sometimes the model does not error out but instead produces garbage output once a certain length is reached, similar to #2233.

To Reproduce

Steps to reproduce the behavior:

  1. Install torch, transformers, etc.
  2. Install DeepSpeed, either from source, from the latest tag, or from one of the unmerged PRs I reference
  3. Put a long text input in “input_data_long.txt” (the file the script below reads; see the sketch after the note below for one way to build such a file)
  4. Run the code below
  5. Notice bad results in one of the forms described above

Note that when the min length is not specified, the model sometimes generates a few tokens and then stops. Specifying a long min_length guarantees issues (see the variant call after the script).
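If you do not have a long prompt handy, a quick way to produce one (a hypothetical helper, not part of the original report) is to repeat a seed sentence until it tokenizes to well over 1000 tokens:

# Hypothetical helper: write a ~1200-token prompt to the file the script below reads.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
seed = ("DeepSpeed is a deep learning optimization library that makes "
        "distributed training and inference easy, efficient, and effective. ")
text = seed
while len(tok(text)["input_ids"]) < 1200:   # grow the prompt until ~1200 tokens
    text += seed

with open("input_data_long.txt", "w") as f:
    f.write(text)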

import os
import argparse

import torch
import deepspeed
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForCausalLM

# os.environ["CUDA_LAUNCH_BLOCKING"] = '1'

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('-m', "--model_name", type=str, default='EleutherAI/gpt-j-6B')
    parser.add_argument('--local_rank', type=int, default=-1,
                        help='local rank passed from distributed launcher')
    parser = deepspeed.add_config_arguments(parser)
    args = parser.parse_args()
    model_name = args.model_name

    # Long prompt (roughly 1000+ tokens) that triggers the failure
    with open('input_data_long.txt', 'r') as f:
        input_text = f.read()

    local_rank = int(os.getenv('LOCAL_RANK', '0'))
    world_size = int(os.getenv('WORLD_SIZE', '1'))

    # Load the model in fp16 and build a standard transformers pipeline
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    generator = pipeline('text-generation', model=model, tokenizer=tokenizer,
                         device=local_rank, torch_dtype=torch.float16)

    # Swap in the DeepSpeed inference engine with kernel injection
    generator.model = deepspeed.init_inference(generator.model,
                                               mp_size=world_size,
                                               dtype=torch.half,
                                               replace_method='auto',
                                               replace_with_kernel_inject=True)
#    torch.cuda.synchronize()

    string = generator(input_text, do_sample=True, max_length=2047)
#    torch.cuda.synchronize()

    if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
        print(string)
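As noted above, explicitly forcing a long minimum output length makes the failure reproducible on every run. A variant of the generate call in the script (the min_length value here is only illustrative, not taken from the original report):

# Illustrative variant of the generate call above; a long min_length
# forces generation past the point where output degrades or crashes.
string = generator(input_text, do_sample=True, min_length=1500, max_length=2047)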

Expected behavior

I would expect that, given one or multiple GPUs, one could use DeepSpeed inference with these GPT models on an input of any length, generate up to the maximum number of tokens, and get valid results.
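One sanity check worth doing here (a small sketch, not part of the original script) is to compare the tokenized prompt length against the 2048-token context window of GPT-J and GPT-Neo 2.7B, since max_length=2047 counts the prompt plus the generated tokens:

# Sketch: confirm the prompt fits the model's context window before generating.
prompt_len = tokenizer(input_text, return_tensors="pt")["input_ids"].shape[-1]
context_limit = getattr(model.config, "max_position_embeddings", 2048)
print(f"prompt tokens: {prompt_len}, context limit: {context_limit}")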

ds_report output


DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja … [OKAY]

op name … installed … compatible

cpu_adam … [YES] … [OKAY]
cpu_adagrad … [YES] … [OKAY]
fused_adam … [YES] … [OKAY]
fused_lamb … [YES] … [OKAY]
sparse_attn … [YES] … [OKAY]
transformer … [YES] … [OKAY]
stochastic_transformer . [YES] … [OKAY]
async_io … [YES] … [OKAY]
utils … [YES] … [OKAY]
quantizer … [YES] … [OKAY]
transformer_inference … [YES] … [OKAY]

DeepSpeed general environment info:
torch install path … ['/root/anaconda3/envs/gpt/lib/python3.9/site-packages/torch']
torch version … 1.12.0
torch cuda version … 11.3
torch hip version … None
nvcc version … 11.3
deepspeed install path … ['/root/anaconda3/envs/gpt/lib/python3.9/site-packages/deepspeed']
deepspeed info … 0.7.3+89f2dedf, 89f2dedf, cholmes/fix-long-seq-len-inference
deepspeed wheel compiled w. … torch 1.12, cuda 11.3

Screenshots

N/A

System info (please complete the following information):

  • OS: Ubuntu 20.04
  • GPU count and types: 2x RTX 3090
  • Interconnects: single machine with 2x RTX 3090
  • Python version: 3.9.13

Launcher context

deepspeed --num_gpus 1 infer.py

deepspeed --num_gpus 2 infer.py

Docker context

Using an NVIDIA CUDA container with conda installed

Additional context

I believe related issues could be #2062 and #2212, and related PRs could be #2212 and #2280. For the PRs, I have tried building from source, and it did not resolve the issue. One of them led to fewer errors but tended to produce poor results (I believe it is the one specified in the ds_report output above).

I also tried rolling back to before 0.6.6, as I read that someone had success doing so, and I also tried building from master, without success.
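When juggling builds from tags, source, and PR branches, it can help to confirm which DeepSpeed build the environment actually imports (the same information ds_report prints above):

# Quick check of the DeepSpeed build actually importable in this environment.
import deepspeed
print(deepspeed.__version__)   # e.g. 0.7.3+89f2dedf for the branch shown in ds_report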

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 11 (10 by maintainers)

Top GitHub Comments

1 reaction
andrewchernyh commented, Sep 27, 2022

Hi @RezaYazdaniAminabadi, I think in the real world the attention mask should always be passed, because batching does not work without it; for GPT this requires left padding and a correct position_ids calculation. FasterTransformer has such a feature, called interactive generation, controlled by a boolean flag, and I think it would be good to have the same in DeepSpeed.
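For context, the setup described here looks roughly like the following in plain transformers (a hedged sketch of left padding, an explicit attention mask, and position ids derived from that mask; not a DeepSpeed-specific API):

# Sketch: batched GPT generation with left padding and an explicit attention_mask.
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

batch = tokenizer(["DeepSpeed is", "a much longer second prompt ..."],
                  return_tensors="pt", padding=True)
attention_mask = batch["attention_mask"]

# position_ids start at 0 at each row's first real token; padded positions are
# masked out anyway. transformers derives these internally from the attention
# mask during generate(), shown here only to illustrate the calculation.
position_ids = attention_mask.cumsum(-1) - 1
position_ids.masked_fill_(attention_mask == 0, 1)

device = next(model.parameters()).device
outputs = model.generate(batch["input_ids"].to(device),
                         attention_mask=attention_mask.to(device),
                         max_new_tokens=50)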

0 reactions
tjruwase commented, Nov 4, 2022

Fixed, and so closing. Please (re)open if needed.
