[BUG][master branch] garbage GPTJ output for multi-gpu inference
Describe the bug
Similar to #2113, this bug concerns garbage output when using multi-GPU inference. In that issue @RezaYazdaniAminabadi made a fix (#2198) that resolved a similar problem for GPT Neo 2.7B; after building from master I can confirm multi-GPU inference now works for GPT Neo 2.7B. For GPT-J, however, the issue remains:
Output from two 3090s for GPT-J:
[{'generated_text': 'DeepSpeed is,: to,,/ &.. by and.. a\n.. and- and.. the,,\n of\n [.,.\n:, &-. and a- the,\n\n). the'}]
Meanwhile, the output from one 3090 for GPT-J:
[{'generated_text': 'DeepSpeed is a leading deep learning framework designed for distributed training and inference on heterogeneous accelerators and CPUs. Our paper (https://arxiv.org/abs/1811.11540) describes an optimized deep architecture and inference engine and'}]
To Reproduce
Steps to reproduce the behavior:
- Install DeepSpeed from source on master
- pip install transformers
- Run with 2 GPUs to get bad output
- Run with 1 GPU to get good output
import os

import deepspeed
import torch
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = 'EleutherAI/gpt-j-6B'
# model_name = "EleutherAI/gpt-neo-2.7B"

# LOCAL_RANK and WORLD_SIZE are set by the deepspeed launcher.
local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))

# Note: the pipeline below reloads the model by name, so these two loads are
# not strictly needed for the repro.
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

generator = pipeline('text-generation', model=model_name, device=local_rank,
                     torch_dtype=torch.float16)

# Replace the HF model with the DeepSpeed inference engine (kernel injection,
# tensor-parallel across world_size GPUs).
generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,
                                           dtype=torch.half,
                                           replace_method='auto',
                                           replace_with_kernel_inject=True)

string = generator("DeepSpeed is", do_sample=True, min_length=50)

# Only print on rank 0 to avoid duplicated output.
if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print(string)
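Not part of the original report: below is a minimal sanity-check variant (a sketch, under the same environment assumptions as the script above) that generates once with the plain Hugging Face pipeline and once after DeepSpeed injection, using greedy decoding so the two outputs can be compared directly. The file name and the max_new_tokens value are arbitrary choices for illustration.

# sanity_check.py (hypothetical) -- compare plain HF output vs. DeepSpeed-injected output.
import os

import deepspeed
import torch
from transformers import pipeline

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))

generator = pipeline('text-generation', model='EleutherAI/gpt-j-6B',
                     device=local_rank, torch_dtype=torch.float16)

# Greedy decoding makes the two generations directly comparable.
baseline = generator("DeepSpeed is", do_sample=False, max_new_tokens=40)

generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,
                                           dtype=torch.half,
                                           replace_method='auto',
                                           replace_with_kernel_inject=True)
injected = generator("DeepSpeed is", do_sample=False, max_new_tokens=40)

if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print("plain HF:  ", baseline)
    print("deepspeed: ", injected)

With one GPU the two outputs should match closely; with two GPUs the injected output is the one that degenerates, which points at the kernel-injection / tensor-parallel path.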
Expected behavior
I would expect coherent output, like the result from a single GPU.
ds_report output
ds_report
DeepSpeed C++/CUDA extension op report
NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.
JIT compiled ops requires ninja
ninja .................. [OKAY]
op name ................ installed .. compatible
cpu_adam ............... [YES] ...... [OKAY]
cpu_adagrad ............ [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
sparse_attn ............ [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]
async_io ............... [YES] ...... [OKAY]
utils .................. [YES] ...... [OKAY]
quantizer .............. [YES] ...... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
DeepSpeed general environment info:
torch install path ............... ['/root/anaconda3/envs/gpt/lib/python3.9/site-packages/torch']
torch version .................... 1.12.0
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.3
deepspeed install path ........... ['/root/anaconda3/envs/gpt/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.7.1+7d8ad45, 7d8ad45, master
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.3
System info (please complete the following information):
- OS: Ubuntu 20.04
- GPU count and types: 2x NVIDIA RTX 3090
- Interconnects: single system with the two 3090s (no multi-node interconnect)
- Python version: 3.9.13
I am using a Docker container with NVIDIA CUDA already set up as the base image.
Launcher context
deepspeed --num_gpus 2 infer.py
deepspeed --num_gpus 1 infer.py
Docker context
Are you using a specific docker image that you can share? Yes: nvidia/cuda:11.3.1-devel-ubuntu20.04; I then build the Python packages into the container.
Additional context
NA
Top GitHub Comments
TL;DR: based on a quick test, it looks good.
Hi @RezaYazdaniAminabadi,
I just switched to and built your branch ds-inference/fix-mp2, which gives deepspeed version 0.7.3+9eea4ee4.
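As a quick check that the rebuilt branch is the one actually being imported (a minimal sketch, not from the original comment; the version string depends on the installed commit):

import deepspeed
print(deepspeed.__version__)  # expected to show 0.7.3+9eea4ee4 for this build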
Testing with a modified version of the script pasted above, the results look much better when executed over 4x A100 GPUs.
I can also confirm that deepspeed --num_gpus 1 gpt-j-6b-generation.py with a single A100 GPU still works.
Hi @skiingpacman, @mallorbc, can I ask if you got a chance to try this? I want to merge this PR ASAP if it works and fixes the issue. Thanks, Reza