[BUG] GPT-J + init_inference + replace_with_kernel_inject returns copy error with multiple GPUs
Describe the bug
Using the replace_with_kernel_inject option in init_inference returns a copy error when using multiple GPUs with a GPT-J model.
To Reproduce
Steps to reproduce the behavior:
- Create an inference script using HF Transformers and GPT-J (see below)
- Run the deepspeed command with multiple GPUs (a sketch of the launch command follows the script)
```python
import os

import torch
import deepspeed
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import pipeline as t_pipeline

# Rank and world size are set by the DeepSpeed launcher.
local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))

model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
generator = t_pipeline('text-generation', model=model, tokenizer=tokenizer,
                       eos_token_id=50256, device=local_rank)

# Wrap the model for tensor-parallel inference with kernel injection.
generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,
                                           dtype=torch.float16,
                                           replace_method='auto',
                                           replace_with_kernel_inject=True)

input_list = ["This is the input "]
res_ds = generator(input_list, do_sample=True, max_length=1000,
                   eos_token_id=50256, temperature=0.25, pad_token_id=50257)
```
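For step 2, a typical multi-GPU launch with the DeepSpeed launcher looks like the following; the filename infer_gptj.py is a hypothetical placeholder, since the reporter did not share theirs:

```bash
# Launch the reproduction script above on all 8 GPUs of the machine.
deepspeed --num_gpus 8 infer_gptj.py
```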
Expected behavior
No error.
ds_report output
Unavailable; not currently on the compute node.
System info:
- OS: Linux (Ubuntu)
- Hardware: one machine with 8x A100 40GB PCIe GPUs
- Python 3.8
- Docker image: pytorch/pytorch:1.9.1-cuda11.1-cudnn8-devel
Launcher context
DeepSpeed command line (see the launch sketch above).
Docker context
Base image: pytorch/pytorch:1.9.1-cuda11.1-cudnn8-devel
Additional context
- The problem does not exist when replace_with_kernel_inject is set to False (a sketch of this working configuration follows the list).
- Things work fine with replace_with_kernel_inject = True when running the script directly on a single GPU.
- The error appears to come from here: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/module_inject/replace_module.py#L74
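As a point of comparison, a minimal sketch of the reportedly working configuration, i.e. the same init_inference call from the reproduction script above with kernel injection disabled:

```python
# Same setup as the reproduction script above; only replace_with_kernel_inject changes.
generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,      # one rank per GPU
                                           dtype=torch.float16,
                                           replace_method='auto',
                                           replace_with_kernel_inject=False)  # no copy error, per the report
```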
Issue Analytics
- Created: 2 years ago
- Comments: 12 (5 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Hi @TiesdeKok, I am also facing the garbage-output issue. Not sure if it is related to the issue you were having previously: https://github.com/microsoft/DeepSpeed/issues/2113
Hi @TiesdeKok, I think taking a look at this issue I opened might be relevant to your use case: https://github.com/microsoft/DeepSpeed/issues/1797. It at least explains why you got the exclamation-mark outputs, and it should also draw your attention to the outputs you're getting in case you pad some of your inputs.
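To illustrate the padding caveat in the comment above, here is a minimal sketch of padded batch generation with GPT-J; the left-padding setup is the usual convention for decoder-only models and is an assumption on my part, not something specified in this issue:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Decoder-only models like GPT-J are usually left-padded for batched generation,
# and an attention_mask must be passed so padded positions are ignored.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B", padding_side="left")
tokenizer.pad_token = tokenizer.eos_token  # GPT-J has no dedicated pad token

model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B",
                                             torch_dtype=torch.float16).to("cuda")

batch = tokenizer(["This is the input ", "A noticeably longer second input "],
                  return_tensors="pt", padding=True).to("cuda")
outputs = model.generate(batch["input_ids"],
                         attention_mask=batch["attention_mask"],
                         do_sample=True, max_length=100, temperature=0.25,
                         pad_token_id=tokenizer.eos_token_id)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```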