load_checkpoint_and_dispatch "Expected all tensors to be on the same device" for > 1 GPU devices
Hello all!
I am reporting an issue with the example posted here: https://colab.research.google.com/drive/14wnxMvD9zsiBQo2FtTpxn6w2cpXCcb-7#scrollTo=ZeA_LQJ3cGbL&uniqifier=1
Essentially, load_checkpoint_and_dispatch does not seem to work when the device map contains no disk or cpu entries, only 0 and 1 for two GPUs.
The device map:
{'decoder.embed_tokens': 0,
'decoder.embed_positions': 0,
'decoder.layers.0': 0,
'decoder.layers.1': 0,
'decoder.layers.2': 0,
...
'decoder.layers.27': 1,
'decoder.layers.28': 1,
'decoder.layers.29': 1,
'decoder.layers.30': 1,
'decoder.layers.31': 1}
Code for reproducing the issue:
from huggingface_hub import snapshot_download
checkpoint = 'facebook/opt-2.7b'
weights_path = snapshot_download(checkpoint)
import os
files = os.listdir(weights_path)
weights_path = os.path.join(weights_path, 'pytorch_model.bin') if 'pytorch_model.bin' in files else weights_path
from accelerate import init_empty_weights, dispatch_model, infer_auto_device_map, load_checkpoint_and_dispatch, load_checkpoint_in_model
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
config = AutoConfig.from_pretrained(checkpoint)
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)
model.tie_weights()
max_mem = 4686198491  # ~4.7 GB per GPU
device_map = infer_auto_device_map(
    model.model,
    max_memory={0: max_mem, 1: max_mem},
    no_split_module_classes=["OPTDecoderLayer"],
    dtype='float16'
)
print(device_map)
load_checkpoint_and_dispatch(
    model.model,
    weights_path,
    device_map=device_map,
    offload_folder=None,
    dtype='float16',
    offload_state_dict=True
)
model.tie_weights()
inputs = tokenizer("Hugging Face is pushing the convention that a unicorn with two horns becomes a llama.", return_tensors="pt")
output = model.generate(inputs["input_ids"].to(0), max_length=50, do_sample=True)
print(tokenizer.decode(output[0].tolist()))
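Not part of the original report, but as a hedged sanity check under the same setup, the resulting placement can be inspected by grouping parameters by device after dispatch:

# Hedged illustration: after load_checkpoint_and_dispatch, count parameters
# per device to confirm the split matches the inferred device_map.
from collections import defaultdict
params_per_device = defaultdict(int)
for name, param in model.named_parameters():
    params_per_device[str(param.device)] += param.numel()
for device, count in sorted(params_per_device.items()):
    print(device, f"{count / 1e6:.1f}M parameters")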
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Thanks, I can reproduce indeed. This isn't a bug in Accelerate but comes from the workaround used in this example; that's why you have those model.model references. This will all be simplified in the coming weeks when we integrate the new tools of Accelerate inside Transformers, but for now, you can fix the issue by replacing the code from load_checkpoint_and_dispatch onward with a version that loads the checkpoint into model.model and then dispatches the full model.
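A hedged sketch of that approach, reusing device_map and weights_path from the reproduction above; the re-keyed device map and the lm_head placement are illustrative assumptions, not the maintainer's verbatim fix:

# Hedged sketch, not the maintainer's verbatim code: load the weights into
# model.model, then dispatch the full model so lm_head also gets a device.
load_checkpoint_in_model(
    model.model,
    weights_path,
    device_map=device_map,
    dtype='float16',
    offload_state_dict=True
)
model.tie_weights()
# Re-key the device map for the full model and place the tied lm_head on the
# same device as the embeddings (an assumption made for this illustration).
full_device_map = {f"model.{k}": v for k, v in device_map.items()}
full_device_map["lm_head"] = device_map["decoder.embed_tokens"]
dispatch_model(model, device_map=full_device_map)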
Thanks! @sgugger This also works for me when the inference batch size is set to one (as in your code). However, increasing the batch size (i.e. passing multiple inputs) can lead to CUDA OOM. I suppose the reason might be that infer_auto_device_map tries to maximize GPU usage when allocating the parameters across devices. Do you think there is any way to support inference on multiple inputs for a speedup?
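One common mitigation, offered here only as a hedged sketch rather than an answer from the thread, is to leave headroom for activations by passing max_memory limits below the actual GPU capacity when inferring the device map; the 3 GiB budget below is an assumption to be tuned for the actual GPUs and batch sizes:

# Hedged sketch (not from the thread): reserve headroom for activations so
# larger batches fit at inference time.
headroom_mem = 3 * 1024 ** 3
device_map = infer_auto_device_map(
    model.model,
    max_memory={0: headroom_mem, 1: headroom_mem},
    no_split_module_classes=["OPTDecoderLayer"],
    dtype='float16'
)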