
load_checkpoint_and_dispatch "Expected all tensors to be on the same device" for > 1 GPU devices

Hello all!

I am reporting an issue with the example posted here: https://colab.research.google.com/drive/14wnxMvD9zsiBQo2FtTpxn6w2cpXCcb-7#scrollTo=ZeA_LQJ3cGbL&uniqifier=1

Essentially, load_checkpoint_and_dispatch does not seem to work when the device map contains no disk or cpu entries, only the GPU devices 0 and 1.

The device map:

{'decoder.embed_tokens': 0,
 'decoder.embed_positions': 0,
 'decoder.layers.0': 0,
 'decoder.layers.1': 0,
 'decoder.layers.2': 0,
...
 'decoder.layers.27': 1,
 'decoder.layers.28': 1,
 'decoder.layers.29': 1,
 'decoder.layers.30': 1,
 'decoder.layers.31': 1}

Code for reproducing the issue:

from huggingface_hub import snapshot_download

checkpoint = 'facebook/opt-2.7b'
weights_path = snapshot_download(checkpoint)
import os
files = os.listdir(weights_path)
weights_path = os.path.join(weights_path, 'pytorch_model.bin') if 'pytorch_model.bin' in files else weights_path

from accelerate import init_empty_weights, dispatch_model, infer_auto_device_map, load_checkpoint_and_dispatch, load_checkpoint_in_model
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
config = AutoConfig.from_pretrained(checkpoint)

with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)
model.tie_weights()

max_mem = 4686198491 # ~4.4 GiB per GPU

device_map = infer_auto_device_map(
    model.model, 
    max_memory={0: max_mem, 1: max_mem},
    no_split_module_classes=["OPTDecoderLayer"], 
    dtype='float16'
)

print(device_map)

load_checkpoint_and_dispatch(
    model.model, 
    weights_path, 
    device_map=device_map, 
    offload_folder=None, 
    dtype='float16', 
    offload_state_dict=True
)
model.tie_weights()

inputs = tokenizer("Hugging Face is pushing the convention that a unicorn with two horns becomes a llama.", return_tensors="pt")
output = model.generate(inputs["input_ids"].to(0), max_length=50, do_sample=True)

print(tokenizer.decode(output[0].tolist()))


Top GitHub Comments

12 reactions
sgugger commented, May 12, 2022

Thanks, I can reproduce indeed. This isn’t a bug in Accelerate but comes from the workaround needed because:

  • the checkpoint on the Hub contains the weights of the base model
  • we are trying to load them in the model for causal LM

That’s why you have those model.model in this example.
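As a minimal sketch of why this matters (assuming the OPT class layout in transformers, where OPTForCausalLM wraps an OPTModel plus a tied lm_head), the hierarchy below is the reason the device map printed above only contains decoder.* keys and no lm_head:

from accelerate import init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained('facebook/opt-2.7b')
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

print(type(model).__name__)                # OPTForCausalLM: base model plus lm_head
print(type(model.model).__name__)          # OPTModel: the base-model weights stored on the Hub
print(type(model.model.decoder).__name__)  # OPTDecoder: source of the decoder.* keys above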

This will all be simplified in the coming weeks when we integrate the new tools of Accelerate inside Transformers, but for now, you can fix the issue by replacing the code from load_checkpoint_and_dispatch onward with:

  1. Load the weights in the model:
load_checkpoint_in_model(
    model.model, 
    weights_path, 
    device_map=device_map, 
    offload_folder=None, 
    dtype='float16', 
    offload_state_dict=True
)
model.tie_weights()
  2. Create a device_map for the full model (not model.model):
full_model_device_map = {f"model.{k}": v for k, v in device_map.items()}
full_model_device_map["lm_head"] = 0
dispatch_model(model, device_map=full_model_device_map)
  3. Generate as usual:
inputs = tokenizer("Hugging Face is pushing the convention that a unicorn with two horns becomes a llama.", return_tensors="pt")
output = model.generate(inputs["input_ids"].to(0), max_length=50, do_sample=True)
2 reactions
ccclyu commented, May 19, 2022

Thanks! @sgugger This also works for me when the inference batch size is set to one (as in your code). However, increasing the batch size (i.e. passing multiple inputs) can lead to CUDA OOM. I suppose the reason might be that infer_auto_device_map tries to maximize GPU usage when allocating the parameters. Do you think there is any way to support multi-input inference for a speedup?
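
A minimal sketch of batched generation, not from the original answer, assuming the tokenizer and the dispatched model from the fix above; the example prompts and the left-padding choice are illustrative, and leaving head-room below the per-GPU max_memory budget passed to infer_auto_device_map is one way to keep activations from exhausting memory:

# Assumes `tokenizer` and the dispatched `model` from the steps above.
prompts = [
    "Hugging Face is pushing the convention that a unicorn with two horns becomes a llama.",
    "The quick brown fox jumps over the lazy dog.",
]
tokenizer.padding_side = "left"  # left-pad so generation continues from the real tokens

batch = tokenizer(prompts, return_tensors="pt", padding=True)
outputs = model.generate(
    batch["input_ids"].to(0),
    attention_mask=batch["attention_mask"].to(0),
    max_length=50,
    do_sample=True,
)
for seq in outputs:
    print(tokenizer.decode(seq.tolist()))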
