
load_checkpoint_and_dispatch "Expected all tensors to be on the same device" for > 1 GPU devices

Hello all!

I am reporting an issue with the example posted here: https://colab.research.google.com/drive/14wnxMvD9zsiBQo2FtTpxn6w2cpXCcb-7#scrollTo=ZeA_LQJ3cGbL&uniqifier=1

Essentially, load_checkpoint_and_dispatch does not seem to work when the device map contains no disk or cpu entries, only the GPU devices 0 and 1.

The device map:

{'decoder.embed_tokens': 0,
 'decoder.embed_positions': 0,
 'decoder.layers.0': 0,
 'decoder.layers.1': 0,
 'decoder.layers.2': 0,
...
 'decoder.layers.27': 1,
 'decoder.layers.28': 1,
 'decoder.layers.29': 1,
 'decoder.layers.30': 1,
 'decoder.layers.31': 1}

Code for reproducing the issue:

from huggingface_hub import snapshot_download

checkpoint = 'facebook/opt-2.7b'
weights_path = snapshot_download(checkpoint)
import os
files = os.listdir(weights_path)
weights_path = os.path.join(weights_path, 'pytorch_model.bin') if 'pytorch_model.bin' in files else weights_path

from accelerate import init_empty_weights, dispatch_model, infer_auto_device_map, load_checkpoint_and_dispatch, load_checkpoint_in_model
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
config = AutoConfig.from_pretrained(checkpoint)

with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)
model.tie_weights()

max_mem = 4686198491 # ~4.4 GiB per GPU

device_map = infer_auto_device_map(
    model.model, 
    max_memory={0: max_mem, 1: max_mem},
    no_split_module_classes=["OPTDecoderLayer"], 
    dtype='float16'
)

print(device_map)

load_checkpoint_and_dispatch(
    model.model, 
    weights_path, 
    device_map=device_map, 
    offload_folder=None, 
    dtype='float16', 
    offload_state_dict=True
)
model.tie_weights()

inputs = tokenizer("Hugging Face is pushing the convention that a unicorn with two horns becomes a llama.", return_tensors="pt")
output = model.generate(inputs["input_ids"].to(0), max_length=50, do_sample=True)

print(tokenizer.decode(output[0].tolist()))


Top GitHub Comments

12 reactions
sgugger commented, May 12, 2022

Thanks, I can reproduce indeed. This isn’t a bug in Accelerate but comes from the workaround needed because:

  • the checkpoint on the Hub contains the weights of the base model
  • we are trying to load them in the model for causal LM

That’s why you have those model.model in this example.
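As a minimal sketch of why this matters (assuming the OPT class layout in transformers, where OPTForCausalLM wraps an OPTModel plus a tied lm_head), the hierarchy below is the reason the device map printed above only contains decoder.* keys and no lm_head:

from accelerate import init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained('facebook/opt-2.7b')
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

print(type(model).__name__)                # OPTForCausalLM: base model plus lm_head
print(type(model.model).__name__)          # OPTModel: the base-model weights stored on the Hub
print(type(model.model.decoder).__name__)  # OPTDecoder: source of the decoder.* keys above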

This will all be simplified in the coming weeks when we integrate the new tools of Accelerate inside Transformers, but for now, you can fix the issue by replacing the code from load_checkpoint_and_dispatch onward with:

  1. Load the weights in the model:
load_checkpoint_in_model(
    model.model, 
    weights_path, 
    device_map=device_map, 
    offload_folder=None, 
    dtype='float16', 
    offload_state_dict=True
)
model.tie_weights()
  2. Create a device_map for the full model (not model.model):
full_model_device_map = {f"model.{k}": v for k, v in device_map.items()}
full_model_device_map["lm_head"] = 0
dispatch_model(model, device_map=full_model_device_map)
  3. Generate as usual:
inputs = tokenizer("Hugging Face is pushing the convention that a unicorn with two horns becomes a llama.", return_tensors="pt")
output = model.generate(inputs["input_ids"].to(0), max_length=50, do_sample=True)
2 reactions
ccclyu commented, May 19, 2022

Thanks! @sgugger This also works for me when the inference batch size is set to one (as in your code). However, increasing the batch size (i.e. passing multiple inputs) can lead to CUDA OOM. I suppose the reason might be that infer_auto_device_map tries to maximize GPU usage when allocating the parameters. Do you think there is any way to support multi-input inference for a speedup?
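
A minimal sketch of batched generation, not from the original answer, assuming the tokenizer and the dispatched model from the fix above; the example prompts and the left-padding choice are illustrative, and leaving head-room below the per-GPU max_memory budget passed to infer_auto_device_map is one way to keep activations from exhausting memory:

# Assumes `tokenizer` and the dispatched `model` from the steps above.
prompts = [
    "Hugging Face is pushing the convention that a unicorn with two horns becomes a llama.",
    "The quick brown fox jumps over the lazy dog.",
]
tokenizer.padding_side = "left"  # left-pad so generation continues from the real tokens

batch = tokenizer(prompts, return_tensors="pt", padding=True)
outputs = model.generate(
    batch["input_ids"].to(0),
    attention_mask=batch["attention_mask"].to(0),
    max_length=50,
    do_sample=True,
)
for seq in outputs:
    print(tokenizer.decode(seq.tolist()))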
