[INT8] BLOOM series model loading back issue
System Info
8x A100 GPUs with CUDA 11.3 driver
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
Use the following script to save an INT8-quantized model, then try to load it back.
import os
import torch
import logging
import math
from transformers import AutoConfig, pipeline, AutoModelForCausalLM, AutoTokenizer
def get_max_memory_per_gpu_dict(dtype, model_name):
    """Try to generate the memory map based on what we know about the model and the available hardware."""
    # figure out the memory map - the minimum per gpu required to load the model
    n_gpus = torch.cuda.device_count()
    try:
        # model_params calculation, as we don't have a model yet to do:
        # model_params = sum(dict((p.data_ptr(), p.numel()) for p in model.parameters()).values())
        config = AutoConfig.from_pretrained(model_name)
        h = config.hidden_size
        l = config.n_layer
        v = config.vocab_size
        # from https://github.com/bigscience-workshop/bigscience/tree/6917a3b5fefcf439d3485ca184b4d9f6ab605150/math#model-sizing
        model_params = l * (12 * h ** 2 + 13 * h) + v * h + 4 * h
    except Exception:
        logging.info(f"The model {model_name} has a broken config file. Please notify the owner")
        raise
    # int8 has no torch.finfo entry, so its size is set by hand
    if dtype == torch.int8:
        bytes_per_param = 1
    else:
        bytes_per_param = torch.finfo(dtype).bits / 8
    param_memory_total_in_bytes = model_params * bytes_per_param
    # add 10% since weight sizes aren't the same and some GPU may need more memory
    param_memory_per_gpu_in_bytes = int(param_memory_total_in_bytes / n_gpus * 1.10)
    logging.info(f"Estimating {param_memory_per_gpu_in_bytes / 2 ** 30:0.2f}GB per gpu for weights")
    # check the real available memory
    # load cuda kernels first and only measure the real free memory after loading (shorter by ~2GB)
    torch.ones(1).cuda()
    max_memory_per_gpu_in_bytes = torch.cuda.mem_get_info(0)[0]
    if max_memory_per_gpu_in_bytes < param_memory_per_gpu_in_bytes:
        raise ValueError(
            f"Unable to generate the memory map automatically as the needed estimated memory per gpu ({param_memory_per_gpu_in_bytes / 2 ** 30:0.2f}GB) is bigger than the available per gpu memory ({max_memory_per_gpu_in_bytes / 2 ** 30:0.2f}GB)"
        )
    max_memory_per_gpu = {i: param_memory_per_gpu_in_bytes for i in range(torch.cuda.device_count())}
    print("Max memory per gpu:", max_memory_per_gpu)
    return max_memory_per_gpu
def load_model():
    world_size = torch.cuda.device_count()
    model_name = "bigscience/bloom"
    logging.info(f"Using {world_size} gpus")
    logging.info(f"Loading model {model_name}")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # dtype is only used for the memory estimate; quantization is requested via load_in_8bit
    dtype = torch.int8
    kwargs = dict(
        device_map="auto",
        max_memory=get_max_memory_per_gpu_dict(dtype, model_name),
    )
    logging.info("Using `load_in_8bit=True` to use quantized model")
    kwargs["load_in_8bit"] = True
    model = AutoModelForCausalLM.from_pretrained(model_name, **kwargs)
    return model, tokenizer
model, tokenizer = load_model()
model.save_pretrained("int8_model/", max_shard_size="8GB")
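For reference, the sizing arithmetic in the script can be checked on its own, without a GPU. The sketch below reuses the formula and the 1.10 margin from the script above; the config values (hidden_size 14336, n_layer 70, vocab_size 250880) are what I believe bigscience/bloom publishes and are hard-coded here as an assumption, and the helper names are made up for illustration.

```python
def estimate_params(hidden_size, n_layer, vocab_size):
    """Parameter-count estimate, same formula as in the script above."""
    h, l, v = hidden_size, n_layer, vocab_size
    return l * (12 * h ** 2 + 13 * h) + v * h + 4 * h


def per_gpu_bytes(n_params, bytes_per_param, n_gpus, margin=1.10):
    """Weight memory per GPU, with the same 10% margin the script applies."""
    return int(n_params * bytes_per_param / n_gpus * margin)


# Assumed config values for bigscience/bloom (not read from the hub here).
params = estimate_params(hidden_size=14336, n_layer=70, vocab_size=250880)
needed = per_gpu_bytes(params, bytes_per_param=1, n_gpus=8)

print(f"{params:,} parameters")
print(f"{needed / 2 ** 30:0.2f} GiB per GPU for int8 weights on 8 GPUs")
```

Under these assumptions the int8 weights come to roughly 22 to 23 GiB per GPU, which fits on either A100 memory size; the thread does not say which variant was used.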
Loading from the saved int8_model/ directory fails during model initialization with the following error:
RuntimeError: Only Tensors of floating point dtype can require gradients
The script used to load it back:
import torch
import torch.distributed as dist
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer
model_name = 'int8_model/'
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.int8)
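The same kind of RuntimeError can be triggered in plain PyTorch, which may help narrow down where it comes from. The sketch below is a minimal illustration under the assumption that the failure involves wrapping an integer tensor in a parameter that requires gradients; the thread contains no traceback, so this is not confirmed to be the exact code path in transformers.

```python
import torch

int8_weights = torch.zeros(2, 2, dtype=torch.int8)

# nn.Parameter defaults to requires_grad=True, which integer tensors cannot support.
try:
    torch.nn.Parameter(int8_weights)
    error_message = None
except RuntimeError as exc:
    error_message = str(exc)

print(error_message)

# With gradients disabled, an int8 parameter can be created.
frozen = torch.nn.Parameter(int8_weights, requires_grad=False)
print(frozen.dtype, frozen.requires_grad)
```

The exact wording of the message can differ between PyTorch versions.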
Expected behavior
Loading should succeed. I am looking for a workaround.
Issue Analytics
- State:
- Created a year ago
- Comments:5 (3 by maintainers)
Top GitHub Comments
It’s hard to know which part fails without the whole traceback. I suspect it’s when we set the default dtype to torch_dtype, which only works for floating dtypes. If that’s the case, there is probably a workaround possible by only setting the default dtype when the torch_dtype passed is a floating type.
Also cc @younesbelkada since it’s related to int8 format.
Hi all - I’m also trying to save and re-load a BLOOM model in 8-bit format, see https://github.com/TimDettmers/bitsandbytes/issues/80.
I’m quite new to the topic and not sure I’m able to follow everything @younesbelkada mentioned, but my understanding is that this is not possible yet, is that correct?