[INT8] BLOOM series model loading back issue

See original GitHub issue

System Info

8x A100 GPUs with CUDA 11.3 driver

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

Use the following script to save an INT8-quantized model and try to load it back.

import os
import torch
import logging
import math
from transformers import AutoConfig, pipeline, AutoModelForCausalLM, AutoTokenizer

def get_max_memory_per_gpu_dict(dtype, model_name):
    """try to generate the memory map based on what we know about the model and the available hardware"""

    # figure out the memory map - the minimum per gpu required to load the model
    n_gpus = torch.cuda.device_count()

    try:
        # model_params calculation, as we don't have a model yet to do:
        # model_params = sum(dict((p.data_ptr(), p.numel()) for p in model.parameters()).values())

        config = AutoConfig.from_pretrained(model_name)
        h = config.hidden_size
        l = config.n_layer
        v = config.vocab_size
        # from https://github.com/bigscience-workshop/bigscience/tree/6917a3b5fefcf439d3485ca184b4d9f6ab605150/math#model-sizing
        model_params = l * (12 * h ** 2 + 13 * h) + v * h + 4 * h
    except:
        logging.info(f"The model {model_name} has a broken config file. Please notify the owner")
        raise

    if dtype == torch.int8:
        bytes = 1
    else:
        bytes = torch.finfo(dtype).bits / 8
    param_memory_total_in_bytes = model_params * bytes
    # add 10% since weight sizes aren't the same and some GPUs may need more memory
    param_memory_per_gpu_in_bytes = int(param_memory_total_in_bytes / n_gpus * 1.10)
    logging.info(f"Estimating {param_memory_per_gpu_in_bytes / 2 ** 30:0.2f}GB per gpu for weights")

    # check the real available memory
    # load cuda kernels first and only measure the real free memory after loading (shorter by ~2GB)
    torch.ones(1).cuda()
    max_memory_per_gpu_in_bytes = torch.cuda.mem_get_info(0)[0]
    if max_memory_per_gpu_in_bytes < param_memory_per_gpu_in_bytes:
        raise ValueError(
            f"Unable to generate the memory map automatically as the needed estimated memory per gpu ({param_memory_per_gpu_in_bytes / 2 ** 30:0.2f}GB) is bigger than the available per gpu memory ({max_memory_per_gpu_in_bytes / 2 ** 30:0.2f}GB)"
        )

    max_memory_per_gpu = {i: param_memory_per_gpu_in_bytes for i in range(torch.cuda.device_count())}
    print("Max memory per gpu:", max_memory_per_gpu)
    return max_memory_per_gpu


def load_model():
    world_size = torch.cuda.device_count()
    model_name = "bigscience/bloom"
    logging.info(f"Using {world_size} gpus")
    logging.info(f"Loading model {model_name}")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    dtype = torch.int8
    kwargs = dict(
        device_map="auto",
        max_memory=get_max_memory_per_gpu_dict(dtype, model_name),
    )
    logging.info("Using `load_in_8bit=True` to use quanitized model")
    kwargs["load_in_8bit"] = True
    model = AutoModelForCausalLM.from_pretrained(model_name, **kwargs)
    return model, tokenizer

model, tokenizer = load_model()

model.save_pretrained("int8_model/", max_shard_size="8GB")
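
For reference, plugging the published bigscience/bloom config values into the formula used by get_max_memory_per_gpu_dict gives a rough sense of the expected footprint (a back-of-the-envelope sketch; the numbers in the comments assume hidden_size=14336, n_layer=70, vocab_size=250880 from the public config):

from transformers import AutoConfig

# Rough check of the int8 memory estimate, reusing the formula from the script above.
config = AutoConfig.from_pretrained("bigscience/bloom")
h, l, v = config.hidden_size, config.n_layer, config.vocab_size   # 14336, 70, 250880
model_params = l * (12 * h ** 2 + 13 * h) + v * h + 4 * h
print(f"{model_params / 1e9:.1f}B parameters")                     # ~176.2B
per_gpu_bytes = int(model_params / 8 * 1.10)                       # int8 = 1 byte/param, 8 GPUs, +10%
print(f"~{per_gpu_bytes / 2 ** 30:.1f} GiB of weights per GPU")    # ~22.6 GiB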

Loading the model back from the directory fails with the following error:

RuntimeError: Only Tensors of floating point dtype can require gradients

The error is raised during the initialization of the model, from this loading script:

import torch
import torch.distributed as dist
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer

model_name = 'int8_model/'
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.int8)
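
The message comes from a general PyTorch constraint: only floating-point (and complex) tensors can have requires_grad=True. A minimal standalone illustration of that rule, independent of transformers:

import torch
import torch.nn as nn

# nn.Parameter defaults to requires_grad=True, which PyTorch rejects for integer
# dtypes, so any loading path that materializes parameters directly in int8 fails.
try:
    nn.Parameter(torch.empty(4, dtype=torch.int8))
except RuntimeError as e:
    print(e)  # "Only Tensors of floating point ... dtype can require gradients"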

Expected behavior

Loading the saved INT8 checkpoint should succeed. Looking for a workaround in the meantime…
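
One possible interim approach, sketched here under the assumption that re-quantizing on every load is acceptable, is to skip the saved INT8 checkpoint entirely and let bitsandbytes quantize the original checkpoint again at load time, so no integer torch_dtype is ever passed:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch of a fallback: reload from the original checkpoint and re-quantize to int8
# during loading instead of passing torch_dtype=torch.int8 on the saved weights.
model_name = "bigscience/bloom"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    load_in_8bit=True,
)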

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
sgugger commented, Oct 14, 2022

It’s hard to know which part fails without the whole traceback. I suspect it’s when we set the default dtype to torch_dtype, which only works for floating dtypes. If that’s the case, there is probably a workaround possible: only set the default dtype when the torch_dtype passed is a floating type.

Also cc @younesbelkada since it’s related to int8 format.
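
A minimal sketch of the guard described above, assuming a simplified loader; the helper name and structure are hypothetical, not the actual transformers code:

import torch

def set_default_dtype_if_floating(torch_dtype):
    # Only floating dtypes are valid defaults; torch.set_default_dtype(torch.int8)
    # raises, so skip the call for integer dtypes and leave the default untouched.
    if torch_dtype is not None and torch_dtype.is_floating_point:
        previous = torch.get_default_dtype()
        torch.set_default_dtype(torch_dtype)
        return previous
    return None

# set_default_dtype_if_floating(torch.float16)  -> default becomes float16
# set_default_dtype_if_floating(torch.int8)     -> no-op, avoids the RuntimeError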

0 reactions
marshmellow77 commented, Nov 12, 2022

Hi all - I’m also trying to save and re-load a BLOOM model in 8-bit format, see https://github.com/TimDettmers/bitsandbytes/issues/80.

I’m quite new to the topic and not sure I’m able to follow everything @younesbelkada mentioned, but my understanding is that this is not possible yet, is that correct?
