[INT8] BLOOM series model loading back issue

See original GitHub issue

System Info

8x A100 GPUs with CUDA 11.3 driver

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

Use the following script to save an INT8-quantized model and try to load it back.

import os
import torch
import logging
import math
from transformers import AutoConfig, pipeline, AutoModelForCausalLM, AutoTokenizer

def get_max_memory_per_gpu_dict(dtype, model_name):
    """try to generate the memory map based on what we know about the model and the available hardware"""

    # figure out the memory map - the minimum per gpu required to load the model
    n_gpus = torch.cuda.device_count()

    try:
        # model_params calculation, as we don't have a model yet to do:
        # model_params = sum(dict((p.data_ptr(), p.numel()) for p in model.parameters()).values())

        config = AutoConfig.from_pretrained(model_name)
        h = config.hidden_size
        l = config.n_layer
        v = config.vocab_size
        # from https://github.com/bigscience-workshop/bigscience/tree/6917a3b5fefcf439d3485ca184b4d9f6ab605150/math#model-sizing
        model_params = l * (12 * h ** 2 + 13 * h) + v * h + 4 * h
    except:
        logging.info(f"The model {model_name} has a broken config file. Please notify the owner")
        raise

    if dtype == torch.int8:
        bytes = 1
    else:
        bytes = torch.finfo(dtype).bits / 8
    param_memory_total_in_bytes = model_params * bytes
    # add 10% since weight sizes aren't the same and some GPUs may need more memory
    param_memory_per_gpu_in_bytes = int(param_memory_total_in_bytes / n_gpus * 1.10)
    logging.info(f"Estimating {param_memory_per_gpu_in_bytes / 2 ** 30:0.2f}GB per gpu for weights")

    # check the real available memory
    # load cuda kernels first and only measure the real free memory after loading (shorter by ~2GB)
    torch.ones(1).cuda()
    max_memory_per_gpu_in_bytes = torch.cuda.mem_get_info(0)[0]
    if max_memory_per_gpu_in_bytes < param_memory_per_gpu_in_bytes:
        raise ValueError(
            f"Unable to generate the memory map automatically as the needed estimated memory per gpu ({param_memory_per_gpu_in_bytes / 2 ** 30:0.2f}GB) is bigger than the available per gpu memory ({max_memory_per_gpu_in_bytes / 2 ** 30:0.2f}GB)"
        )

    max_memory_per_gpu = {i: param_memory_per_gpu_in_bytes for i in range(torch.cuda.device_count())}
    print("Max memory per gpu:", max_memory_per_gpu)
    return max_memory_per_gpu


def load_model():
    world_size = torch.cuda.device_count()
    model_name = "bigscience/bloom"
    logging.info(f"Using {world_size} gpus")
    logging.info(f"Loading model {model_name}")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    dtype = torch.int8
    kwargs = dict(
        device_map="auto",
        max_memory=get_max_memory_per_gpu_dict(dtype, model_name),
    )
    logging.info("Using `load_in_8bit=True` to use quanitized model")
    kwargs["load_in_8bit"] = True
    model = AutoModelForCausalLM.from_pretrained(model_name, **kwargs)
    return model, tokenizer

model, tokenizer = load_model()

model.save_pretrained("int8_model/", max_shard_size="8GB")
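
For reference, plugging the published bigscience/bloom config values into the formula used by get_max_memory_per_gpu_dict gives a rough sense of the expected footprint (a back-of-the-envelope sketch; the numbers in the comments assume hidden_size=14336, n_layer=70, vocab_size=250880 from the public config):

from transformers import AutoConfig

# Rough check of the int8 memory estimate, reusing the formula from the script above.
config = AutoConfig.from_pretrained("bigscience/bloom")
h, l, v = config.hidden_size, config.n_layer, config.vocab_size   # 14336, 70, 250880
model_params = l * (12 * h ** 2 + 13 * h) + v * h + 4 * h
print(f"{model_params / 1e9:.1f}B parameters")                     # ~176.2B
per_gpu_bytes = int(model_params / 8 * 1.10)                       # int8 = 1 byte/param, 8 GPUs, +10%
print(f"~{per_gpu_bytes / 2 ** 30:.1f} GiB of weights per GPU")    # ~22.6 GiB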

Loading the model back from the directory fails with the following error:

RuntimeError: Only Tensors of floating point dtype can require gradients

The error is raised during the initialization of the model, from this loading script:

import torch
import torch.distributed as dist
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer

model_name = 'int8_model/'
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.int8)
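
The message comes from a general PyTorch constraint: only floating-point (and complex) tensors can have requires_grad=True. A minimal standalone illustration of that rule, independent of transformers:

import torch
import torch.nn as nn

# nn.Parameter defaults to requires_grad=True, which PyTorch rejects for integer
# dtypes, so any loading path that materializes parameters directly in int8 fails.
try:
    nn.Parameter(torch.empty(4, dtype=torch.int8))
except RuntimeError as e:
    print(e)  # "Only Tensors of floating point ... dtype can require gradients"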

Expected behavior

Loading the saved INT8 checkpoint should succeed. Looking for a workaround in the meantime…
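
One possible interim approach, sketched here under the assumption that re-quantizing on every load is acceptable, is to skip the saved INT8 checkpoint entirely and let bitsandbytes quantize the original checkpoint again at load time, so no integer torch_dtype is ever passed:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch of a fallback: reload from the original checkpoint and re-quantize to int8
# during loading instead of passing torch_dtype=torch.int8 on the saved weights.
model_name = "bigscience/bloom"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    load_in_8bit=True,
)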

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
sgugger commented, Oct 14, 2022

It’s hard to know which part fails without the whole traceback. I suspect it’s when we set the default dtype to torch_dtype, which only works for floating dtypes. If that’s the case, there is probably a workaround possible: only set the default dtype when the torch_dtype passed is a floating type.

Also cc @younesbelkada since it’s related to int8 format.
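
A minimal sketch of the guard described above, assuming a simplified loader; the helper name and structure are hypothetical, not the actual transformers code:

import torch

def set_default_dtype_if_floating(torch_dtype):
    # Only floating dtypes are valid defaults; torch.set_default_dtype(torch.int8)
    # raises, so skip the call for integer dtypes and leave the default untouched.
    if torch_dtype is not None and torch_dtype.is_floating_point:
        previous = torch.get_default_dtype()
        torch.set_default_dtype(torch_dtype)
        return previous
    return None

# set_default_dtype_if_floating(torch.float16)  -> default becomes float16
# set_default_dtype_if_floating(torch.int8)     -> no-op, avoids the RuntimeError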

0 reactions
marshmellow77 commented, Nov 12, 2022

Hi all - I’m also trying to save and re-load a BLOOM model in 8-bit format, see https://github.com/TimDettmers/bitsandbytes/issues/80.

I’m quite new to the topic and not sure I’m able to follow everything @younesbelkada mentioned, but my understanding is that this is not possible yet, is that correct?
