Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Flan-T5-XXL generates non-sensical text when load_in_8bit=True

See original GitHub issue

System Info

transformers version: 4.25.0.dev0
Platform: Linux-5.10.133±x86_64-with-debian-bullseye-sid
Python version: 3.7.12
Huggingface_hub version: 0.11.0
PyTorch version (GPU?): 1.12.1 (True)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using GPU in script?: yes
Using distributed or parallel set-up in script?: no

Who can help?

@patrickvonplaten

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, …)
My own task or dataset (give details below)

Reproduction

Running the English to German example:

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xxl")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xxl", device_map="auto")

input_text = "translate English to German: How old are you?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))

produces expected output:

<pad> Wie alt sind Sie?</s>

Loading in 8-bit and running:

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xxl")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xxl", device_map="auto", load_in_8bit=True)

input_text = "translate English to German: How old are you?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))

results in output more nonsensical than I’d expect:

<pad> How old are</s>

Expected behavior

I expected close or approximate output between the original output and the 8-bit output. This was the provided INT8 code snippet so expected output to be sensible for task.

Issue Analytics

State:
Created 10 months ago
Comments:23 (21 by maintainers)

Top GitHub Comments

2reactions

RezaYazdaniAminabadicommented, Nov 19, 2022

There was one more scaling hack posted here if you want to try it. #14189 (comment)

In general bf16-pretrained models ought to run under bf16 or fp32 regimes, as fp16 and bf16 are very incompatible dtypes. It’s not as bad if you were to go from fp16 to bf16 as you’d only lose precision, and it’d only impact quality, but not the other way around (overflow).

So we should raise this question with the BNB and perhaps deepspeed-inference developers, at least to have an understanding of why both require fp16 and won’t support bf16.

@TimDettmers, @RezaYazdaniAminabadi - is there a way for your libraries to work with bf16 dtype, so that the bf16-pretrained models won’t overflow during inference? Thank you.

The main reason the weights are converted to half on DeepSpeed-side is that the kernels are only working with fp16 values. However, we are looking into some of these limitations and will resolve them soon. The other part is that we can quantize from the original bf16 checkpoint and resolve some of the overflow issue due to different data-precision of fp16 vs bf16.

1reaction

larsmennencommented, Dec 12, 2022

@younesbelkada moving back to this thread:

Would you mind opening a PR addressing your suggestions (patch 1 & 2 from the discussion at https://github.com/huggingface/transformers/issues/20287 )?

Yes, happy to. will get that in today or tomorrow.