Flan-T5-XXL generates nonsensical text when load_in_8bit=True

System Info

  • transformers version: 4.25.0.dev0
  • Platform: Linux-5.10.133+-x86_64-with-debian-bullseye-sid
  • Python version: 3.7.12
  • Huggingface_hub version: 0.11.0
  • PyTorch version (GPU?): 1.12.1 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: no

Who can help?

@patrickvonplaten

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

Running the English to German example:

from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load Flan-T5-XXL in its default precision, sharded across available devices
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xxl")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xxl", device_map="auto")

input_text = "translate English to German: How old are you?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))

produces the expected output:

<pad> Wie alt sind Sie?</s>

Loading in 8-bit and running:

from transformers import T5Tokenizer, T5ForConditionalGeneration

# Identical to the example above, except the weights are loaded in 8-bit
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xxl")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xxl", device_map="auto", load_in_8bit=True)

input_text = "translate English to German: How old are you?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))

results in output more nonsensical than I’d expect:

<pad> How old are</s>

Expected behavior

I expected the 8-bit output to be close to the original output. This was the provided INT8 code snippet, so I expected the output to be sensible for the task.
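
A minimal workaround sketch (not from the original report; it assumes you have enough GPU memory for full bf16 weights): since Flan-T5 was pretrained in bf16, loading it with torch_dtype=torch.bfloat16 keeps the model in its native dtype and avoids the fp16 path that int8 inference goes through, as the maintainer comments below discuss:

import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load in bf16, the checkpoint's pretraining dtype, instead of int8.
# Note: bf16 weights for the XXL model need far more GPU memory than int8.
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xxl")
model = T5ForConditionalGeneration.from_pretrained(
    "google/flan-t5-xxl", device_map="auto", torch_dtype=torch.bfloat16
)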

Top GitHub Comments

2 reactions
RezaYazdaniAminabadi commented, Nov 19, 2022

There was one more scaling hack posted here if you want to try it. #14189 (comment)

In general, bf16-pretrained models ought to run under bf16 or fp32 regimes, as fp16 and bf16 are largely incompatible dtypes. Going from fp16 to bf16 is not as bad: you only lose precision, which affects quality. Going the other way risks overflow.
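
A quick sketch of the dtype mismatch described above: bf16 keeps fp32's 8 exponent bits (wide range, coarse precision), while fp16 has only 5 exponent bits and tops out around 65504, so values that bf16 represents comfortably overflow in fp16:

import torch

x = torch.tensor(1e5)  # a magnitude bf16-pretrained models can produce

print(x.to(torch.bfloat16))  # finite, merely rounded (bf16 range is ~fp32 range)
print(x.to(torch.float16))   # inf -- exceeds fp16's ~65504 maximum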

So we should raise this question with the BNB and perhaps the deepspeed-inference developers, at least to understand why both require fp16 and don't support bf16.

@TimDettmers, @RezaYazdaniAminabadi - is there a way for your libraries to work with bf16 dtype, so that the bf16-pretrained models won’t overflow during inference? Thank you.

The main reason the weights are converted to half on the DeepSpeed side is that the kernels only work with fp16 values. However, we are looking into some of these limitations and will resolve them soon. The other part is that we can quantize from the original bf16 checkpoint and resolve some of the overflow issues caused by the different data precision of fp16 vs bf16.
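
A toy illustration of that last point, using a simplified symmetric absmax int8 scheme (not DeepSpeed's or bitsandbytes' actual quantizer): quantizing directly from the bf16 checkpoint keeps an outlier weight finite, whereas casting bf16 to fp16 first turns it into inf and ruins the quantization scale:

import torch

def absmax_quantize(w):
    # Simplified symmetric int8 quantization: scale by the largest magnitude.
    scale = w.abs().max() / 127.0
    return torch.round(w / scale).to(torch.int8), scale

w_bf16 = torch.tensor([70000.0, -3.0, 0.5], dtype=torch.bfloat16)

# Casting bf16 -> fp16 first overflows the outlier, so an absmax scale
# computed from the fp16 copy would be inf.
print(w_bf16.to(torch.float16))  # the outlier becomes inf

# Quantizing from the original (non-fp16) values stays finite.
q, scale = absmax_quantize(w_bf16.float())
print(q, scale)  # q = tensor([127, 0, 0], dtype=torch.int8)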

1 reaction
larsmennen commented, Dec 12, 2022

@younesbelkada moving back to this thread:

Would you mind opening a PR addressing your suggestions (patch 1 & 2 from the discussion at https://github.com/huggingface/transformers/issues/20287)?

Yes, happy to. Will get that in today or tomorrow.
