Flan-T5-XXL generates non-sensical text when load_in_8bit=True
See original GitHub issueSystem Info
transformers
version: 4.25.0.dev0- Platform: Linux-5.10.133±x86_64-with-debian-bullseye-sid
- Python version: 3.7.12
- Huggingface_hub version: 0.11.0
- PyTorch version (GPU?): 1.12.1 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: no
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the
examples
folder (such as GLUE/SQuAD, …) - My own task or dataset (give details below)
Reproduction
Running the English to German example:
from transformers import T5Tokenizer, T5ForConditionalGeneration
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xxl")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xxl", device_map="auto")
input_text = "translate English to German: How old are you?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
produces expected output:
<pad> Wie alt sind Sie?</s>
Loading in 8-bit and running:
from transformers import T5Tokenizer, T5ForConditionalGeneration
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xxl")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xxl", device_map="auto", load_in_8bit=True)
input_text = "translate English to German: How old are you?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
results in output more nonsensical than I’d expect:
<pad> How old are</s>
Expected behavior
I expected close or approximate output between the original output and the 8-bit output. This was the provided INT8 code snippet so expected output to be sensible for task.
Issue Analytics
- State:
- Created 10 months ago
- Comments:23 (21 by maintainers)
Top Results From Across the Web
No results found
Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start Free
Top Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
The main reason the weights are converted to half on DeepSpeed-side is that the kernels are only working with fp16 values. However, we are looking into some of these limitations and will resolve them soon. The other part is that we can quantize from the original bf16 checkpoint and resolve some of the overflow issue due to different data-precision of fp16 vs bf16.
@younesbelkada moving back to this thread:
Yes, happy to. will get that in today or tomorrow.