🚀 Feature request

I would like to use BART in FP16 mode, but it seems impossible for now:

from transformers import AutoModelWithLMHead, AutoTokenizer, BartConfig

# Load BART, move it to the GPU, and cast the weights to FP16
config = BartConfig(vocab_size=50264, output_past=True)
model = AutoModelWithLMHead.from_pretrained('bart-large-cnn', config=config).cuda().half()
tokenizer = AutoTokenizer.from_pretrained('bart-large-cnn')
ARTICLE_TO_SUMMARIZE = "My friends are cool but they eat too many carbs."
inputs = tokenizer.batch_encode_plus([ARTICLE_TO_SUMMARIZE], max_length=1024, return_tensors='pt')
# Beam-search generation; this is the call that raises the error below
generated_ids = model.generate(inputs['input_ids'].cuda(), attention_mask=inputs['attention_mask'].cuda(), num_beams=4, max_length=5)

File "/data/user/.venv/bartqg/lib/python3.6/site-packages/transformers/modeling_bart.py", line 647, in forward
    attn_output = torch.bmm(attn_probs, v)
RuntimeError: Expected object of scalar type Float but got scalar type Half for argument #2 'mat2' in call to _th_bmm

@sshleifer Do you plan to implement an FP16-friendly version of BART?

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Reactions: 1
  • Comments: 8 (7 by maintainers)

Top GitHub Comments

2 reactions
thomwolf commented, Mar 5, 2020

This should not be closed indeed.

@sshleifer, we intend all the models to be compatible with FP16. This is the direction the field is going, and with Volta-level GPUs now widespread, there is less and less reason not to use mixed-precision fine-tuning (half the memory and significantly faster).
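
To make the "half the memory" point concrete, here is a minimal sketch that uses only Python's standard library. It relies on the `struct` format code `e`, which is IEEE 754 half precision (FP16). The helper names `to_fp16` and `fits_in_fp16` and the parameter count are hypothetical, chosen only for illustration. Nothing here touches BART or PyTorch, and it says nothing about speed.

```python
import struct

# Storage size in bytes of a single value in each format.
FP16_BYTES = struct.calcsize("<e")  # IEEE 754 binary16 (half precision)
FP32_BYTES = struct.calcsize("<f")  # IEEE 754 binary32 (single precision)


def to_fp16(x: float) -> float:
    """Round x to the nearest FP16 value and return it as a Python float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def fits_in_fp16(x: float) -> bool:
    """Return True if x can be stored in FP16 without overflowing."""
    try:
        struct.pack("<e", x)
        return True
    except OverflowError:
        return False


# Hypothetical parameter count, used only to compare raw storage sizes.
n_params = 1_000_000
print(n_params * FP32_BYTES, "bytes in FP32")
print(n_params * FP16_BYTES, "bytes in FP16")

# FP16 keeps fewer significant digits and has a much smaller range.
print(to_fp16(0.1))
print(fits_in_fp16(65504.0), fits_in_fp16(70000.0))
```

The same trade-off is why mixed-precision setups usually keep some numerically sensitive operations in FP32: values above the FP16 maximum of 65504 cannot be represented, and small differences are rounded away.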

1 reaction
sshleifer commented, Mar 5, 2020

Yep, on it!


