
Support fp16 for inference

See original GitHub issue

🚀 Feature request - support fp16 inference

Right now most models support mixed precision for training, but not for inference. Naively calling model = model.half() makes the model generate junk instead of valid results for text generation, even though mixed precision works fine during training.

If there’s a way to make the model behave stably at 16-bit precision during inference, throughput could potentially double on most modern GPUs.
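For illustration, here is a minimal sketch of the two approaches under discussion, assuming a Hugging Face causal language model; the gpt2 checkpoint, prompt, and generation length are placeholders, not taken from the issue. It contrasts the naive full cast via model.half(), which this issue reports as producing junk, with wrapping generation in torch.cuda.amp.autocast, which keeps numerically sensitive ops (e.g. softmax, layer norm) in fp32.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")

# Naive approach: cast every weight to fp16 before generating.
model_fp16 = AutoModelForCausalLM.from_pretrained("gpt2").half().cuda().eval()
with torch.no_grad():
    out_naive = model_fp16.generate(**inputs, max_length=32)

# Mixed-precision approach: keep fp32 weights, run eligible ops in fp16 under autocast.
model_fp32 = AutoModelForCausalLM.from_pretrained("gpt2").cuda().eval()
with torch.no_grad(), torch.cuda.amp.autocast():
    out_amp = model_fp32.generate(**inputs, max_length=32)

print(tokenizer.decode(out_naive[0], skip_special_tokens=True))
print(tokenizer.decode(out_amp[0], skip_special_tokens=True))

Whether autocast alone is enough to get stable generation is exactly what the comments below discuss.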

Motivation

Doubling the speed is always attractive, especially since transformers are compute-intensive.

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 4
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

13 reactions
JulesGM commented, Oct 1, 2021

Has this been solved?

2 reactions
johncookds commented, Dec 29, 2020

Hi, I’ve noticed the same issue of the model randomly generating junk when using autocast within a custom generate() method, with the only change being the block below (fp16 is a boolean). From the comments above, I thought this approach should have worked.

if fp16:
    # Run the forward pass under autocast so eligible ops execute in fp16.
    with torch.cuda.amp.autocast():
        outputs = self(**model_inputs)
else:
    # Full-precision forward pass.
    outputs = self(**model_inputs)

The model I’ve currently tested this on is a Hugging Face GPT-2 model fine-tuned on a personal dataset. Without fp16, generate works perfectly. The dataset is very specific and the model is supposed to generate symbols and numbers, so it’s obvious when it starts spitting out words during fp16 inference.
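As a minor aside on the snippet above (it does not address the junk output itself), torch.cuda.amp.autocast accepts an enabled flag, so the duplicated branch can be collapsed; this sketch assumes the same context as the original, i.e. fp16 is a boolean and self(**model_inputs) is the model forward call.

# Equivalent to the if/else above: autocast is a no-op when enabled=False.
with torch.cuda.amp.autocast(enabled=fp16):
    outputs = self(**model_inputs)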

Read more comments on GitHub >

Top Results From Across the Web

Memory and speed
We present some techniques and ideas to optimize Diffusers inference for memory or ... NVIDIA cuDNN supports many algorithms to compute a convolution....
Read more >
Time of inference in FP16 and FP32 is the same - Jetson TX2
Using a TX2 NX to build and run a TRT engine. I have made my onnx model and that is being converted into...
Read more >
Arm NN for GPU inference FP16 and FastMath
As we can see, using FP16 format for inference halves the amount of memory and bandwidth while doubling the performance, that is, more ...
Read more >
Training vs Inference - Numerical Precision
However, modern CPUs and GPUs support various floating-point data types ... When using FP16 for training, memory requirements are reduced by ...
Read more >
python - fp16 inference on cpu Pytorch
As I know, a lot of CPU-based operations in Pytorch are not implemented to support FP16; instead, it's NVIDIA GPUs that have hardware ...
Read more >
