
Support fp16 for inference

See original GitHub issue

🚀 Feature request - support fp16 inference

Right now most models support mixed precision for training, but not for inference. Naively calling model = model.half() makes the model generate junk instead of valid results for text generation, even though mixed precision works fine during training.

If there’s a way to make the model behave stably at 16-bit precision during inference, throughput could potentially double on most modern GPUs.
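For illustration, here is a minimal sketch of the two approaches under discussion, assuming a Hugging Face causal language model; the gpt2 checkpoint, prompt, and generation length are placeholders, not taken from the issue. It contrasts the naive full cast via model.half(), which this issue reports as producing junk, with wrapping generation in torch.cuda.amp.autocast, which keeps numerically sensitive ops (e.g. softmax, layer norm) in fp32.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")

# Naive approach: cast every weight to fp16 before generating.
model_fp16 = AutoModelForCausalLM.from_pretrained("gpt2").half().cuda().eval()
with torch.no_grad():
    out_naive = model_fp16.generate(**inputs, max_length=32)

# Mixed-precision approach: keep fp32 weights, run eligible ops in fp16 under autocast.
model_fp32 = AutoModelForCausalLM.from_pretrained("gpt2").cuda().eval()
with torch.no_grad(), torch.cuda.amp.autocast():
    out_amp = model_fp32.generate(**inputs, max_length=32)

print(tokenizer.decode(out_naive[0], skip_special_tokens=True))
print(tokenizer.decode(out_amp[0], skip_special_tokens=True))

Whether autocast alone is enough to get stable generation is exactly what the comments below discuss.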

Motivation

Doubling the speed is always attractive, especially since transformers are compute-intensive.

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 4
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

13 reactions
JulesGM commented, Oct 1, 2021

Has this been solved?

2 reactions
johncookds commented, Dec 29, 2020

Hi, I’ve noticed the same issue of the model randomly generating junk when using autocast within a custom generate() method, with the only change being the block below (fp16 is a boolean). From the comments above, I thought this approach should have worked.

if fp16:
    # Run the forward pass under autocast so eligible ops execute in fp16.
    with torch.cuda.amp.autocast():
        outputs = self(**model_inputs)
else:
    # Full-precision forward pass.
    outputs = self(**model_inputs)

The model I’ve currently tested this on is a Hugging Face GPT-2 model fine-tuned on a personal dataset. Without fp16, generate works perfectly. The dataset is very specific and the model is supposed to generate symbols and numbers, so it’s obvious when it starts spitting out words during fp16 inference.
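As a minor aside on the snippet above (it does not address the junk output itself), torch.cuda.amp.autocast accepts an enabled flag, so the duplicated branch can be collapsed; this sketch assumes the same context as the original, i.e. fp16 is a boolean and self(**model_inputs) is the model forward call.

# Equivalent to the if/else above: autocast is a no-op when enabled=False.
with torch.cuda.amp.autocast(enabled=fp16):
    outputs = self(**model_inputs)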

Read more comments on GitHub >

Top Results From Across the Web

Memory and speed
We present some techniques and ideas to optimize Diffusers inference for memory or ... NVIDIA cuDNN supports many algorithms to compute a convolution....
Read more >
Time of inference in FP16 and FP32 is the same - Jetson TX2
Using a TX2 NX to build and run a TRT engine. I have made my onnx model and that is being converted into...
Read more >
Arm NN for GPU inference FP16 and FastMath
As we can see, using FP16 format for inference halves the amount of memory and bandwidth while doubling the performance, that is, more ...
Read more >
Training vs Inference - Numerical Precision
However, modern CPUs and GPUs support various floating-point data types ... When using FP16 for training, memory requirements are reduced by ...
Read more >
python - fp16 inference on cpu Pytorch
As I know, a lot of CPU-based operations in Pytorch are not implemented to support FP16; instead, it's NVIDIA GPUs that have hardware ...
Read more >
