Support fp16 for inference
🚀 Feature request - support fp16 inference
Right now most models support mixed precision for training, but not for inference. Naively calling model = model.half()
makes the model generate junk instead of valid results for text generation, even though mixed precision works fine during training.
If there is a way to make the model behave stably at 16-bit precision during inference, throughput could roughly double on most modern GPUs.
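As a rough illustration of the naive path described above, here is a minimal sketch that casts a model to fp16 with half() and then calls generate(); the GPT-2 checkpoint, prompt, and CUDA device are illustrative assumptions, not details from the original report.

```python
# Minimal sketch of naive fp16 inference, assuming a CUDA GPU and a
# stock GPT-2 checkpoint (both illustrative, not from the issue).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").cuda()
model = model.half()  # cast all weights to fp16; this is the step that can destabilize generation
model.eval()

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_length=30)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```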
Motivation
Doubling the speed is always attractive, especially since transformers are compute-intensive.
Issue Analytics
- Created: 3 years ago
- Reactions: 4
- Comments: 6 (2 by maintainers)
Has this been solved?
Hi, I've noticed the same issue: the model randomly generates junk when using autocast inside a custom generate() method, with the only change being the one below (fp16 is a boolean). From the comments above I thought this approach should have worked.
The model I've tested it on is a Hugging Face GPT-2 model fine-tuned on a personal dataset. Without fp16, generation works perfectly. The dataset is very specific and the model is supposed to generate symbols and numbers, so it's obvious when it starts spitting out words during fp16 inference.
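The commenter's actual change is not shown here, so the snippet below is only a guess at what such a modification might look like: a custom generation helper that wraps the generate() call in torch.cuda.amp.autocast gated on an fp16 flag. The helper name, model, prompt, and generation settings are hypothetical, added purely for illustration.

```python
# Hypothetical sketch of gating generation on an `fp16` boolean via autocast,
# as the comment describes; the commenter's actual diff is not shown,
# so every detail here is assumed.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

def generate_text(model, tokenizer, prompt, fp16=False, max_length=64):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        # autocast keeps the weights in fp32 and runs selected ops in fp16,
        # unlike model.half(), which casts the weights themselves.
        with torch.cuda.amp.autocast(enabled=fp16):
            output_ids = model.generate(**inputs, max_length=max_length)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

model = GPT2LMHeadModel.from_pretrained("gpt2").cuda().eval()
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
print(generate_text(model, tokenizer, "Hello", fp16=True))
```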