Using FP16_Optimizer does not speed things up much
I ran the following scripts and compared their logs.
fp32 training:
python main_fp16_optimizer.py /workspace/data/imagenet
and fp16 mixed-precision training:
python main_fp16_optimizer.py /workspace/data/imagenet --fp16
Here are their logs. fp32 training:
Epoch: [0][10/1563] Time 0.211 (0.507) Speed 151.834 (63.162) Data 0.001 (0.075) Loss 7.0819 (7.0585) Prec@1 0.000 (0.000) Prec@5 0.000 (0.000)
and fp16 mixed-precision training:
Epoch: [0][10/1563] Time 0.220 (0.530) Speed 145.334 (60.358) Data 0.001 (0.068) Loss 7.1602 (7.0614) Prec@1 0.000 (0.852) Prec@5 0.000 (1.136)
It’s easy to see that the mixed-precision version isn’t much faster. Is something wrong?
By the way, I’m using a single GPU. Thanks.
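For context, my understanding of what the --fp16 path in that example does, as a minimal sketch (exact arguments in main_fp16_optimizer.py may differ; the model here is a stand-in for ResNet):

```python
import torch
from apex.fp16_utils import FP16_Optimizer, network_to_half

model = torch.nn.Linear(1024, 1000).cuda()  # stand-in for the real network
model = network_to_half(model)              # casts params/activations to FP16

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# FP16_Optimizer keeps FP32 master weights and handles loss scaling.
optimizer = FP16_Optimizer(optimizer, dynamic_loss_scale=True)

inputs = torch.randn(64, 1024).cuda()       # the wrapper casts input to half
loss = model(inputs).float().mean()         # compute the loss in FP32

optimizer.zero_grad()
optimizer.backward(loss)                    # replaces loss.backward()
optimizer.step()
```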
What GPU are you using? For those particular examples, I would only expect to see significant speedups on a device with Tensor Cores (Volta or Turing). Other architectures would benefit from the reduced bandwidth requirements of FP16, but the compute won’t be faster than FP32 (and for some Pascal cards like the 1080Ti, FP16 compute throughput is actually much slower).
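If you’re not sure whether your card has Tensor Cores, a quick check using standard PyTorch CUDA queries (Volta reports compute capability 7.0, Turing 7.5):

```python
import torch

# FP16 Tensor Cores first appeared at compute capability 7.0 (Volta);
# earlier parts run FP16 through other units.
major, minor = torch.cuda.get_device_capability(0)
name = torch.cuda.get_device_name(0)
if major >= 7:
    print(f"{name} (sm_{major}{minor}): has Tensor Cores, FP16 speedups expected")
else:
    print(f"{name} (sm_{major}{minor}): no Tensor Cores, FP16 mainly saves memory/bandwidth")
```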
In general, GPUs like contiguous tensors in which the beginning of each fastest-dim row is aligned to at least 32 bytes. The change you made may have helped with that requirement for some ops in the network, so the speedup you observed may have had nothing to do with cuDNN. Then again, it might also have made cuDNN’s padding job easier (cuDNN needs to transpose the data at certain points, and inserts padding while it transposes).
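To make the alignment point concrete, here’s a hypothetical helper (pad_last_dim_to_alignment is my own name, not anything in apex) that pads the fastest-varying dimension so each row starts on a 32-byte boundary:

```python
import torch

def pad_last_dim_to_alignment(t, alignment_bytes=32):
    elem = t.element_size()             # 2 bytes for FP16, 4 for FP32
    multiple = alignment_bytes // elem  # 16 elements per row for FP16
    last = t.shape[-1]
    padded = ((last + multiple - 1) // multiple) * multiple
    if padded == last:
        return t
    # F.pad with (left, right) pads the last dimension
    return torch.nn.functional.pad(t, (0, padded - last))

x = torch.randn(64, 3, 224, 100, device="cuda", dtype=torch.half)
y = pad_last_dim_to_alignment(x)
print(y.shape)  # torch.Size([64, 3, 224, 112])
```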
Alright, I’m going to be updating the documentation substantially anyway for the merge of my “Amp 1.0” release by the end of the month. I’m giving a webinar about that today, if you’re interested: https://info.nvidia.com/webinar-mixed-precision-with-pytorch-reg-page.html Sorry, I should have remembered to mention that earlier. I will post the presentation afterwards.
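For anyone landing here later, the Amp usage pattern looks roughly like this (a sketch; see the apex documentation for the exact opt_level semantics):

```python
import torch
from apex import amp

model = torch.nn.Linear(1024, 1000).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# "O1" is the typical mixed-precision mode: FP16 where safe, FP32 elsewhere.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

inputs = torch.randn(64, 1024).cuda()
loss = model(inputs).mean()

optimizer.zero_grad()
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()  # gradients are computed on the scaled loss
optimizer.step()
```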