Mixed precision training slower than FP32 training
I've been doing some experiments on CIFAR-10 with ResNets and decided to give APEX AMP a try.
However, I ran into some performance issues:
- AMP with PyTorch's torch.nn.parallel.DistributedDataParallel was extremely slow.
- AMP with apex.parallel.DistributedDataParallel was slower than the default training with torch.nn.parallel.DistributedDataParallel (no apex involved).
For reference, normal training took about 15 minutes, while apex AMP training took 21 minutes (90 epochs on CIFAR-10 with ResNet20).
I followed the installation instructions, but I couldn't install the C++ extensions because of my GCC/CUDA version. Could this explain the slowdown?
You can see the code here: https://github.com/braincreators/octconv/blob/34440209c4b37fb5198f75e4e8c052e92e80e85d/benchmarks/train.py#L1-L498
And run it (2 GPUs):
Without APEX AMP:
python -m torch.distributed.launch --nproc_per_node 2 train.py -c configs/cifar10/resnet20_small.yml --batch-size 128 --lr 0.1
With APEX AMP:
python -m torch.distributed.launch --nproc_per_node 2 train.py -c configs/cifar10/resnet20_small.yml --batch-size 128 --lr 0.1 --mixed-precision
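For context, here is a minimal sketch of how APEX AMP is typically wired into a distributed training loop. This is not the exact code from train.py; `build_model()` and `loader` are placeholders.

```python
import argparse
import torch
from apex import amp
from apex.parallel import DistributedDataParallel as DDP

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # set by torch.distributed.launch
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
torch.distributed.init_process_group(backend="nccl", init_method="env://")

model = build_model().cuda()                      # placeholder model constructor
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# O1 patches most ops to run in FP16 while keeping FP32 master weights.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
model = DDP(model)                                # wrap *after* amp.initialize

for images, targets in loader:                    # placeholder data loader
    images, targets = images.cuda(), targets.cuda()
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(images), targets)
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()                    # loss scaling avoids FP16 underflow
    optimizer.step()
```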
Issue Analytics
- Created: 4 years ago
- Comments: 8 (2 by maintainers)
The GTX 1080 Ti has low-rate FP16 performance.
If you want better FP16 performance, you need a GPU with the Volta architecture or newer, such as the RTX series.
Check this topic: https://devtalk.nvidia.com/default/topic/1023708/gpu-accelerated-libraries/fp16-support-on-gtx-1060-and-1080/
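A quick way to check whether a GPU can actually benefit from FP16 is to look at its compute capability: Tensor Cores require 7.0 or higher (Volta/Turing and newer), while the GTX 1080 Ti is Pascal (6.1), where FP16 throughput is only a small fraction of FP32. A small sketch:

```python
import torch

# Tensor Cores (fast FP16 matmul) require compute capability >= 7.0.
major, minor = torch.cuda.get_device_capability(0)
name = torch.cuda.get_device_name(0)
if major >= 7:
    print(f"{name} (sm_{major}{minor}): Tensor Cores available, FP16 should be fast")
else:
    print(f"{name} (sm_{major}{minor}): no Tensor Cores, mixed precision may not beat FP32")
```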
I noticed an "ImportError", so I reinstalled apex (with a different PyTorch version, 1.4) and hit another problem, a "version mismatch". Following https://github.com/NVIDIA/apex/pull/323 I deleted the version-matching check, and apex finally installed with no warnings.
However, when I ran my test code, training with O1 was still slower than with O0, while memory usage did decrease slightly. Is that normal?
| mode | memory | time |
|------|--------|-----------|
| O0   | 3855M  | 26s/epoch |
| O1   | 3557M  | 33s/epoch |
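Besides the GPU architecture, it is worth checking whether apex was actually built with its C++/CUDA extensions; a Python-only install falls back to slower unfused code paths, which can wipe out the AMP speedup. A minimal check, assuming the apex-internal module names `amp_C` and `apex_C` (these may vary between versions):

```python
# Returns True only if apex was installed with --cpp_ext --cuda_ext,
# i.e. the fused kernels used by AMP and apex.parallel are present.
def apex_extensions_available():
    try:
        import amp_C    # fused multi-tensor CUDA kernels
        import apex_C   # C++ helpers
        return True
    except ImportError:
        return False

print("apex C++/CUDA extensions available:", apex_extensions_available())
```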