
Mixed precision training slower than FP32 training

See original GitHub issue

I’ve been doing some experiments on CIFAR-10 with ResNets and decided to give APEX AMP a try.

However, I ran into some performance issues:

  1. AMP with PyTorch’s torch.nn.parallel.DistributedDataParallel was extremely slow.
  2. AMP with apex.parallel.DistributedDataParallel was slower than the default training with torch.nn.parallel.DistributedDataParallel (no apex involved). For reference, normal training took about 15 minutes, while apex AMP training took 21 minutes (90 epochs on CIFAR-10 with ResNet20).

I followed the installation instructions, but I couldn’t install the C++ extensions because of my GCC/CUDA version. Could that explain the slowdown?
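
(For context, the standard apex AMP setup, per the apex README, looks roughly like the sketch below. The Linear model and SGD optimizer are placeholders rather than the ResNet from the linked script, and the process group is assumed to come from torch.distributed.launch.)

    # Sketch of the usual apex AMP + DDP setup (placeholder model/optimizer,
    # not the actual train.py from the issue). Assumes launch via
    # torch.distributed.launch so the process-group env vars are set.
    import torch
    import torch.distributed as dist
    from apex import amp
    from apex.parallel import DistributedDataParallel as ApexDDP

    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(dist.get_rank())  # single node: rank == local rank

    model = torch.nn.Linear(32, 10).cuda()                    # placeholder model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    # O1 patches common ops to run in FP16 while keeping FP32 master weights.
    model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
    model = ApexDDP(model)  # wrap *after* amp.initialize

    inputs = torch.randn(128, 32, device="cuda")
    targets = torch.randint(0, 10, (128,), device="cuda")

    loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    # Scale the loss so small FP16 gradients do not underflow to zero.
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()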

You can see the code here: https://github.com/braincreators/octconv/blob/34440209c4b37fb5198f75e4e8c052e92e80e85d/benchmarks/train.py#L1-L498

To run it on 2 GPUs:

Without APEX AMP: python -m torch.distributed.launch --nproc_per_node 2 train.py -c configs/cifar10/resnet20_small.yml --batch-size 128 --lr 0.1

With APEX AMP: python -m torch.distributed.launch --nproc_per_node 2 train.py -c configs/cifar10/resnet20_small.yml --batch-size 128 --lr 0.1 --mixed-precision

Issue Analytics

  • State: open
  • Created: 4 years ago
  • Comments: 8 (2 by maintainers)

Top GitHub Comments

3 reactions
zsef123 commented on May 10, 2019

The GTX 1080 Ti has very low FP16 throughput.

If you want better FP16 performance, you need a GPU with the Volta architecture or an RTX-series card.

Check this topic: https://devtalk.nvidia.com/default/topic/1023708/gpu-accelerated-libraries/fp16-support-on-gtx-1060-and-1080/
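
(One quick way to check whether a card has fast FP16 at all: Tensor Cores require compute capability 7.0 or higher, i.e. Volta/Turing and newer, while Pascal cards like the 1080 Ti report 6.1. A minimal sketch:)

    # Check whether the GPU supports fast FP16 math.
    import torch

    name = torch.cuda.get_device_name(0)
    major, minor = torch.cuda.get_device_capability(0)
    # Tensor Cores (fast FP16 matrix math) need compute capability >= 7.0;
    # the GTX 1080 Ti is Pascal (sm_61), where FP16 runs far below FP32 rate.
    if (major, minor) >= (7, 0):
        print(f"{name} (sm_{major}{minor}): fast FP16 available")
    else:
        print(f"{name} (sm_{major}{minor}): expect FP16 no faster than FP32")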

0 reactions
patrickpjiang commented on Mar 24, 2020

I noticed an “ImportError”, so I reinstalled apex (with a different PyTorch version, 1.4) and hit another problem, a “version mismatch” error. Following https://github.com/NVIDIA/apex/pull/323, I deleted the version-matching check and finally installed with no warnings.

However, when I ran my test code, training time was still longer with O1 than with O0, while memory usage did decrease slightly. Is that normal?

mode  memory  time
O0    3855M   26s/epoch
O1    3557M   33s/epoch
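
(For what it’s worth, a fair O0-vs-O1 comparison needs torch.cuda.synchronize() around the timed region, since CUDA kernel launches are asynchronous. A rough standalone sketch, with a hypothetical file name and a placeholder model; run it once per opt level, because amp.initialize is meant to be called only once per process:)

    # time_amp.py -- rough timing sketch, run as:
    #   python time_amp.py O0
    #   python time_amp.py O1
    import sys
    import time
    import torch
    from apex import amp

    opt_level = sys.argv[1] if len(sys.argv) > 1 else "O0"

    model = torch.nn.Linear(1024, 1024).cuda()   # placeholder model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    model, optimizer = amp.initialize(model, optimizer, opt_level=opt_level)

    x = torch.randn(128, 1024, device="cuda")

    torch.cuda.synchronize()                     # drain queued kernels first
    start = time.time()
    for _ in range(100):
        optimizer.zero_grad()
        loss = model(x).sum()
        # Under O0 scale_loss is effectively a pass-through.
        with amp.scale_loss(loss, optimizer) as scaled_loss:
            scaled_loss.backward()
        optimizer.step()
    torch.cuda.synchronize()                     # wait before reading the clock
    print(f"{opt_level}: {time.time() - start:.2f}s for 100 steps")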

Read more comments on GitHub.

Top Results From Across the Web

  • Train With Mixed Precision - NVIDIA Documentation Center
    Mixed precision training achieves all these benefits while ensuring that no task-specific accuracy is lost compared to full precision training.
  • Using mixed precision training with Gradient - Paperspace Blog
    This blog post details the concept of mixed precision training, its benefits ... In effect, using FP32 will lead to a better overall...
  • Slower Mixed precision than fp32 on 2080 Ti RTX
    In both servers, I time pure fp32 computation as being significantly faster than mixed precision and I can't work out why.
  • How To Fit a Bigger Model and Train It Faster - Hugging Face
    If you own Ampere or newer hardware you can start using bf16 for your training and evaluation. While bf16 has a worse precision...
  • Automatic Mixed Precision (AMP) Training
    Mixed Precision Training (ICLR 2018). [2] M. Conley, M. Sun et al. ... Mixed-Precision. A mixture of FP16 and FP32. ...
