Mixed precision training slower than FP32 training
I've been doing some experiments on CIFAR-10 with ResNets and decided to give APEX AMP a try.
However, I ran into some performance issues:
- AMP with PyTorch's torch.nn.parallel.DistributedDataParallel was extremely slow.
- AMP with apex.parallel.DistributedDataParallel was slower than the default training with torch.nn.parallel.DistributedDataParallel (no apex involved).
For reference, normal training took about 15 minutes, while apex AMP training took 21 minutes (90 epochs on CIFAR-10 with ResNet20).
I followed the installation instructions, but I couldn't install the C++ extensions because of my GCC/CUDA version. Could this explain the slowdown?
You can see the code here: https://github.com/braincreators/octconv/blob/34440209c4b37fb5198f75e4e8c052e92e80e85d/benchmarks/train.py#L1-L498
And run it (2 GPUs):
Without APEX AMP:
python -m torch.distributed.launch --nproc_per_node 2 train.py -c configs/cifar10/resnet20_small.yml --batch-size 128 --lr 0.1
With APEX AMP:
python -m torch.distributed.launch --nproc_per_node 2 train.py -c configs/cifar10/resnet20_small.yml --batch-size 128 --lr 0.1 --mixed-precision
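For context, here is a minimal sketch of how APEX AMP is typically wired into a distributed training loop. This is not the exact code from train.py; `build_model()` and `loader` are placeholders.

```python
import argparse
import torch
from apex import amp
from apex.parallel import DistributedDataParallel as DDP

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # set by torch.distributed.launch
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
torch.distributed.init_process_group(backend="nccl", init_method="env://")

model = build_model().cuda()                      # placeholder model constructor
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# O1 patches most ops to run in FP16 while keeping FP32 master weights.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
model = DDP(model)                                # wrap *after* amp.initialize

for images, targets in loader:                    # placeholder data loader
    images, targets = images.cuda(), targets.cuda()
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(images), targets)
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()                    # loss scaling avoids FP16 underflow
    optimizer.step()
```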
Issue Analytics
- Created: 4 years ago
- Comments: 8 (2 by maintainers)
The GTX 1080 Ti has low-rate FP16 performance.
If you want better FP16 performance, you need a GPU with the Volta architecture or newer, such as the RTX series.
Check this topic: https://devtalk.nvidia.com/default/topic/1023708/gpu-accelerated-libraries/fp16-support-on-gtx-1060-and-1080/
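A quick way to check whether a GPU can actually benefit from FP16 is to look at its compute capability: Tensor Cores require 7.0 or higher (Volta/Turing and newer), while the GTX 1080 Ti is Pascal (6.1), where FP16 throughput is only a small fraction of FP32. A small sketch:

```python
import torch

# Tensor Cores (fast FP16 matmul) require compute capability >= 7.0.
major, minor = torch.cuda.get_device_capability(0)
name = torch.cuda.get_device_name(0)
if major >= 7:
    print(f"{name} (sm_{major}{minor}): Tensor Cores available, FP16 should be fast")
else:
    print(f"{name} (sm_{major}{minor}): no Tensor Cores, mixed precision may not beat FP32")
```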
I noticed an "ImportError", so I reinstalled apex (with a different PyTorch version, 1.4) and hit another problem, a "version mismatch". Following https://github.com/NVIDIA/apex/pull/323 I deleted the version-matching check, and apex finally installed with no warnings.
However, when I ran my test code, training with O1 was still slower than with O0, while memory usage did decrease slightly. Is that normal?
| mode | memory | time |
|------|--------|-----------|
| O0   | 3855M  | 26s/epoch |
| O1   | 3557M  | 33s/epoch |
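Besides the GPU architecture, it is worth checking whether apex was actually built with its C++/CUDA extensions; a Python-only install falls back to slower unfused code paths, which can wipe out the AMP speedup. A minimal check, assuming the apex-internal module names `amp_C` and `apex_C` (these may vary between versions):

```python
# Returns True only if apex was installed with --cpp_ext --cuda_ext,
# i.e. the fused kernels used by AMP and apex.parallel are present.
def apex_extensions_available():
    try:
        import amp_C    # fused multi-tensor CUDA kernels
        import apex_C   # C++ helpers
        return True
    except ImportError:
        return False

print("apex C++/CUDA extensions available:", apex_extensions_available())
```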