Using FP16_Optimizer does not speed things up much
I ran the following scripts and compared their logs.
fp32 training:
python main_fp16_optimizer.py /workspace/data/imagenet
and fp16 mixed-precision training:
python main_fp16_optimizer.py /workspace/data/imagenet --fp16
Here are their logs. fp32 training:
Epoch: [0][10/1563] Time 0.211 (0.507) Speed 151.834 (63.162) Data 0.001 (0.075) Loss 7.0819 (7.0585) Prec@1 0.000 (0.000) Prec@5 0.000 (0.000)
and fp16 mixed-precision training:
Epoch: [0][10/1563] Time 0.220 (0.530) Speed 145.334 (60.358) Data 0.001 (0.068) Loss 7.1602 (7.0614) Prec@1 0.000 (0.852) Prec@5 0.000 (1.136)
It’s easy to see that the mixed-precision version isn’t much faster. Is something wrong?
By the way, I’m using a single GPU. Thanks.
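For context, my understanding of what the --fp16 path in that example does, as a minimal sketch (exact arguments in main_fp16_optimizer.py may differ; the model here is a stand-in for ResNet):

```python
import torch
from apex.fp16_utils import FP16_Optimizer, network_to_half

model = torch.nn.Linear(1024, 1000).cuda()  # stand-in for the real network
model = network_to_half(model)              # casts params/activations to FP16

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# FP16_Optimizer keeps FP32 master weights and handles loss scaling.
optimizer = FP16_Optimizer(optimizer, dynamic_loss_scale=True)

inputs = torch.randn(64, 1024).cuda()       # the wrapper casts input to half
loss = model(inputs).float().mean()         # compute the loss in FP32

optimizer.zero_grad()
optimizer.backward(loss)                    # replaces loss.backward()
optimizer.step()
```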
What GPU are you using? For those particular examples, I would only expect to see significant speedups on a device with Tensor Cores (Volta or Turing). Other architectures would benefit from the reduced bandwidth requirements of FP16, but the compute won’t be faster than FP32 (and for some Pascal cards like the 1080Ti, FP16 compute throughput is actually much slower).
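If you’re not sure whether your card has Tensor Cores, a quick check using standard PyTorch CUDA queries (Volta reports compute capability 7.0, Turing 7.5):

```python
import torch

# FP16 Tensor Cores first appeared at compute capability 7.0 (Volta);
# earlier parts run FP16 through other units.
major, minor = torch.cuda.get_device_capability(0)
name = torch.cuda.get_device_name(0)
if major >= 7:
    print(f"{name} (sm_{major}{minor}): has Tensor Cores, FP16 speedups expected")
else:
    print(f"{name} (sm_{major}{minor}): no Tensor Cores, FP16 mainly saves memory/bandwidth")
```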
In general, GPUs like contiguous tensors in which the beginning of each fastest-dim row is aligned to at least 32 bytes. The change you made may have helped with that requirement for some ops in the network, so the speedup you observed may have had nothing to do with cuDNN. Then again, it might also have made cuDNN’s padding job easier (cuDNN needs to transpose the data at certain points, and inserts padding while it transposes).
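To make the alignment point concrete, here’s a hypothetical helper (pad_last_dim_to_alignment is my own name, not anything in apex) that pads the fastest-varying dimension so each row starts on a 32-byte boundary:

```python
import torch

def pad_last_dim_to_alignment(t, alignment_bytes=32):
    elem = t.element_size()             # 2 bytes for FP16, 4 for FP32
    multiple = alignment_bytes // elem  # 16 elements per row for FP16
    last = t.shape[-1]
    padded = ((last + multiple - 1) // multiple) * multiple
    if padded == last:
        return t
    # F.pad with (left, right) pads the last dimension
    return torch.nn.functional.pad(t, (0, padded - last))

x = torch.randn(64, 3, 224, 100, device="cuda", dtype=torch.half)
y = pad_last_dim_to_alignment(x)
print(y.shape)  # torch.Size([64, 3, 224, 112])
```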
Alright, I’m going to be updating the documentation substantially anyway for the merge of my “Amp 1.0” release by the end of the month. I’m giving a webinar about that today, if you’re interested: https://info.nvidia.com/webinar-mixed-precision-with-pytorch-reg-page.html Sorry, I should have remembered to mention that earlier. I will post the presentation afterwards.
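For anyone landing here later, the Amp usage pattern looks roughly like this (a sketch; see the apex documentation for the exact opt_level semantics):

```python
import torch
from apex import amp

model = torch.nn.Linear(1024, 1000).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# "O1" is the typical mixed-precision mode: FP16 where safe, FP32 elsewhere.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

inputs = torch.randn(64, 1024).cuda()
loss = model(inputs).mean()

optimizer.zero_grad()
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()  # gradients are computed on the scaled loss
optimizer.step()
```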