
Understanding FP16 training: why slower on Tesla GPU?

Hello fairseq team!

I’ve been experimenting with FP16 vs. FP32, but I haven’t been able to figure out how to configure fairseq properly to speed up training. I’m not talking about the fairseq-train utility; I’m trying to reproduce the right conditions in a much simpler environment. Here’s a minimal script that reproduces my problem: fairseq_fp16_test.py.txt.

So basically I create a transformer model with a dummy test set, run 100 training steps, and print the time taken.
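For reference, here is a rough, hypothetical PyTorch-only sketch of that kind of timing comparison. It is not the attached fairseq_fp16_test.py.txt script; the model size, batch size, step count and dummy loss are illustrative assumptions.

# Hypothetical sketch of the FP16 vs. FP32 timing comparison; NOT the attached
# fairseq script. Model size, batch size and the dummy loss are assumptions.
import time
import torch
import torch.nn as nn

def time_training(fp16: bool, steps: int = 100) -> float:
    device = torch.device("cuda")
    model = nn.Transformer(d_model=512, nhead=8,
                           num_encoder_layers=6, num_decoder_layers=6).to(device)
    if fp16:
        model = model.half()              # pure FP16 weights, as in the snippet below
    dtype = torch.float16 if fp16 else torch.float32
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    # dummy batch of shape (seq_len, batch, d_model)
    src = torch.randn(32, 64, 512, device=device, dtype=dtype)
    tgt = torch.randn(32, 64, 512, device=device, dtype=dtype)

    torch.cuda.synchronize()
    start = time.time()
    for _ in range(steps):
        optimizer.zero_grad()
        out = model(src, tgt)
        loss = out.float().pow(2).mean()  # dummy loss, computed in FP32
        loss.backward()
        optimizer.step()
    torch.cuda.synchronize()
    return time.time() - start

for fp16 in (False, True):
    print(f"FP16={fp16}, time={time_training(fp16):.6f}")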

On a GTX 1080 Ti the result is the following:

[...]
FP16=False, time=8.963897
[...]
FP16=True, time=16.223331

And that’s expected, because the GTX 1080 Ti has no fast FP16 support (no Tensor Cores).

However, here’s the result on a Tesla T4 (which supports FP16 as far as I know):

[...]
FP16=False, time=12.849469
[...]
FP16=True, time=13.555646

That’s strange! I was not able to speed up training; on the contrary, it got slightly slower! Looking at the script, you’ll notice that the changes are truly minimal:

# line 63: cast the model weights to FP16
    if args.fp16:
        model = model.half()

# line 74: wrap the optimizer so the parameter update runs in FP32
    if args.fp16:
        optimizer = optim.FP16Optimizer.build_optimizer(args, params)
    else:
        optimizer = optim.build_optimizer(args, params)

What is the problem with the current implementation? Why am I not able to see the expected speedup?

Thanks in advance for all your help!

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

3 reactions
myleott commented, Aug 20, 2019

Ah, right, I didn’t notice that. Yes, FP16 adds overhead, so it’s only useful for larger batch sizes.

In general, pure FP16 training hurts model quality quite a bit. The “mixed precision” recipe recommended by Nvidia is to keep both an FP32 and an FP16 copy of the model, run the forward/backward pass and loss computation in FP16, and then perform the optimizer step and parameter update in FP32. This is what --fp16 does. It adds overhead in both speed and memory, but typically Tensor Cores are fast enough to overcome that (especially for larger batch sizes).
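As a rough illustration of that master-weights recipe, here is a hypothetical plain-PyTorch sketch with a static loss scale and a dummy model. It is not fairseq’s actual FP16Optimizer, which also does dynamic loss scaling.

# Hypothetical sketch of the "FP32 master weights" mixed-precision recipe
# described above; model, data and the static loss scale are dummies.
import torch

device = torch.device("cuda")
model = torch.nn.Linear(512, 512).to(device).half()        # FP16 working copy
master = [p.detach().clone().float().requires_grad_(True)  # FP32 master copy
          for p in model.parameters()]
optimizer = torch.optim.Adam(master, lr=1e-4)               # optimizer sees FP32 params
loss_scale = 128.0                                          # static scale (assumption)

for _ in range(100):
    x = torch.randn(64, 512, device=device, dtype=torch.float16)
    loss = model(x).pow(2).mean()                           # forward in FP16
    (loss * loss_scale).backward()                          # backward in FP16

    # move the FP16 grads onto the FP32 master params, undoing the loss scale
    for p, mp in zip(model.parameters(), master):
        mp.grad = p.grad.detach().float() / loss_scale
        p.grad = None

    optimizer.step()                                        # parameter update in FP32
    optimizer.zero_grad()

    with torch.no_grad():                                   # copy back to the FP16 copy
        for p, mp in zip(model.parameters(), master):
            p.copy_(mp)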

We also have a variant that is closer to pure FP16, called --memory-efficient-fp16. It does pure FP16 for the forward/backward pass and the loss, but casts to FP32 on the fly in the optimizer when updating the momentum buffers. It may still be slower (not sure for your problem size), but it is typically a lot more memory efficient, since you don’t maintain an FP32 copy of the model.
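A hypothetical, simplified illustration of that on-the-fly casting, using SGD with momentum rather than fairseq’s actual MemoryEfficientFP16Optimizer:

# Hypothetical illustration of the --memory-efficient-fp16 idea: parameters and
# gradients stay in FP16; only inside the update are they cast to FP32 so the
# momentum buffer is accumulated in FP32. Simplified SGD with momentum, not
# fairseq's actual implementation.
import torch

@torch.no_grad()
def memory_efficient_sgd_step(params, momentum_buffers, lr=0.1, momentum=0.9):
    for p, buf in zip(params, momentum_buffers):
        if p.grad is None:
            continue
        g32 = p.grad.float()             # cast the FP16 grad to FP32 on the fly
        buf.mul_(momentum).add_(g32)     # momentum buffer kept in FP32
        p.add_((-lr * buf).half())       # write the update back as FP16

# usage with a dummy FP16 model
device = torch.device("cuda")
model = torch.nn.Linear(512, 512).to(device).half()
buffers = [torch.zeros_like(p, dtype=torch.float32) for p in model.parameters()]
x = torch.randn(64, 512, device=device, dtype=torch.float16)
model(x).pow(2).mean().backward()
memory_efficient_sgd_step(list(model.parameters()), buffers)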

0 reactions
gvskalyan commented, Aug 31, 2019

@davidecaroselli can you please try conda install pytorch cudatoolkit=10.0 -c pytorch and run the test?
