Understanding FP16 training: why slower on Tesla GPU?
Hello fairseq team!
I’ve been experimenting with FP16 vs. FP32, but I haven’t understood how to configure fairseq properly in order to speed up training. I’m not talking about the fairseq-train
utility; I’m trying to reproduce the right conditions in a much simpler environment. Here’s a minimal script that reproduces my problem: fairseq_fp16_test.py.txt.
Basically, I create a Transformer model with a dummy dataset, run 100 training steps, and print the time needed.
On a GTX 1080 Ti the result is the following:
[...]
FP16=False, time=8.963897
[...]
FP16=True, time=16.223331
And that’s fine, because the 1080 Ti has no hardware acceleration for FP16.
However, here’s the result on a Tesla T4 (which supports FP16, as far as I know):
[...]
FP16=False, time=12.849469
[...]
FP16=True, time=13.555646
That’s strange! I was not able to speed up training; on the contrary, the time slightly increased. Looking at the script, you’ll notice that the changes are truly minimal:
# line 63
if args.fp16:
    model = model.half()

# line 74
if args.fp16:
    optimizer = optim.FP16Optimizer.build_optimizer(args, params)
else:
    optimizer = optim.build_optimizer(args, params)
What is the problem with the current implementation? Why am I not able to see the desired speedup?
Thanks in advance for all your help!
Top GitHub Comments
Ah, right, I didn’t notice that. Yes, FP16 adds overhead, so it’s only useful for larger batch sizes.
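To illustrate the point (a hypothetical micro-benchmark, not part of the original script): on a T4 the FP16 advantage only shows up once the underlying matrix multiplications are large enough to keep the Tensor Cores busy; for small sizes the casting and bookkeeping overhead dominates.

# Hypothetical micro-benchmark: FP16 vs. FP32 matmul time at different sizes.
import time
import torch

def bench(n, dtype, iters=50):
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    return time.time() - start

for n in (256, 1024, 4096):
    t32 = bench(n, torch.float32)
    t16 = bench(n, torch.float16)
    print(f"n={n}: fp32={t32:.3f}s  fp16={t16:.3f}s  speedup={t32 / t16:.2f}x")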
In general, pure FP16 training hurts model quality quite a bit. The “mixed precision” recipe recommended by Nvidia is to keep both an FP32 and an FP16 copy of the model, do the forward/backward and loss computation in FP16, then do the optimization and update the model parameters in FP32. This is what --fp16 does. It adds overhead both in speed and memory, but Tensor Cores are typically fast enough to overcome that (especially for larger batch sizes).
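Roughly, the recipe looks like this (a minimal sketch of the idea, with a toy model, a toy loss, and a fixed loss scale as placeholders; it is not fairseq's actual FP16Optimizer):

# Sketch of NVIDIA-style mixed precision: FP16 forward/backward, FP32 master weights.
import torch

model = torch.nn.Linear(1024, 1024).cuda().half()          # FP16 copy for forward/backward
master_params = [p.detach().clone().float() for p in model.parameters()]  # FP32 master copy
for p in master_params:
    p.requires_grad = True
optimizer = torch.optim.Adam(master_params, lr=1e-4)        # optimizer state lives in FP32
loss_scale = 128.0                                          # static loss scale (illustrative value)

for _ in range(100):
    x = torch.randn(64, 1024, device="cuda", dtype=torch.float16)
    loss = model(x).float().pow(2).mean() * loss_scale      # forward in FP16, toy loss, scaled
    model.zero_grad()
    loss.backward()                                         # backward in FP16
    for p32, p16 in zip(master_params, model.parameters()):
        p32.grad = p16.grad.float() / loss_scale            # unscale and copy grads to FP32
    optimizer.step()                                        # parameter update in FP32
    for p32, p16 in zip(master_params, model.parameters()):
        p16.data.copy_(p32.data)                            # sync the FP16 model with the masters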
We also have a variant that is closer to pure FP16, called --memory-efficient-fp16. It does pure FP16 for the forward/backward/loss, but casts to FP32 on the fly in the optimizer when updating the momentum buffers. It may still be slower (not sure for your problem size), but it is typically a lot more memory efficient since you don’t maintain an FP32 copy of the model.
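Conceptually, the memory-efficient variant does something like the following inside the optimizer step (again just an illustrative SGD-with-momentum sketch, not the actual implementation):

# Sketch of the "memory-efficient" idea: no persistent FP32 copy of the model;
# gradients are cast to FP32 only for the duration of the update, and only the
# momentum buffers are kept in FP32.
import torch

def memory_efficient_sgd_step(params, momentum_buffers, lr=1e-3, momentum=0.9, loss_scale=128.0):
    """params: FP16 parameters with FP16 grads; momentum_buffers: dict of FP32 tensors."""
    for i, p in enumerate(params):
        grad32 = p.grad.float() / loss_scale                     # cast the gradient to FP32 on the fly
        buf = momentum_buffers.setdefault(i, torch.zeros_like(grad32))
        buf.mul_(momentum).add_(grad32)                          # momentum accumulated in FP32
        p.data.add_(buf.to(p.dtype), alpha=-lr)                  # update written back to the FP16 weights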
@davidecaroselli can you please try
conda install pytorch cudatoolkit=10.0 -c pytorch
and run the test?
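If it helps, here is a quick way to check which CUDA/cuDNN build your PyTorch binary was compiled against (my assumption being that the reinstall is meant to get a CUDA 10 build with fast FP16 kernels for the T4):

import torch
print(torch.__version__)                 # PyTorch version
print(torch.version.cuda)                # CUDA toolkit the binary was built with
print(torch.backends.cudnn.version())    # cuDNN version
print(torch.cuda.get_device_name(0))     # should report "Tesla T4"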