Understanding FP16 training: why slower on Tesla GPU?
Hello fairseq team!
I’ve been experimenting with FP16 vs. FP32, but I haven’t understood how to configure fairseq properly in order to speed up training. I’m not talking about the fairseq-train
utility; I’m trying to reproduce the right conditions in a much simpler environment. Here’s a minimal script that reproduces my problem: fairseq_fp16_test.py.txt.
Basically, I create a Transformer model with a dummy dataset, run 100 training steps, and print the time needed.
On a GTX 1080 Ti the result is the following:
[...]
FP16=False, time=8.963897
[...]
FP16=True, time=16.223331
And that’s fine, because the 1080 Ti has no hardware acceleration for FP16.
However, here’s the result on a Tesla T4 (which supports FP16, as far as I know):
[...]
FP16=False, time=12.849469
[...]
FP16=True, time=13.555646
That’s strange! I was not able to speed up training; on the contrary, the time slightly increased. Looking at the script, you’ll notice that the changes are truly minimal:
# line 63
if args.fp16:
    model = model.half()

# line 74
if args.fp16:
    optimizer = optim.FP16Optimizer.build_optimizer(args, params)
else:
    optimizer = optim.build_optimizer(args, params)
What is the problem with the current implementation? Why am I not able to see the desired speedup?
Thanks in advance for all your help!
Top GitHub Comments
Ah, right, I didn’t notice that. Yes, FP16 adds overhead, so it’s only useful for larger batch sizes.
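To illustrate the point (a hypothetical micro-benchmark, not part of the original script): on a T4 the FP16 advantage only shows up once the underlying matrix multiplications are large enough to keep the Tensor Cores busy; for small sizes the casting and bookkeeping overhead dominates.

# Hypothetical micro-benchmark: FP16 vs. FP32 matmul time at different sizes.
import time
import torch

def bench(n, dtype, iters=50):
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    return time.time() - start

for n in (256, 1024, 4096):
    t32 = bench(n, torch.float32)
    t16 = bench(n, torch.float16)
    print(f"n={n}: fp32={t32:.3f}s  fp16={t16:.3f}s  speedup={t32 / t16:.2f}x")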
In general, pure FP16 training hurts model quality quite a bit. The “mixed precision” recipe recommended by Nvidia is to keep both an FP32 and an FP16 copy of the model, do the forward/backward and loss computation in FP16, then do the optimization and update the model parameters in FP32. This is what --fp16 does. It adds overhead both in speed and memory, but Tensor Cores are typically fast enough to overcome that (especially for larger batch sizes).
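Roughly, the recipe looks like this (a minimal sketch of the idea, with a toy model, a toy loss, and a fixed loss scale as placeholders; it is not fairseq's actual FP16Optimizer):

# Sketch of NVIDIA-style mixed precision: FP16 forward/backward, FP32 master weights.
import torch

model = torch.nn.Linear(1024, 1024).cuda().half()          # FP16 copy for forward/backward
master_params = [p.detach().clone().float() for p in model.parameters()]  # FP32 master copy
for p in master_params:
    p.requires_grad = True
optimizer = torch.optim.Adam(master_params, lr=1e-4)        # optimizer state lives in FP32
loss_scale = 128.0                                          # static loss scale (illustrative value)

for _ in range(100):
    x = torch.randn(64, 1024, device="cuda", dtype=torch.float16)
    loss = model(x).float().pow(2).mean() * loss_scale      # forward in FP16, toy loss, scaled
    model.zero_grad()
    loss.backward()                                         # backward in FP16
    for p32, p16 in zip(master_params, model.parameters()):
        p32.grad = p16.grad.float() / loss_scale            # unscale and copy grads to FP32
    optimizer.step()                                        # parameter update in FP32
    for p32, p16 in zip(master_params, model.parameters()):
        p16.data.copy_(p32.data)                            # sync the FP16 model with the masters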
We also have a variant that is closer to pure FP16, called --memory-efficient-fp16. It does pure FP16 for the forward/backward/loss, but casts to FP32 on the fly in the optimizer when updating the momentum buffers. It may still be slower (not sure for your problem size), but it is typically a lot more memory efficient since you don’t maintain an FP32 copy of the model.
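Conceptually, the memory-efficient variant does something like the following inside the optimizer step (again just an illustrative SGD-with-momentum sketch, not the actual implementation):

# Sketch of the "memory-efficient" idea: no persistent FP32 copy of the model;
# gradients are cast to FP32 only for the duration of the update, and only the
# momentum buffers are kept in FP32.
import torch

def memory_efficient_sgd_step(params, momentum_buffers, lr=1e-3, momentum=0.9, loss_scale=128.0):
    """params: FP16 parameters with FP16 grads; momentum_buffers: dict of FP32 tensors."""
    for i, p in enumerate(params):
        grad32 = p.grad.float() / loss_scale                     # cast the gradient to FP32 on the fly
        buf = momentum_buffers.setdefault(i, torch.zeros_like(grad32))
        buf.mul_(momentum).add_(grad32)                          # momentum accumulated in FP32
        p.data.add_(buf.to(p.dtype), alpha=-lr)                  # update written back to the FP16 weights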
@davidecaroselli can you please try
conda install pytorch cudatoolkit=10.0 -c pytorch
and run the test?
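If it helps, here is a quick way to check which CUDA/cuDNN build your PyTorch binary was compiled against (my assumption being that the reinstall is meant to get a CUDA 10 build with fast FP16 kernels for the T4):

import torch
print(torch.__version__)                 # PyTorch version
print(torch.version.cuda)                # CUDA toolkit the binary was built with
print(torch.backends.cudnn.version())    # cuDNN version
print(torch.cuda.get_device_name(0))     # should report "Tesla T4"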