RFC: `torch==1.12` will toggle `torch.backends.cuda.matmul.allow_tf32` to `False` - what should we do?
Ampere GPUs added a new mode called TF32. PyTorch created a new flag, `torch.backends.cuda.matmul.allow_tf32`, to support enabling the TF32 mode, and it has been `True` by default in PyTorch since it was added.
Having this mode on means that matrix multiplications whose inputs are in FP32 are actually performed in TF32, which makes the math significantly faster, albeit less precise (TF32 has the dynamic range of BF16 and the precision of FP16).
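To make the precision trade-off concrete, here is a small, self-contained sketch that simulates TF32's 10-bit mantissa by truncating an IEEE-754 binary32 value (real TF32 hardware rounds rather than truncates, so this only approximates the effect; the function name is illustrative):

```python
import struct

def tf32_round(x: float) -> float:
    """Approximate TF32 by keeping only 10 explicit mantissa bits."""
    # Reinterpret as IEEE-754 binary32 and clear the low 13 of the
    # 23 mantissa bits (23 - 10 = 13), leaving FP16-like precision
    # while keeping the full 8-bit FP32/BF16 exponent range.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= ~0x1FFF & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]

print(tf32_round(1.0))  # exactly representable, unchanged: 1.0
print(tf32_round(0.1))  # low mantissa bits lost: ~0.0999756
```

The relative error introduced is at most about 2^-10, which is exactly why TF32 is fine for most DL training but a problem for precision-sensitive non-DL workloads.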
The NVIDIA engineers have run many experiments and found that Deep Learning training accuracy is not negatively impacted by using TF32 instead of FP32 (and is often better), while TF32 provides a significant speed up. It's easy to see why from the A100 spec:
| Precision | A100 peak throughput |
|-----------|----------------------|
| FP32      | 19.5 TFLOPS          |
| TF32      | 156 TFLOPS           |

(numbers without sparsity)
[Accuracy tables from *Accelerating AI Training with NVIDIA TF32 Tensor Cores* appeared here as images in the original issue.]
However, the lost precision is a problem for some non-DL applications. Therefore, starting from PyTorch 1.12 (already in the nightlies), the default for `torch.backends.cuda.matmul.allow_tf32` will be `False`, which won't make training accuracy worse, but will make fp32 training significantly slower. So if you believe we should remain consistent/backward compatible, most likely we should turn it back on for pt > 1.11:
```python
from packaging import version
import torch

if version.parse(torch.__version__) > version.parse("1.11"):
    torch.backends.cuda.matmul.allow_tf32 = True
```

at a single point that always gets executed for PyTorch users.
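Since `packaging` version comparisons have a couple of edge cases, here is a torch-free sketch of the gate above showing which version strings it matches (the function name is illustrative):

```python
from packaging import version

def needs_tf32_reenable(torch_version: str) -> bool:
    # Mirrors the proposed gate: True for any release after 1.11,
    # where PyTorch is expected to default allow_tf32 to False.
    return version.parse(torch_version) > version.parse("1.11")

# Note: 1.11.x patch releases also compare greater than "1.11",
# but re-enabling there is harmless since the default is still True.
print(needs_tf32_reenable("1.12.0"))  # True
print(needs_tf32_reenable("1.11.0"))  # False ("1.11.0" == "1.11")
```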
The question is whether this should be done:
- Not at all - let the user sort it out
- Transformers-wide
- Only in the HF Trainer (and Accelerate); and if it isn't done there, add a new flag to let the user control the behavior
Additionally, other usage modes should be kept in sync:
- PyTorch/XLA (some other flag?)
Currently tf32 and how to flip it on/off is documented here: https://huggingface.co/docs/transformers/performance#tf32
A detailed discussion with multiple links to other related resources is here: https://dev-discuss.pytorch.org/t/pytorch-and-tensorfloat32/504
Issue analytics: created a year ago · Reactions: 1 · Comments: 15 (10 by maintainers)
This is very complicated: on the one hand, we don't want to change the PyTorch default and surprise the user, but on the other hand, we don't want most of our beginner users to experience degraded training performance on most GPUs without knowing why (as this change will be hidden in the PyTorch release notes).
I'm also in favor of not touching PyTorch's default (the same way we don't turn on things like `torch.backends.cudnn.benchmark` or `torch.backends.cudnn.deterministic`) and leaving it to the user, but we do need proper documentation. I'm also in favor of having a `TrainingArguments` flag to make it easier for the user to turn on in our examples.

Small point of clarification: we have not changed the default to `False` at this time, but expect to do so in the future.
Agreed! This is the principle that motivated this change.
We will also have user-facing documentation beyond the release notes when this change lands in a PyTorch release, because we agree it has the potential to be surprising and disruptive to current Ampere users. We'll also provide a recommendation for developers when making this change in the nightlies.