RFC: `torch==1.12` will toggle `torch.backends.cuda.matmul.allow_tf32` to `False` - what should we do?
Ampere GPUs added a new mode called TF32. PyTorch created a new flag, `torch.backends.cuda.matmul.allow_tf32`, to support enabling the TF32 mode, and it has been `True` by default in PyTorch since it was added.
Having this mode on means that matrix multiplications whose inputs are in FP32 are actually performed in TF32, which makes the math significantly faster, albeit less precise (TF32 has the dynamic range of BF16 and the precision of FP16).
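To make the precision trade-off concrete, here is a small, self-contained sketch that simulates TF32's 10-bit mantissa by truncating an IEEE-754 binary32 value (real TF32 hardware rounds rather than truncates, so this only approximates the effect; the function name is illustrative):

```python
import struct

def tf32_round(x: float) -> float:
    """Approximate TF32 by keeping only 10 explicit mantissa bits."""
    # Reinterpret as IEEE-754 binary32 and clear the low 13 of the
    # 23 mantissa bits (23 - 10 = 13), leaving FP16-like precision
    # while keeping the full 8-bit FP32/BF16 exponent range.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= ~0x1FFF & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]

print(tf32_round(1.0))  # exactly representable, unchanged: 1.0
print(tf32_round(0.1))  # low mantissa bits lost: ~0.0999756
```

The relative error introduced is at most about 2^-10, which is exactly why TF32 is fine for most DL training but a problem for precision-sensitive non-DL workloads.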
The NVIDIA engineers have run many experiments and found that Deep Learning training accuracy is not negatively impacted by using TF32 instead of FP32 (and is often better), while TF32 provides a significant speed up. It's easy to see why from the A100 spec:
| Precision | A100 peak throughput |
|-----------|----------------------|
| FP32      | 19.5 TFLOPS          |
| TF32      | 156 TFLOPS           |

(numbers without sparsity)
[Accuracy tables from *Accelerating AI Training with NVIDIA TF32 Tensor Cores* appeared here as images in the original issue.]
However, the lost precision is a problem for some non-DL applications. Therefore, starting from PyTorch 1.12 (already in the nightlies), the default for `torch.backends.cuda.matmul.allow_tf32` will be `False`, which won't make training accuracy worse, but will make fp32 training significantly slower. So if you believe we should remain consistent/backward compatible, most likely we should turn it back on for pt > 1.11:
```python
from packaging import version
import torch

if version.parse(torch.__version__) > version.parse("1.11"):
    torch.backends.cuda.matmul.allow_tf32 = True
```

at a single point that always gets executed for PyTorch users.
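Since `packaging` version comparisons have a couple of edge cases, here is a torch-free sketch of the gate above showing which version strings it matches (the function name is illustrative):

```python
from packaging import version

def needs_tf32_reenable(torch_version: str) -> bool:
    # Mirrors the proposed gate: True for any release after 1.11,
    # where PyTorch is expected to default allow_tf32 to False.
    return version.parse(torch_version) > version.parse("1.11")

# Note: 1.11.x patch releases also compare greater than "1.11",
# but re-enabling there is harmless since the default is still True.
print(needs_tf32_reenable("1.12.0"))  # True
print(needs_tf32_reenable("1.11.0"))  # False ("1.11.0" == "1.11")
```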
The question is whether this should be done:
- Not at all - let the user sort it out
- Transformers-wide
- Only in the HF Trainer (and Accelerate); and if it isn't done there, add a new flag to let the user control the behavior
Additionally, other usage modes should be kept in sync:
- PyTorch/XLA (some other flag?)
Currently tf32 and how to flip it on/off is documented here: https://huggingface.co/docs/transformers/performance#tf32
A detailed discussion with multiple links to other related resources is here: https://dev-discuss.pytorch.org/t/pytorch-and-tensorfloat32/504
Issue analytics: created a year ago · Reactions: 1 · Comments: 15 (10 by maintainers)
This is very complicated: on the one hand, we don't want to change the PyTorch default and surprise the user, but on the other hand, we don't want most of our beginner users to experience degraded training performance on most GPUs without knowing why (as this change will be hidden in the PyTorch release notes).
I'm also in favor of not touching PyTorch's default (the same way we don't turn on things like `torch.backends.cudnn.benchmark` or `torch.backends.cudnn.deterministic`) and leaving it to the user, but we do need proper documentation. I'm also in favor of having a `TrainingArguments` flag to make it easier for the user to turn on in our examples.

Small point of clarification: we have not changed the default to `False` at this time, but expect to do so in the future.
Agreed! This is the principle that motivated this change.
We will also have user-facing documentation beyond the release notes when this change lands in a PyTorch release, because we agree it has the potential to be surprising and disruptive to current Ampere users. We'll also provide a recommendation for developers when making this change in the nightlies.