No speedup using inductor backend
See original GitHub issue
Describe the bug
With a simplified example like the following, no performance gain is observed when using dynamo:
```python
import time

import torch
import torch._dynamo as dynamo

BATCHES, NUM_OF_HEADS, HEAD_SIZE, MAX_SEQ_LEN = 1, 12, 64, 1024
D_MODEL = NUM_OF_HEADS * HEAD_SIZE

@dynamo.optimize("inductor")
def torch_impl(hidden_states, attn_qkvw, attn_qkvb):
    qkv = torch.matmul(hidden_states, attn_qkvw) + attn_qkvb
    assert qkv.is_contiguous()
    return qkv

def validate():
    hidden_states = torch.normal(0, 1, size=(BATCHES, MAX_SEQ_LEN, D_MODEL), device='cuda', dtype=torch.float16)
    attn_qkvw = torch.normal(0, 1, size=(D_MODEL, D_MODEL * 3), device='cuda', dtype=torch.float16)
    attn_qkvb = torch.normal(0, 1, size=(D_MODEL * 3,), device='cuda', dtype=torch.float16)
    torch_impl(hidden_states, attn_qkvw, attn_qkvb)  # warm-up call to trigger compilation
    t = time.time()
    for _ in range(100000):
        output_torch = torch_impl(hidden_states, attn_qkvw, attn_qkvb)
    print(time.time() - t)
    print(output_torch)
    print(output_torch.shape)

torch.manual_seed(0)
validate()
```
I have Triton installed on my V100 machine and have validated that Triton works as expected. However, with TorchDynamo it seems to me that Triton isn't being used to accelerate that matmul operation.
With dynamo optimization, the running time increased from 7 seconds to 13 seconds.
Error logs
No errors encountered, but performance is bad. Here's the output from my side:
```
13.064171552658081
tensor([[[-3.3398e+00,  1.0947e+00,  3.1922e+01,  ..., -2.7172e+01,  1.2508e+01,  4.1504e-02],
         [-5.0562e+01,  1.9312e+01, -2.9344e+01,  ..., -4.1836e+00, -2.4531e+01,  1.4695e+01],
         [-9.2920e-01, -1.5734e+01,  2.9375e+01,  ..., -1.3977e+01,  6.9297e+00, -7.6484e+00],
         ...,
         [ 5.6125e+01, -1.6625e+01, -2.4094e+01,  ...,  2.8266e+01, -5.9781e+01, -5.2281e+01],
         [-5.3164e+00, -2.4766e+01,  4.9180e+00,  ...,  3.2930e+00,  2.9500e+01,  3.9238e+00],
         [-6.7500e+01,  1.8406e+01,  2.4500e+01,  ..., -1.1336e+01, -1.5219e+01, -3.8281e+00]]],
       device='cuda:0', dtype=torch.float16)
torch.Size([1, 1024, 2304])
```
Minified repro
No response
Issue Analytics
- State:
- Created a year ago
- Comments:5 (4 by maintainers)
Top GitHub Comments
@yidoe For microbenchmarking, one thing you need to watch out for is disabling cudagraphs.
```python
torch._inductor.config.triton.cudagraphs = False
```
Cudagraphs are generally profitable for larger graphs, but they induce an extra copy of the input, which distorts microbenchmarking.
In general, though, I wouldn't expect substantial acceleration from just having a single matmul. With a single matmul, cuBLAS (i.e., the matmul library that PyTorch uses under the hood) is pretty good, and there's not a lot of opportunity to speed it up.
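A separate pitfall in the benchmark above: CUDA kernel launches are asynchronous, so timing a loop with `time.time()` alone can measure launch overhead rather than kernel execution. A minimal sketch of a fairer microbenchmark helper (the `bench` function and its parameters are illustrative, not a PyTorch API; it synchronizes the device before and after the timed region when CUDA is available, and falls back to CPU otherwise):

```python
import time

import torch

def bench(fn, *args, warmup=10, iters=100):
    # Warm up first so one-time costs (compilation, caching) are excluded.
    for _ in range(warmup):
        fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # drain pending kernels before starting the clock
    t0 = time.time()
    for _ in range(iters):
        fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for async CUDA work to actually finish
    return (time.time() - t0) / iters

# Small CPU example; on a GPU box, move the tensors to 'cuda' first.
x = torch.randn(256, 256)
w = torch.randn(256, 256)
avg = bench(torch.matmul, x, w)
print(f"avg matmul time: {avg * 1e6:.1f} us")
```

For more robust numbers, `torch.utils.benchmark.Timer` handles warm-up and synchronization automatically.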