
No speedup using inductor backend

See original GitHub issue

šŸ› Describe the bug

With a simplified example like the following, no performance gain is observed when using dynamo:

import time
import torch
import torch._dynamo as dynamo


BATCHES, NUM_OF_HEADS, HEAD_SIZE, MAX_SEQ_LEN = 1, 12, 64, 1024
D_MODEL = NUM_OF_HEADS * HEAD_SIZE


@dynamo.optimize("inductor")
def torch_impl(hidden_states, attn_qkvw, attn_qkvb):
    qkv = torch.matmul(hidden_states, attn_qkvw) + attn_qkvb
    
    assert qkv.is_contiguous()
    return qkv
    
    
def validate():
    hidden_states = torch.normal(0, 1, size=(BATCHES, MAX_SEQ_LEN, D_MODEL), device='cuda', dtype=torch.float16)
    attn_qkvw = torch.normal(0, 1, size=(D_MODEL, D_MODEL * 3), device='cuda', dtype=torch.float16)
    attn_qkvb = torch.normal(0, 1, size=(D_MODEL * 3,), device='cuda', dtype=torch.float16)
    
    torch_impl(hidden_states, attn_qkvw, attn_qkvb)
    t = time.time()
    
    for _ in range(100000):
        output_torch = torch_impl(hidden_states, attn_qkvw, attn_qkvb)
    
    print(time.time() - t)
    print(output_torch)
    print(output_torch.shape)
    
    
torch.manual_seed(0)
validate()

I have Triton installed on my V100 machine and have validated that Triton works as expected. However, with TorchDynamo, it seems that Triton isn't being used to accelerate that matmul operation.

With dynamo optimization, the running time increased from 7 seconds to 13 seconds.
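One thing worth noting about the repro above: it measures wall-clock time without calling `torch.cuda.synchronize()`, and CUDA kernel launches are asynchronous, so the reported numbers may not reflect actual GPU time. A minimal sketch of a safer harness (the `benchmark` helper is hypothetical, not part of the original repro; the example runs on CPU tensors so it works anywhere):

```python
import time
import torch


def benchmark(fn, *args, iters=100):
    """Time fn over iters calls, synchronizing the GPU so that
    asynchronous kernel launches are included in the wall time."""
    # Warm-up call; for a dynamo-wrapped function this also triggers
    # compilation so it isn't counted in the timed loop.
    fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(iters):
        out = fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return time.time() - t0, out


# CPU example so the sketch is runnable without a GPU:
x = torch.randn(64, 64)
elapsed, out = benchmark(torch.matmul, x, x, iters=10)
```

Without the synchronize calls, `time.time()` can return before the queued kernels have finished, under-reporting the eager baseline in particular.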

Error logs

No errors encountered but performance is bad. Here's the output from my side:

13.064171552658081
tensor([[[-3.3398e+00,  1.0947e+00,  3.1922e+01,  ..., -2.7172e+01,  1.2508e+01,  4.1504e-02],
         [-5.0562e+01,  1.9312e+01, -2.9344e+01,  ..., -4.1836e+00, -2.4531e+01,  1.4695e+01],
         [-9.2920e-01, -1.5734e+01,  2.9375e+01,  ..., -1.3977e+01,  6.9297e+00, -7.6484e+00],
         ...,
         [ 5.6125e+01, -1.6625e+01, -2.4094e+01,  ...,  2.8266e+01, -5.9781e+01, -5.2281e+01],
         [-5.3164e+00, -2.4766e+01,  4.9180e+00,  ...,  3.2930e+00,  2.9500e+01,  3.9238e+00],
         [-6.7500e+01,  1.8406e+01,  2.4500e+01,  ..., -1.1336e+01, -1.5219e+01, -3.8281e+00]]],
       device='cuda:0', dtype=torch.float16)
torch.Size([1, 1024, 2304])

Minified repro

No response

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

Chillee commented, Nov 7, 2022 (1 reaction)

@yidoe For microbenchmarking, one thing you need to watch out for is disabling cudagraphs. torch._inductor.config.triton.cudagraphs = False.

Cudagraphs is generally profitable for larger graphs, but it induces an extra copy of the input, which distorts microbenchmarking.
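For reference, the comment above translates into a short config fragment. Note this is a sketch against the PyTorch nightly builds of late 2022 that the issue was filed with; inductor config names have moved between releases, so treat the exact attribute path as an assumption to verify against your version:

```python
import torch._inductor.config

# Disable CUDA graphs in the inductor backend so microbenchmarks
# are not distorted by the extra input copy cudagraphs introduces.
torch._inductor.config.triton.cudagraphs = False
```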

Chillee commented, Nov 6, 2022 (1 reaction)

In general, though, I wouldn't expect substantial acceleration from just having a single matmul. Generally, with a single matmul, CuBLAS (i.e. the matmul library that PyTorch uses under the hood) is pretty good, and there's not a lot of opportunity to speed it up.
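To illustrate the point above: where inductor tends to pay off is not the matmul itself (which already dispatches to a tuned cuBLAS kernel) but a pointwise tail that can be fused into fewer kernels. A minimal sketch, with a hypothetical function name; the activation and scaling are illustrative, not from the original repro:

```python
import torch
import torch.nn.functional as F


def matmul_with_pointwise_tail(x, w, b):
    # The matmul dispatches to cuBLAS; a compiler has little to add there.
    # The pointwise tail (bias add, activation, scaling) is what a backend
    # like inductor can fuse into one kernel, saving memory round-trips.
    y = torch.matmul(x, w) + b
    return F.gelu(y) * 0.5


x = torch.randn(8, 16)
w = torch.randn(16, 32)
b = torch.randn(32)
out = matmul_with_pointwise_tail(x, w, b)
# On a CUDA machine one would compile this (PyTorch >= 2.0) with
# torch.compile(matmul_with_pointwise_tail) and compare against eager.
```

A bare `matmul + bias`, as in the original repro, leaves almost no fusion opportunity, which is consistent with the lack of speedup reported.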
