No speedup using inductor backend
See original GitHub issue
Describe the bug
With a simplified example like the following, no performance gain is observed when using dynamo:
```python
import time

import torch
import torch._dynamo as dynamo

BATCHES, NUM_OF_HEADS, HEAD_SIZE, MAX_SEQ_LEN = 1, 12, 64, 1024
D_MODEL = NUM_OF_HEADS * HEAD_SIZE

@dynamo.optimize("inductor")
def torch_impl(hidden_states, attn_qkvw, attn_qkvb):
    qkv = torch.matmul(hidden_states, attn_qkvw) + attn_qkvb
    assert qkv.is_contiguous()
    return qkv

def validate():
    hidden_states = torch.normal(0, 1, size=(BATCHES, MAX_SEQ_LEN, D_MODEL), device='cuda', dtype=torch.float16)
    attn_qkvw = torch.normal(0, 1, size=(D_MODEL, D_MODEL * 3), device='cuda', dtype=torch.float16)
    attn_qkvb = torch.normal(0, 1, size=(D_MODEL * 3,), device='cuda', dtype=torch.float16)
    torch_impl(hidden_states, attn_qkvw, attn_qkvb)  # warm-up call to trigger compilation
    t = time.time()
    for _ in range(100000):
        output_torch = torch_impl(hidden_states, attn_qkvw, attn_qkvb)
    print(time.time() - t)
    print(output_torch)
    print(output_torch.shape)

torch.manual_seed(0)
validate()
```
I have Triton installed on my V100 machine and have validated that Triton works as expected. However, with TorchDynamo it seems to me that Triton isn't being used to accelerate that matmul operation.
With dynamo optimization, the running time increased from 7 seconds to 13 seconds.
Error logs
No errors encountered, but performance is bad. Here's the output from my side:
```
13.064171552658081
tensor([[[-3.3398e+00,  1.0947e+00,  3.1922e+01,  ..., -2.7172e+01,  1.2508e+01,  4.1504e-02],
         [-5.0562e+01,  1.9312e+01, -2.9344e+01,  ..., -4.1836e+00, -2.4531e+01,  1.4695e+01],
         [-9.2920e-01, -1.5734e+01,  2.9375e+01,  ..., -1.3977e+01,  6.9297e+00, -7.6484e+00],
         ...,
         [ 5.6125e+01, -1.6625e+01, -2.4094e+01,  ...,  2.8266e+01, -5.9781e+01, -5.2281e+01],
         [-5.3164e+00, -2.4766e+01,  4.9180e+00,  ...,  3.2930e+00,  2.9500e+01,  3.9238e+00],
         [-6.7500e+01,  1.8406e+01,  2.4500e+01,  ..., -1.1336e+01, -1.5219e+01, -3.8281e+00]]],
       device='cuda:0', dtype=torch.float16)
torch.Size([1, 1024, 2304])
```
Minified repro
No response
Issue Analytics
- State:
- Created a year ago
- Comments:5 (4 by maintainers)
Top GitHub Comments
@yidoe For microbenchmarking, one thing you need to watch out for is disabling cudagraphs.
```python
torch._inductor.config.triton.cudagraphs = False
```
Cudagraphs are generally profitable for larger graphs, but they induce an extra copy of the input, which distorts microbenchmarking.
In general, though, I wouldn't expect substantial acceleration from just having a single matmul. With a single matmul, cuBLAS (i.e., the matmul library that PyTorch uses under the hood) is pretty good, and there's not a lot of opportunity to speed it up.
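A separate pitfall in the benchmark above: CUDA kernel launches are asynchronous, so timing a loop with `time.time()` alone can measure launch overhead rather than kernel execution. A minimal sketch of a fairer microbenchmark helper (the `bench` function and its parameters are illustrative, not a PyTorch API; it synchronizes the device before and after the timed region when CUDA is available, and falls back to CPU otherwise):

```python
import time

import torch

def bench(fn, *args, warmup=10, iters=100):
    # Warm up first so one-time costs (compilation, caching) are excluded.
    for _ in range(warmup):
        fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # drain pending kernels before starting the clock
    t0 = time.time()
    for _ in range(iters):
        fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for async CUDA work to actually finish
    return (time.time() - t0) / iters

# Small CPU example; on a GPU box, move the tensors to 'cuda' first.
x = torch.randn(256, 256)
w = torch.randn(256, 256)
avg = bench(torch.matmul, x, w)
print(f"avg matmul time: {avg * 1e6:.1f} us")
```

For more robust numbers, `torch.utils.benchmark.Timer` handles warm-up and synchronization automatically.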