
Inductor - resnet18 - large batch size - CUDA error: an illegal memory access was encountered

See original GitHub issue

🐛 Describe the bug

Repro:

import torch
import torch._dynamo
import torch._inductor
from torch._inductor import config
import logging
from torchvision import models

resnet18 = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

batch_size = 4096
device = "cuda"

resnet18 = resnet18.eval().to(device)
opt_resnet18 = torch._dynamo.optimize("inductor")(resnet18)

input = torch.randn((batch_size, 3, 224, 224)).to(device)
output = opt_resnet18(input)
print(output.shape)

This only happens when the batch size is large.
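
For context on why the batch size matters: at batch_size = 4096, the first conv/BN/ReLU activation of ResNet-18 has shape (4096, 64, 112, 112), i.e. more than 2^31 - 1 elements, which is the regime the maintainer comment below points at (32-bit indexing in the generated Triton kernel). A quick, purely illustrative arithmetic check (not part of the original report):

# Element count of the first ResNet-18 activation at batch size 4096.
INT32_MAX = 2**31 - 1              # 2147483647
numel = 4096 * 64 * 112 * 112      # 3288334336 -- matches xnumel in the
                                   # generated Triton kernel quoted below
print(numel > INT32_MAX)           # True: 32-bit index arithmetic overflows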

Error logs

Traceback (most recent call last):
  File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/utils.py", line 458, in preserve_rng_state
    yield
  File "/scratch/ybliang/work/repos/pytorch/torch/_inductor/compile_fx.py", line 202, in run
    compiled_fn = cudagraphify_impl(model, new_inputs, static_input_idxs)
  File "/scratch/ybliang/work/repos/pytorch/torch/_inductor/compile_fx.py", line 257, in cudagraphify_impl
    model(list(static_inputs))
  File "/tmp/torchinductor_ybliang/7q/c7qimro7rryowl6fbgxobggppym6ux4mwk4x5htmdqso66ydxlb3.py", line 691, in call
    buf3 = empty_strided((4096, 64, 56, 56), (200704, 3136, 56, 1), device='cuda', dtype=torch.int64)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/scratch/ybliang/work/repos/pytorch/debug/debug5.py", line 35, in <module>
    output = opt_resnet18(input)
  File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/eval_frame.py", line 138, in __call__
    return self.forward(*args, **kwargs)
  File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/eval_frame.py", line 135, in forward
    return optimized_forward(*args, **kwargs)
  File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/eval_frame.py", line 166, in _fn
    return fn(*args, **kwargs)
  File "/scratch/ybliang/work/repos/torchvision/torchvision/models/resnet.py", line 284, in forward
    def forward(self, x: Tensor) -> Tensor:
  File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/eval_frame.py", line 166, in _fn
    return fn(*args, **kwargs)
  File "/scratch/ybliang/work/repos/pytorch/functorch/_src/aot_autograd.py", line 870, in forward
    return compiled_f(
  File "/scratch/ybliang/work/repos/pytorch/functorch/_src/aot_autograd.py", line 861, in new_func
    return compiled_fn(args)
  File "/scratch/ybliang/work/repos/pytorch/functorch/_src/aot_autograd.py", line 230, in g
    return f(*args)
  File "/scratch/ybliang/work/repos/pytorch/functorch/_src/aot_autograd.py", line 489, in compiled_function
    return CompiledFunction.apply(*remove_dupe_args(args))
  File "/scratch/ybliang/work/repos/pytorch/functorch/_src/aot_autograd.py", line 450, in forward
    fw_outs = call_func_with_args(
  File "/scratch/ybliang/work/repos/pytorch/functorch/_src/aot_autograd.py", line 255, in call_func_with_args
    out = normalize_as_list(f(args))
  File "/scratch/ybliang/work/repos/pytorch/torch/_inductor/compile_fx.py", line 185, in run
    return model(new_inputs)
  File "/scratch/ybliang/work/repos/pytorch/torch/_inductor/compile_fx.py", line 202, in run
    compiled_fn = cudagraphify_impl(model, new_inputs, static_input_idxs)
  File "/scratch/ybliang/work/env/lib/python3.9/contextlib.py", line 137, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/utils.py", line 462, in preserve_rng_state
    torch.cuda.set_rng_state(cuda_rng)
  File "/scratch/ybliang/work/repos/pytorch/torch/cuda/random.py", line 64, in set_rng_state
    _lazy_call(cb)
  File "/scratch/ybliang/work/repos/pytorch/torch/cuda/__init__.py", line 176, in _lazy_call
    callable()
  File "/scratch/ybliang/work/repos/pytorch/torch/cuda/random.py", line 62, in cb
    default_generator.set_state(new_state_copy)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
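
Because CUDA reports the illegal access asynchronously, the frame shown in the traceback is not necessarily where the faulting kernel was launched. As the error message suggests, setting CUDA_LAUNCH_BLOCKING=1 before CUDA is initialized makes kernel launches synchronous, so the error surfaces at the offending launch. A minimal sketch of doing this from Python (the variable can equally be exported in the shell before running the repro):

import os

# Must be set before CUDA is initialized; setting it before importing torch
# is the safest place.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # noqa: E402
# ... then run the repro from the bug description; the illegal memory access
# should now be reported at the kernel launch that actually faults.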

Minified repro

No response

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

1 reaction
ngimel commented on Nov 2, 2022

No, this is a real IMA (illegal memory access): when there are more than INT_MAX elements, Triton doesn't generate correct indexing. Small(er) repro:

from ctypes import c_void_p, c_long
import torch
import random
from torch import empty_strided, as_strided, device
from torch._inductor.codecache import AsyncCompile

aten = torch.ops.aten
assert_size_stride = torch._C._dynamo.guards.assert_size_stride
async_compile = AsyncCompile()

import triton
import triton.language as tl
from torch._inductor.triton_ops.autotune import grid
from torch._C import _cuda_getCurrentRawStream as get_cuda_stream


kernel0 = async_compile.triton('''
import triton
import triton.language as tl
from torch._inductor.ir import ReductionHint
from torch._inductor.triton_ops.autotune import pointwise
from torch._inductor.utils import instance_descriptor

@pointwise(size_hints=[4294967296], filename=__file__, meta={'signature': {0: '*fp32', 1: '*fp32', 2: '*fp32', 3: '*fp32', 4: '*fp32', 5: '*fp32', 6: 'u32'}, 'device': 0, 'constants': {}, 'configs': [instance_descriptor(divisible_by_16=(0, 1, 2, 3, 4, 5, 6), equal_to_1=())]})
@triton.jit
def kernel(in_ptr0, in_ptr1, in_ptr2, in_ptr3, in_ptr4, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 3288334336
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.reshape(tl.arange(0, XBLOCK), [XBLOCK])
    xmask = xindex < xnumel
    x3 = xindex
    x1 = (xindex // 12544) % 64
    tmp0 = tl.load(in_ptr0 + (x3), xmask)
    tmp1 = tl.load(in_ptr1 + (x1), xmask)
    tmp3 = tl.load(in_ptr2 + (x1), xmask)
    tmp11 = tl.load(in_ptr3 + (x1), xmask)
    tmp13 = tl.load(in_ptr4 + (x1), xmask)
    tmp2 = tmp0 - tmp1
    tmp4 = 1e-05
    tmp5 = tmp3 + tmp4
    tmp6 = tl.sqrt(tmp5)
    tmp7 = 1 / tmp6
    tmp8 = 1
    tmp9 = tmp7 * tmp8
    tmp10 = tmp2 * tmp9
    tmp12 = tmp10 * tmp11
    tmp14 = tmp12 + tmp13
    tmp15 = tl.maximum(0, tmp14)
    tl.store(out_ptr0 + (x3 + tl.zeros([XBLOCK], tl.int32)), tmp15, xmask)
''')




async_compile.wait(globals())
del async_compile

def call(args):
    primals_1, primals_2, primals_3, primals_63, primals_64, primals_123 = args
    args.clear()
    buf0 = aten.convolution(primals_123, primals_1, None, (2, 2), (3, 3), (1, 1), False, (0, 0), 1)
    assert_size_stride(buf0, (4096, 64, 112, 112), (802816, 12544, 112, 1))
    buf1 = empty_strided((4096, 64, 112, 112), (802816, 12544, 112, 1), device='cuda', dtype=torch.float32)
    stream0 = get_cuda_stream(0)
    kernel0.run(buf0, primals_63, primals_64, primals_2, primals_3, buf1, 3288334336, grid=grid(3288334336), stream=stream0)
    return (buf1,)


if __name__ == "__main__":
    from torch._dynamo.testing import rand_strided
    from torch._inductor.utils import print_performance
    primals_1 = rand_strided((64, 3, 7, 7), (147, 49, 7, 1), device='cuda', dtype=torch.float32)
    primals_2 = rand_strided((64, ), (1, ), device='cuda', dtype=torch.float32)
    primals_3 = rand_strided((64, ), (1, ), device='cuda', dtype=torch.float32)
    primals_63 = rand_strided((64, ), (1, ), device='cuda', dtype=torch.float32)
    primals_64 = rand_strided((64, ), (1, ), device='cuda', dtype=torch.float32)
    primals_123 = rand_strided((4096, 3, 224, 224), (150528, 50176, 224, 1), device='cuda', dtype=torch.float32)
    print_performance(lambda: call([primals_1, primals_2, primals_3,  primals_63, primals_64,  primals_123]))

Note: we have an incorrect type annotation for xnumel here, but even after I make it i64 I still get an IMA.
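
The overflow is in the index arithmetic itself: tl.program_id and tl.arange produce int32 values, so for xnumel = 3288334336 the computed offsets can wrap around and the loads/stores land out of bounds. Below is a minimal, hypothetical sketch (not the actual Inductor fix; the kernel and helper names are made up for illustration) of a standalone pointwise Triton kernel that widens the index math to int64 by hand:

import torch
import triton
import triton.language as tl


@triton.jit
def relu_kernel_i64(in_ptr, out_ptr, xnumel, XBLOCK: tl.constexpr):
    # Widen the offset math to int64 so indices past 2**31 - 1 don't wrap.
    xoffset = tl.program_id(0).to(tl.int64) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK).to(tl.int64)
    xmask = xindex < xnumel
    x = tl.load(in_ptr + xindex, mask=xmask)
    tl.store(out_ptr + xindex, tl.maximum(x, 0.0), mask=xmask)


def relu_i64(x):
    out = torch.empty_like(x)
    xnumel = x.numel()
    XBLOCK = 1024
    grid = (triton.cdiv(xnumel, XBLOCK),)
    relu_kernel_i64[grid](x, out, xnumel, XBLOCK=XBLOCK)
    return out


if __name__ == "__main__":
    x = torch.randn(1 << 20, device="cuda")
    print(torch.allclose(relu_i64(x), torch.relu(x)))

This is only meant to illustrate that the per-element offsets need 64-bit arithmetic once the element count exceeds INT_MAX; it is not how Inductor emits its kernels.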

1 reaction
eellison commented on Nov 1, 2022

I’ll be posting a couple of resnet18 memory fixes for non-cudagraphs later today.

Read more comments on GitHub.

