
TorchInductor: "an illegal memory access" once and forever

See original GitHub issue

🐛 Describe the bug

I hit this while trying to reproduce #1778. I'm not sure if it's exactly the same issue, so I'm opening a new one.

Repro:

from typing import List
import torch
import torch._dynamo
import torch._inductor
from torch._inductor import config
import logging
from torchvision import models
import math

# torch._dynamo.config.log_level = logging.DEBUG
# torch._dynamo.config.verbose = True
# torch._inductor.config.debug = True

def convert_size(size_bytes):
    # Pretty-print a byte count as B/KB/MB/... for the memory reports below.
    if size_bytes == 0:
        return "0B"
    size_name = ("B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB")
    i = int(math.floor(math.log(size_bytes, 1024)))
    p = math.pow(1024, i)
    s = round(size_bytes / p, 2)
    return "%s %s" % (s, size_name[i])

resnet18 = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

batch_size = 4096
# batch_size = 1024
device = "cuda"

resnet18 = resnet18.eval().to(device)
opt_resnet18 = torch._dynamo.optimize("inductor")(resnet18)
# opt_resnet18 = resnet18

# Probe progressively smaller batch sizes, printing allocated CUDA memory
# before and after each attempt.
count = 0
while batch_size >= 500 and count < 5:
    try:
        print("batch size = ", batch_size)
        print("start: ", convert_size(torch.cuda.memory_allocated()))
        input = torch.randn((batch_size, 3, 224, 224)).to(device)
        output = opt_resnet18(input)
        print(output.shape)
    except RuntimeError as e:
        print(e)
        print("in runtime error: ", convert_size(torch.cuda.memory_allocated()))

    print("end: ", convert_size(torch.cuda.memory_allocated()))
    count += 1
    batch_size = int(batch_size / 2)

When running native PyTorch, you get:

batch size =  4096
start:  44.69 MB
CUDA out of memory. Tried to allocate 3.06 GiB (GPU 0; 39.41 GiB total capacity; 36.03 GiB already allocated; 1.71 GiB free; 36.05 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
in runtime error:  36.03 GB
end:  2.34 GB
batch size =  2048
start:  2.34 GB
CUDA out of memory. Tried to allocate 392.00 MiB (GPU 0; 39.41 GiB total capacity; 37.56 GiB already allocated; 180.50 MiB free; 37.58 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
in runtime error:  37.56 GB
end:  1.19 GB
batch size =  1024
start:  1.19 GB
torch.Size([1024, 1000])
end:  21.2 GB
batch size =  512
start:  21.2 GB
torch.Size([512, 1000])
end:  10.63 GB

When running Dynamo + Inductor, you get:

batch size =  4096
start:  44.69 MB
CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
in runtime error:  32.2 GB
end:  2.34 GB
batch size =  2048
start:  2.34 GB
CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
in runtime error:  2.34 GB
end:  2.34 GB
batch size =  1024
start:  2.34 GB
CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
in runtime error:  2.34 GB
end:  2.34 GB
batch size =  512
start:  2.34 GB
CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
in runtime error:  2.34 GB
end:  2.34 GB

With native PyTorch, batch size = 1024 is the first batch size that runs without error during the search. But with Inductor, it keeps failing even when the batch size is 1024 or less. I think we are reusing the same generated Triton code, which doesn't change as the input shape changes.
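One way to test that hypothesis (a minimal sketch, not part of the original repro): clear Dynamo's compilation caches after the 4096-batch failure and compile a fresh copy at batch size 1024. If only the cached kernel was stale, this should succeed the way native PyTorch does; if the first failure already corrupted the CUDA context, even the recompiled run may still raise an illegal-memory-access error, and only a process restart recovers.

# Hypothesis check (sketch): recompile from a clean cache at the smaller batch size.
torch._dynamo.reset()                        # drop cached Dynamo/Inductor graphs and kernels
fresh_resnet18 = torch._dynamo.optimize("inductor")(resnet18)
x = torch.randn((1024, 3, 224, 224), device=device)
print(fresh_resnet18(x).shape)               # expect torch.Size([1024, 1000]) if only the cached kernel was stale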

Error logs

No response

Minified repro

No response

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

1 reaction
yanboliang commented, Nov 1, 2022

Cool, then I think we should catch OOM and reset env at Inductor level to prevent “OOM once and OOM forever”.
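Until such a guard exists inside Inductor, the same idea can be approximated from user code. The sketch below is an assumed caller-side workaround, not the proposed Inductor-level fix, and run_with_backoff is a hypothetical helper name: catch the failure, reset Dynamo's caches and the CUDA caching allocator, and retry with a smaller batch.

import torch
import torch._dynamo

def run_with_backoff(opt_model, batch_size, min_batch=500, device="cuda"):
    # Retry at progressively smaller batch sizes; after each failure, drop
    # compiled graphs so the next attempt recompiles for the new shape.
    while batch_size >= min_batch:
        try:
            x = torch.randn((batch_size, 3, 224, 224), device=device)
            return opt_model(x)
        except RuntimeError as e:            # CUDA OOM (and, here, the illegal access) surfaces as RuntimeError
            print(f"batch {batch_size} failed: {e}")
            torch._dynamo.reset()            # discard kernels compiled for the failed shape
            torch.cuda.empty_cache()         # return cached allocator blocks to the driver
            batch_size //= 2
    return None

This only helps while the failure is a plain OOM; once an illegal memory access has corrupted the CUDA context, an in-process reset typically cannot bring it back, which is why catching the OOM before it turns into an illegal access, as the comment suggests, is the better place for the fix.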

1 reaction
ngimel commented, Nov 1, 2022

OOM is a recoverable error.

Read more comments on GitHub >

Top Results From Across the Web

Index computations are done in int32 even for large tensors ...
TorchInductor: "an illegal memory access" once and forever pytorch/torchdynamo#1819. Open. @Jokeren added the bug label 2 weeks ago.

CUDA error: an illegal memory access was encountered
Hi, all. I am getting a weird illegal memory access error whenever I try to train a FasterRCNN model with an image size...

PyTorch CUDA error: an illegal memory access was ...
It was partially said by the answer of the OP, but the problem under the hood with illegal memory access is that the...

"ILLEGAL MEMORY ACCESS" - Daz 3D Forums
I'm seriously thinking about uninstall daz once and forever and buy poser. Could anyone explain this? (P.S. video cards are just unboxed.)

Cuda illegal memory access (kokkos) multiple MPI per GPU
I have encountered cuda illegal memory access (lib kokkos) when using multiple MPI per GPU. With KOKKOS, you should have only one MPI rank...
