Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

FakeCUDAModule has no attribute shfl_down_sync

See original GitHub issue

Reporting a bug

I have tried using the latest released version of Numba (most recent is visible in the change log (https://github.com/numba/numba/blob/main/CHANGE_LOG).
I have included a self contained code sample to reproduce the problem. i.e. it’s possible to run as ‘python bug.py’.

Hi ! I’m trying to reproduce the code presented in this article using Numba: https://developer.nvidia.com/blog/faster-parallel-reductions-kepler/ In short, they use warp level intrinsics (__shfl_down) to perform reductions. I am aware of the Reduce helper (https://github.com/numba/numba/blob/main/numba/cuda/kernels/reduction.py) but it does not seems to use this instruction at first glance / it is interesting to try and implement it by myself either way (except if I’m missing some concept I understood this function is the most performant for this task). A first issue I found is that there is no __shfl_down in Numba, but rather shfl_down_sync to which I understand passing cuda.activemask() as first argument allows us to falls back to the original __shfl_down.

A second issue arise when I want to compile my kernel for regular use, it throws the error numba.cuda.cudadrv.driver.CudaAPIError: [1] Call to cuLaunchKernel results in CUDA_ERROR_INVALID_VALUE Here, I try to set the env. variable NUMBA_ENABLE_CUDASIM to True, but it then throws the following error: AttributeError: tid=[302, 0, 0] ctaid=[0, 0, 0]: 'FakeCUDAModule' object has no attribute 'shfl_down_sync'

I managed to reproduce this error with the following snippet:

import os
os.environ['NUMBA_ENABLE_CUDASIM'] = '1'
import torch
from numba import cuda
bs = 8
ch = 3
who = 8
input = torch.rand(bs, ch, who, who, ch, who, who)
@cuda.jit
def shfl(val):
        offset = _WARPSIZE // 2
        while offset:
            val += cuda.shfl_down_sync(cuda.activemask(), val, offset)
            offset //= 2
        return val
shfl[1, 1024](input)

My Numba version is 0.56.2. Let me know if I can provide further details !

Issue Analytics

State:
Created a year ago
Comments:5 (1 by maintainers)

Top GitHub Comments

1reaction

guilhermeleobascommented, Sep 10, 2022

I can reproduce the error now. I had to set _WARPSIZE = 32 to make the code compilable.

The function you’re trying to execute is indeed not support in the CUDA simulator yet as per the docs says:

Warps and warp-level operations are not yet implemented.

Outside the CUDA simulator, the code fails because cuda.shfl_down_sync(mask, value, delta) expects value to be either an integer or a float value. But in your case, you’re trying to use a torch tensor. Below is the code I was executing

import torch
from numba import cuda
bs = 8
ch = 3
who = 8
input = torch.rand(bs, ch, who, who, ch, who, who, device="cuda")

_WARPSIZE = 32

@cuda.jit
def shfl(val):
        offset = _WARPSIZE // 2
        while offset:
            val += cuda.shfl_down_sync(cuda.activemask(), val, offset)
            offset //= 2
        return val
shfl[1, 1024](cuda.as_cuda_array(input))

0reactions

stuartarchibaldcommented, Sep 12, 2022

@gmarkall any thoughts on whether, under the CUDA simulator, unsupported CUDA API calls ought to raise NotImplementedError or perhaps numba.errors.UnsupportedError with some error message specific to the simulator? Example:

from numba import cuda
import numpy as np


@cuda.jit
def foo(x):
    cuda.shfl_down_sync(cuda.activemask(), x, 2)


foo[1, 1](np.ones(4))

Currently gives:

$ NUMBA_ENABLE_CUDASIM=1 python issue8433.py 
...
issue8433.py", line 7, in foo
    cuda.shfl_down_sync(cuda.activemask(), x, 2)
AttributeError: tid=[0, 0, 0] ctaid=[0, 0, 0]: 'FakeCUDAModule' object has no attribute 'shfl_down_sync'