FakeCUDAModule has no attribute shfl_down_sync
Reporting a bug
- I have tried using the latest released version of Numba (most recent is visible in the change log: https://github.com/numba/numba/blob/main/CHANGE_LOG).
- I have included a self-contained code sample to reproduce the problem, i.e. it's possible to run as 'python bug.py'.
Hi!
I'm trying to reproduce the code presented in this article using Numba: https://developer.nvidia.com/blog/faster-parallel-reductions-kepler/
In short, it uses warp-level intrinsics (__shfl_down) to perform reductions.
I am aware of the Reduce helper (https://github.com/numba/numba/blob/main/numba/cuda/kernels/reduction.py), but at first glance it does not seem to use this instruction, and it is interesting to try implementing it myself either way (unless I'm missing some concept, I understood this function is the most performant for this task).
A first issue I found is that there is no __shfl_down in Numba, but rather shfl_down_sync, to which I understand passing cuda.activemask() as the first argument falls back to the original __shfl_down behaviour.
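For reference, a minimal sketch of such a warp-level reduction as a Numba device function, assuming scalar per-lane values and a full 32-lane warp (the name warp_reduce and the hard-coded full-warp mask are my choices, not from the article):

import numpy as np
from numba import cuda

@cuda.jit(device=True)
def warp_reduce(val):
    # Halve the shuffle offset each step: 16, 8, 4, 2, 1 for a 32-lane warp.
    offset = 16
    while offset > 0:
        # Add the value held by the lane `offset` positions further down the warp.
        val += cuda.shfl_down_sync(0xFFFFFFFF, val, offset)
        offset //= 2
    # Lane 0 now holds the sum of the whole warp's values.
    return val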
A second issue arises when I compile my kernel for regular use: it throws the error numba.cuda.cudadrv.driver.CudaAPIError: [1] Call to cuLaunchKernel results in CUDA_ERROR_INVALID_VALUE
To debug this, I set the environment variable NUMBA_ENABLE_CUDASIM, but it then throws the following error:
AttributeError: tid=[302, 0, 0] ctaid=[0, 0, 0]: 'FakeCUDAModule' object has no attribute 'shfl_down_sync'
I managed to reproduce this error with the following snippet:
import os
os.environ['NUMBA_ENABLE_CUDASIM'] = '1'  # must be set before importing numba.cuda

import torch
from numba import cuda

_WARPSIZE = 32  # warp size on current NVIDIA GPUs; missing from the original snippet

bs = 8
ch = 3
who = 8
input = torch.rand(bs, ch, who, who, ch, who, who)

@cuda.jit
def shfl(val):
    offset = _WARPSIZE // 2
    while offset:
        val += cuda.shfl_down_sync(cuda.activemask(), val, offset)
        offset //= 2

shfl[1, 1024](input)
My Numba version is 0.56.2. Let me know if I can provide further details!

I can reproduce the error now. I had to set _WARPSIZE = 32 to make the code compilable. The function you're trying to execute is indeed not supported in the CUDA simulator yet, as the docs say.
Outside the CUDA simulator, the code fails because cuda.shfl_down_sync(mask, value, delta) expects value to be either an integer or a float. But in your case, you're trying to use a torch tensor. Below is the code I was executing.
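To illustrate the point about scalar values, here is a minimal sketch of the corrected usage being described: each thread reads a scalar element out of an array and shuffles that, rather than passing the whole tensor as the shuffled value. This is my own sketch under those assumptions, not the exact code from this comment; the warp_sum name and the numpy input are mine:

import numpy as np
from numba import cuda

@cuda.jit
def warp_sum(arr, out):
    i = cuda.grid(1)
    # Each lane shuffles its own scalar element, not the whole array.
    val = arr[i] if i < arr.size else 0.0
    offset = 16
    while offset > 0:
        val += cuda.shfl_down_sync(0xFFFFFFFF, val, offset)
        offset //= 2
    # After the loop, lane 0 of the warp holds the sum.
    if i == 0:
        out[0] = val

arr = np.arange(32, dtype=np.float64)
out = np.zeros(1, dtype=np.float64)
warp_sum[1, 32](arr, out)
print(out[0])  # expected: 496.0 (sum of 0..31)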
@gmarkall any thoughts on whether, under the CUDA simulator, unsupported CUDA API calls ought to raise NotImplementedError or perhaps numba.errors.UnsupportedError with some error message specific to the simulator?
Example:
Currently gives:
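As a purely hypothetical sketch of the second option (the actual example and output from this comment are not shown above), assuming the missing attribute is caught in FakeCUDAModule's attribute lookup:

from numba.errors import UnsupportedError

class FakeCUDAModule:
    # ... existing simulated attributes ...

    def __getattr__(self, name):
        # Hypothetical: raised for intrinsics the simulator does not
        # implement, e.g. shfl_down_sync.
        raise UnsupportedError(
            "cuda.%s is not supported in the CUDA simulator" % name
        )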