
No implementation of CUDA shared.array error

See original GitHub issue

Reporting a bug

  • I have tried using the latest released version of Numba (the most recent is visible in the change log: https://github.com/numba/numba/blob/main/CHANGE_LOG).
  • I have included a self-contained code sample to reproduce the problem, i.e. it is possible to run as 'python bug.py'.

Problem description

When trying to initialize a 7-dimensional shared array, the Numba compiler throws a typing error.

MWE

import math

import numba
import torch
from numba import cuda


@cuda.jit
def op_numba_c(input):
    # This is the line the error points at: the shape tuple is built from
    # runtime values (cuda.blockDim.x and input.shape[...]), not constants.
    sharedI = cuda.shared.array(shape=(cuda.blockDim.x, numba.int32(input.shape[1]), numba.int32(input.shape[2]), numba.int32(input.shape[3]), numba.int32(input.shape[4]), numba.int32(input.shape[5]), numba.int32(input.shape[6])), dtype=numba.float32)


if __name__ == '__main__':
    device = torch.device('cuda:0')
    input = torch.rand(2, 3, 7, 7, 3, 7, 7).to(device)

    def t2nb(ten):
        return numba.cuda.as_cuda_array(ten)

    # TODO: fiddling with threads
    threadsperblock = (16, 16)
    blockspergrid_x = math.ceil(input.shape[0] / threadsperblock[0])
    blockspergrid_y = math.ceil(input.shape[1] / threadsperblock[1])
    blockspergrid = (blockspergrid_x, blockspergrid_y)

    i_2 = input.clone()
    op_numba_c[blockspergrid, threadsperblock](t2nb(i_2))

Std out

Traceback removed for clarity

numba.core.errors.TypingError: Failed in cuda mode pipeline (step: nopython frontend)
No implementation of function Function(<function shared.array at 0x7fd3668320d0>) found for signature:
 
 >>> array(shape=UniTuple(int32 x 7), dtype=class(float32))
 
There are 2 candidate implementations:
  - Of which 2 did not match due to:
  Overload of function 'array': File: numba/cuda/cudadecl.py: Line 46.
    With argument(s): '(shape=UniTuple(int32 x 7), dtype=class(float32))':
   No match.

During: resolving callee type: Function(<function shared.array at 0x7fd3668320d0>)
During: typing of call at /home/sebastien/workspace/MemSE/MemSE/nn/op/numba_test.py (9)


File "MemSE/nn/op/numba_test.py", line 9:
def op_numba_c(input):
    sharedI = cuda.shared.array(shape=(cuda.blockDim.x, numba.int32(input.shape[1]), numba.int32(input.shape[2]), numba.int32(input.shape[3]), numba.int32(input.shape[4]), numba.int32(input.shape[5]), numba.int32(input.shape[6])), dtype=numba.float32)

Let me know if I can provide further detail !
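For context (an editor's sketch, not part of the original report): cuda.shared.array only accepts shapes that are compile-time constants, so a tuple built from cuda.blockDim.x and input.shape cannot be typed. The hypothetical variant below uses a literal shape and does compile; the shape is deliberately truncated to (2, 3, 7, 7), since the full 7-dimensional shape would not fit in shared memory anyway (see the maintainer's comment below).

import numba
import numpy as np
from numba import cuda

@cuda.jit
def op_numba_static(input):
    # Literal shape -> known at compile time, so the call types successfully.
    # (2, 3, 7, 7) is a deliberately small stand-in for the full
    # 2 x 3 x 7 x 7 x 3 x 7 x 7 shape, which would exceed shared memory.
    sharedI = cuda.shared.array(shape=(2, 3, 7, 7), dtype=numba.float32)
    if cuda.grid(1) == 0:
        sharedI[0, 0, 0, 0] = input[0, 0, 0, 0, 0, 0, 0]

if __name__ == '__main__':
    x = cuda.to_device(np.random.rand(2, 3, 7, 7, 3, 7, 7).astype(np.float32))
    op_numba_static[1, 32](x)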

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 6 (1 by maintainers)

Top GitHub Comments

2 reactions
gmarkall commented, Aug 1, 2022

Dynamic shared memory is needed here - however, for a multi-dimensional array we need to implement reshape, which is tracked by issue #7528. I would like to get that resolved by the next release of Numba.

A couple of other points on the source:

  • You shouldn’t need to call as_cuda_array() on a Torch tensor - you can pass Torch tensors directly to Numba kernels (Numba internally calls as_cuda_array() on it anyway).
  • The shape you’re using looks a bit big for shared memory anyway - I think it will use about 168K of shared memory, which is quite a lot. See Table 15 in the Compute Capabilities section of the CUDA Programming Guide - SMs have between 48KB and 164KB maximum shared memory, and you want to enable several blocks to be resident at once, so ideally your shared memory usage should be low enough that several blocks’ worth of shared memory fit into the max available.

It might be worth posting a bit about what you’re trying to implement and asking for suggestions on https://numba.discourse.group if you have more questions about what to do here - I think using shared memory may not be a good fit for your actual use case.
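A rough illustration of the dynamic shared memory route mentioned above (an editor's sketch, not from the thread; the kernel, names and sizes are invented, and it assumes a Numba version that supports dynamic shared memory): the shared array is declared with size 0 and the number of bytes is passed as the fourth element of the launch configuration. Until the reshape work tracked in #7528 lands, only a 1-D view is available. For scale, the shape in the report holds 2 × 3 × 7 × 7 × 3 × 7 × 7 = 43,218 float32 values ≈ 169 KiB, consistent with the ~168K estimate above.

import numpy as np
from numba import cuda, float32

@cuda.jit
def scale_with_dyn_smem(arr):
    # Size 0 declares dynamically allocated shared memory; the actual byte
    # count comes from the launch configuration below, and only a 1-D view
    # is available (reshape is tracked in #7528).
    smem = cuda.shared.array(0, dtype=float32)
    i = cuda.grid(1)
    tx = cuda.threadIdx.x
    # Stage each element through dynamic shared memory, then write it back doubled.
    if i < arr.size:
        smem[tx] = arr[i]
    cuda.syncthreads()
    if i < arr.size:
        arr[i] = smem[tx] * 2.0

if __name__ == '__main__':
    threads = 128
    a = np.arange(1024, dtype=np.float32)
    d_a = cuda.to_device(a)
    blocks = (a.size + threads - 1) // threads
    # Launch config: [blocks, threads, stream, dynamic shared memory bytes].
    scale_with_dyn_smem[blocks, threads, 0, threads * 4](d_a)
    print(d_a.copy_to_host()[:4])

As noted in the comment above, a Torch CUDA tensor could be passed in place of d_a directly, since Numba converts it internally.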

1 reaction
guilhermeleobas commented, Aug 2, 2022

Thanks for the input, Graham. I’ll close this issue in favor of the discourse topic.
