
No implementation of CUDA shared.array error

See original GitHub issue

Reporting a bug

  • I have tried using the latest released version of Numba (the most recent is visible in the change log: https://github.com/numba/numba/blob/main/CHANGE_LOG).
  • I have included a self-contained code sample to reproduce the problem, i.e. it is possible to run as 'python bug.py'.

Problem description

When trying to initialize a 7-dimensional shared array, the Numba compiler throws a typing error.

MWE

import math

import numba
import torch
from numba import cuda


@cuda.jit
def op_numba_c(input):
    # This is the line the error points at: the shape tuple is built from
    # runtime values (cuda.blockDim.x and input.shape[...]), not constants.
    sharedI = cuda.shared.array(shape=(cuda.blockDim.x, numba.int32(input.shape[1]), numba.int32(input.shape[2]), numba.int32(input.shape[3]), numba.int32(input.shape[4]), numba.int32(input.shape[5]), numba.int32(input.shape[6])), dtype=numba.float32)


if __name__ == '__main__':
    device = torch.device('cuda:0')
    input = torch.rand(2, 3, 7, 7, 3, 7, 7).to(device)

    def t2nb(ten):
        return numba.cuda.as_cuda_array(ten)

    # TODO: fiddling with threads
    threadsperblock = (16, 16)
    blockspergrid_x = math.ceil(input.shape[0] / threadsperblock[0])
    blockspergrid_y = math.ceil(input.shape[1] / threadsperblock[1])
    blockspergrid = (blockspergrid_x, blockspergrid_y)

    i_2 = input.clone()
    op_numba_c[blockspergrid, threadsperblock](t2nb(i_2))

Std out

Traceback removed for clarity

numba.core.errors.TypingError: Failed in cuda mode pipeline (step: nopython frontend)
No implementation of function Function(<function shared.array at 0x7fd3668320d0>) found for signature:
 
 >>> array(shape=UniTuple(int32 x 7), dtype=class(float32))
 
There are 2 candidate implementations:
  - Of which 2 did not match due to:
  Overload of function 'array': File: numba/cuda/cudadecl.py: Line 46.
    With argument(s): '(shape=UniTuple(int32 x 7), dtype=class(float32))':
   No match.

During: resolving callee type: Function(<function shared.array at 0x7fd3668320d0>)
During: typing of call at /home/sebastien/workspace/MemSE/MemSE/nn/op/numba_test.py (9)


File "MemSE/nn/op/numba_test.py", line 9:
def op_numba_c(input):
    sharedI = cuda.shared.array(shape=(cuda.blockDim.x, numba.int32(input.shape[1]), numba.int32(input.shape[2]), numba.int32(input.shape[3]), numba.int32(input.shape[4]), numba.int32(input.shape[5]), numba.int32(input.shape[6])), dtype=numba.float32)

Let me know if I can provide further detail !
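For context (an editor's sketch, not part of the original report): cuda.shared.array only accepts shapes that are compile-time constants, so a tuple built from cuda.blockDim.x and input.shape cannot be typed. The hypothetical variant below uses a literal shape and does compile; the shape is deliberately truncated to (2, 3, 7, 7), since the full 7-dimensional shape would not fit in shared memory anyway (see the maintainer's comment below).

import numba
import numpy as np
from numba import cuda

@cuda.jit
def op_numba_static(input):
    # Literal shape -> known at compile time, so the call types successfully.
    # (2, 3, 7, 7) is a deliberately small stand-in for the full
    # 2 x 3 x 7 x 7 x 3 x 7 x 7 shape, which would exceed shared memory.
    sharedI = cuda.shared.array(shape=(2, 3, 7, 7), dtype=numba.float32)
    if cuda.grid(1) == 0:
        sharedI[0, 0, 0, 0] = input[0, 0, 0, 0, 0, 0, 0]

if __name__ == '__main__':
    x = cuda.to_device(np.random.rand(2, 3, 7, 7, 3, 7, 7).astype(np.float32))
    op_numba_static[1, 32](x)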

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 6 (1 by maintainers)

Top GitHub Comments

2 reactions
gmarkall commented, Aug 1, 2022

Dynamic shared memory is needed here - however, for a multi-dimensional array we need to implement reshape, which is tracked by issue #7528. I would like to get that resolved by the next release of Numba.

A couple of other points on the source:

  • You shouldn’t need to call as_cuda_array() on a Torch tensor - you can pass Torch tensors directly to Numba kernels (Numba internally calls as_cuda_array() on it anyway).
  • The shape you’re using looks a bit big for shared memory anyway - I think it will use about 168K of shared memory, which is quite a lot. See Table 15 in the Compute Capabilities section of the CUDA Programming Guide - SMs have between 48KB and 164KB maximum shared memory, and you want to enable several blocks to be resident at once, so ideally your shared memory usage should be low enough that several blocks’ worth of shared memory fit into the max available.

It might be worth posting a bit about what you’re trying to implement and asking for suggestions on https://numba.discourse.group if you have more questions about what to do here - I think using shared memory may not be a good fit for your actual use case.
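A rough illustration of the dynamic shared memory route mentioned above (an editor's sketch, not from the thread; the kernel, names and sizes are invented, and it assumes a Numba version that supports dynamic shared memory): the shared array is declared with size 0 and the number of bytes is passed as the fourth element of the launch configuration. Until the reshape work tracked in #7528 lands, only a 1-D view is available. For scale, the shape in the report holds 2 × 3 × 7 × 7 × 3 × 7 × 7 = 43,218 float32 values ≈ 169 KiB, consistent with the ~168K estimate above.

import numpy as np
from numba import cuda, float32

@cuda.jit
def scale_with_dyn_smem(arr):
    # Size 0 declares dynamically allocated shared memory; the actual byte
    # count comes from the launch configuration below, and only a 1-D view
    # is available (reshape is tracked in #7528).
    smem = cuda.shared.array(0, dtype=float32)
    i = cuda.grid(1)
    tx = cuda.threadIdx.x
    # Stage each element through dynamic shared memory, then write it back doubled.
    if i < arr.size:
        smem[tx] = arr[i]
    cuda.syncthreads()
    if i < arr.size:
        arr[i] = smem[tx] * 2.0

if __name__ == '__main__':
    threads = 128
    a = np.arange(1024, dtype=np.float32)
    d_a = cuda.to_device(a)
    blocks = (a.size + threads - 1) // threads
    # Launch config: [blocks, threads, stream, dynamic shared memory bytes].
    scale_with_dyn_smem[blocks, threads, 0, threads * 4](d_a)
    print(d_a.copy_to_host()[:4])

As noted in the comment above, a Torch CUDA tensor could be passed in place of d_a directly, since Numba converts it internally.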

1 reaction
guilhermeleobas commented, Aug 2, 2022

Thanks for the input, Graham. I’ll close this issue in favor of the discourse topic.
