Launch CUDA kernels from cfuncs

Feature request

I would like a nopython @cfunc to be able to launch a @cuda.jit kernel.

import numba
import numba.cuda
import numba.types as nt

@numba.cuda.jit(nt.void(nt.CPointer(nt.float32), nt.CPointer(nt.float32)))
def kernel_gpu(input, output):
    i = numba.cuda.grid(1)
    output[i] = input[i] + 1

# The launcher signature follows XLA’s GPU custom-call convention:
# void(CUstream, void** buffers, const char* opaque, size_t opaque_len).
@numba.cfunc(nt.void(nt.voidptr, nt.CPointer(nt.voidptr), nt.CPointer(nt.char), nt.ulong))
def kernel_launcher(stream, buffers, opaque, opaque_len):
    # some_calculation stands in for deriving the launch configuration
    # from the opaque payload.
    blockspergrid, threadsperblock = some_calculation(opaque, opaque_len)
    kernel_gpu[blockspergrid, threadsperblock, stream](buffers[0], buffers[1])

I’ve been doing some experiments in JAX issue 1870, which is about allowing Numba CUDA kernels to be consumed by JAX’s XLA JIT. Most of the pieces are there to put this together, but launching CUDA kernels is proving problematic.
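For context, registering such a launcher with XLA would look roughly like the sketch below. This reflects my understanding of the xla_client API at the time (XLA takes custom-call targets as PyCapsules wrapping a raw function pointer); the target name is illustrative.

import ctypes
from jax.lib import xla_client

# Wrap the cfunc’s raw function pointer in a PyCapsule, which is the
# form xla_client expects for a custom-call target.
PyCapsule_New = ctypes.pythonapi.PyCapsule_New
PyCapsule_New.restype = ctypes.py_object
PyCapsule_New.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_void_p]

capsule = PyCapsule_New(kernel_launcher.address,
                        b"xla._CUSTOM_CALL_TARGET", None)
xla_client.register_custom_call_target(b"numba_kernel_launcher",
                                       capsule, platform="gpu")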

We can already integrate Numba CPU kernels; it’s simply a matter of creating a @cfunc with the right signature and patching it into the JAX API.
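For the CPU case that looks roughly like the following, assuming XLA’s CPU custom-call convention of void(void* out, void** ins); the fixed element count is purely illustrative (in practice it comes from the operand shapes).

import numpy as np
import numba
from numba import carray
import numba.types as nt

N = 16  # element count, fixed here for brevity

@numba.cfunc(nt.void(nt.voidptr, nt.CPointer(nt.voidptr)))
def kernel_cpu(out_ptr, in_ptrs):
    # carray turns the raw pointers into typed array views; a dtype is
    # required because the pointers are untyped.
    inp = carray(in_ptrs[0], (N,), dtype=np.float32)
    out = carray(out_ptr, (N,), dtype=np.float32)
    for i in range(N):
        out[i] = inp[i] + 1.0

kernel_cpu.address is then the function pointer that gets registered with XLA as above, with platform="cpu".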

The most basic thing we would need to be able to do to launch a Numba CUDA kernel is get its handle. I found kernel_gpu[1, 10]._func.get().handle, which I thought might be a sufficient (if hacky) way of doing that, but it turns out it broke somewhere between Numba 0.48 and 0.52. It’s clearly not a public API, so JAX shouldn’t consume it, nor whatever the 0.52 equivalent is.
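To make concrete what consuming a handle entails, this is roughly the driver-API wrangling the JAX side would need to do with it. This is a ctypes sketch assuming pointer-only kernel arguments; handle stands in for whatever CUfunction the internals yield.

import ctypes

cuda = ctypes.CDLL("libcuda.so")

def launch(handle, grid, block, stream, dev_ptrs):
    # kernelParams is a void**: an array of pointers, each pointing at
    # one argument’s storage (pointer-to-pointer for buffer arguments).
    storage = [ctypes.c_void_p(p) for p in dev_ptrs]
    params = (ctypes.c_void_p * len(storage))(
        *[ctypes.addressof(s) for s in storage])
    result = cuda.cuLaunchKernel(
        ctypes.c_void_p(handle),
        grid, 1, 1,              # gridDim x, y, z
        block, 1, 1,             # blockDim x, y, z
        0,                       # dynamic shared memory bytes
        ctypes.c_void_p(stream),
        params,
        None)                    # "extra" launch options
    if result != 0:              # CUDA_SUCCESS == 0
        raise RuntimeError("cuLaunchKernel error %d" % result)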

I believe this means that the JAX feature cannot happen without a Numba change to expose an appropriate API.

If a @cfunc were able to launch a CUDA kernel then all of this handle wrangling could be avoided. Both the CPU and GPU kernels could be written as simple @cfuncs; the GPU ones would simply launch a CUDA kernel. This would allow JAX to neatly integrate with Numba without having to write CUDA code to handle launching.

@seibert mentioned on the mailing list that this is a feature Numba would like to implement, and that it may also have the advantage of reducing the overhead of repeated kernel launches.

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 9 (5 by maintainers)

Top GitHub Comments

1 reaction
stuartarchibald commented, Dec 2, 2020

> That all looks great! I look forward to having this supported officially and will continue hacking away at internals in the meantime because it’s fun.

Great, let us know how you get on!

> I think that, whilst unfortunate in terms of time, the “standardised” approach is to enable the above; anything else would just be relying on Numba internals and would be unlikely to be safe or friendly for users/library authors.

> Do you think that it might at least make sense to provide a public API for getting the kernel handle? Say, for example, the result of @cuda.jit could have a get_handle()? @cfunc provides something similar, so it might be justifiable.

Probably a question for @gmarkall as code owner/maintainer for the CUDA target.

> I understand that’s probably not worth doing anytime soon, and the proper thing for me to do is just build against a specific version until this is all supported properly.

> One thing I noticed is that you took stream out of the invocation in your example. Did you omit that for a reason? I believe that using custom streams will be required for JAX integration.

No reason, I just forgot to type it 😃
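For clarity, the API floated above would look something like this in use; get_handle() is hypothetical and does not exist in Numba.

# Hypothetical public accessor, replacing pokes at ._func.get().handle.
configured = kernel_gpu[blockspergrid, threadsperblock, stream]
handle = configured.get_handle()  # a CUfunction to pass to cuLaunchKernel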
