
Ability to pass structured types by value as kernel parameters


Currently, it is not possible to pass structures by value as kernel arguments. Only a handful of fixed-size numpy dtypes are currently hardcoded in cupy/cuda/function.pyx, which means a CuPy kernel launch only succeeds with numpy.generic arguments, i.e. scalar types with an itemsize of 1, 2, 4, 8 or 16 bytes. This lets us pass numpy data types from np.int8 up to np.complex128 directly by value as kernel arguments. I would like to extend pass-by-value to any numpy dtype.
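For comparison, here is a minimal sketch (the kernel and sizes are illustrative) of the pass-by-value behaviour that already works, since plain numpy scalars fall into the hardcoded itemsize cases:

import numpy as np
import cupy as cp

# Plain numpy scalars are accepted by value because their itemsize
# (4 bytes for np.float32, 8 for np.uint64, ...) is already handled
# in cupy/cuda/function.pyx.
scale = cp.RawKernel(r'''
    extern "C" __global__
    void scale(float* out, const float factor, const unsigned long long n) {
        unsigned long long i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] *= factor;
    }
''', 'scale')

n = 1024
out = cp.arange(n, dtype=cp.float32)
scale((n // 256,), (256,), (out, np.float32(2.0), np.uint64(n)))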

I am using RawKernels extensively and would like to be able to pass CUDA builtin vector types such as float3 as kernel parameters. This would make multidimensional code in physics much clearer, for example by replacing many scalar kernel arguments with vectorized arguments or static arrays of scalars. Depending on device compute capability we can now pass up to 4 KB of parameters to CUDA kernels, and doing it in 16-byte chunks is not very efficient.

The definition of the structure layout using numpy.dtype should be the responsibility of the user. This of course implies defining the struct with the same offsets, alignment and itemsize on the host and on the device. pyopencl and pycuda offer utility functions to convert a device-side struct to a numpy dtype with a matching layout, but that feature would be overkill for now.
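To make that responsibility concrete, here is a hedged sketch (the Particle struct is purely illustrative) of a numpy dtype built with explicit offsets and itemsize so that it mirrors a hypothetical device-side struct byte for byte:

import numpy as np

# Hypothetical device-side struct:
#     struct Particle {
#         float3 position;  // offset 0,  12 bytes, alignment 4
#         float  mass;      // offset 12,  4 bytes
#         double energy;    // offset 16,  8 bytes, alignment 8
#     };                    // sizeof == 24, alignment 8
particle = np.dtype({
    'names':    ['position', 'mass', 'energy'],
    'formats':  [(np.float32, (3,)), np.float32, np.float64],
    'offsets':  [0, 12, 16],
    'itemsize': 24,
})
assert particle.itemsize == 24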

Please note that this feature is not related to general structured array support in CuPy. It is, however, already possible to pass a pointer to any array of structures to a RawKernel: simply cast the numpy array to the np.int8 data type and let CuPy handle np.int8 CUDA arrays. Within the kernel, the pointer will be cast to the kernel argument type anyway. This of course only works if the host and device structure layouts match exactly.
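For instance, a minimal sketch of this int8 workaround (the float3_t struct and the init kernel are illustrative):

import numpy as np
import cupy as cp

float3 = np.dtype({'names': ['x', 'y', 'z'], 'formats': [np.float32] * 3})
h_points = np.zeros(128, dtype=float3)

# View the structured array as raw bytes so that CuPy accepts it; the
# kernel receives a plain pointer that it casts back to the struct type.
d_points = cp.asarray(h_points.view(np.int8))

kernel = cp.RawKernel(r'''
    struct float3_t { float x, y, z; };   // must match the host layout
    extern "C" __global__
    void init(char* raw, const unsigned long long n) {
        float3_t* points = reinterpret_cast<float3_t*>(raw);
        unsigned long long i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            points[i].x = 1.0f; points[i].y = 2.0f; points[i].z = 3.0f;
        }
    }
''', 'init')
kernel((1,), (128,), (d_points, np.uint64(128)))

h_points = cp.asnumpy(d_points).view(float3)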

Here is a simple example illustrating what I would like to be able to do:

import numpy as np
import cupy as cp

code = r"""
    template <typename T>
    __global__ void affect(T* const __restrict__ output,
                         const T arg0, const T arg1,
                         const size_t nelements) {
        ptrdiff_t i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= nelements)
            return;
        output[i] = (i%2 == 0 ? arg0 : arg1);
    }
"""
module = cp.RawModule(code=code,
                      name_expressions=('affect<float3>',),
                      options=('-std=c++11',))
kernel = module.get_function('affect<float3>')

float3 = np.dtype( { 'names': ['x', 'y', 'z'],
                     'formats': [np.float32]*3 } )

# 1-element structured arrays that should be accepted by value
arg0 = np.random.rand(3).astype(np.float32).view(float3)
arg1 = np.random.rand(3).astype(np.float32).view(float3)

N = 512
h_output = np.empty(dtype=float3, shape=N)
d_output = cp.asarray(h_output.view(np.float32))

# Passing the structured arg0 and arg1 by value here is the requested feature
grid = (N,)
block = (256,)
args = (d_output, arg0, arg1, np.uint64(N))
kernel(grid, block, args)

h_output[0::2] = arg0
h_output[1::2] = arg1

np.testing.assert_array_equal(h_output,
                              cp.asnumpy(d_output).view(float3))

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 2
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

1 reaction
leofang commented, Mar 26, 2021

Sorry for my late reply!

I am not sure that your example with void* and MemoryPointer is 100% equivalent.

It is, from the perspective of having a pointer to a contiguous chunk of memory, so

If I understand this correctly you first need to allocate GPU memory with cupy.cuda.memory.alloc, wrap it in a cupy.cuda.MemoryPointer, find some way to initialize this memory on device side with a RawKernel or by raw host->device copy from some initialized host buffer with matching memory layout,

up to here it is correct, with one catch: we don’t care about the memory layout, as it’s really just filling up the memory, and so

change your kernel arguments to pointer to required type and then dereference those pointers in your actual code:

    template <typename T>
    __global__ void affect(T* const __restrict__ output,
                         const T* arg0, const T* arg1,  // <-- void* can be replaced by T* here
                         const size_t nelements)

this is not needed. We really just need to pass a void*, and then cast it to whatever is needed:

template <typename T>
__global__ void my_ker(void* a, ...) {
    T* b = (T*)a;
    ....
}
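Putting the two halves of this exchange together, a hedged end-to-end sketch of the void* workaround could look as follows (the float3_t struct and the affect kernel are illustrative, not CuPy API):

import ctypes
import numpy as np
import cupy as cp

float3 = np.dtype({'names': ['x', 'y', 'z'], 'formats': [np.float32] * 3})

# Allocate a raw chunk of device memory and fill it from an initialized
# host buffer with a matching layout.
arg0_host = np.random.rand(3).astype(np.float32).view(float3)
arg0_dev = cp.cuda.alloc(arg0_host.nbytes)   # MemoryPointer
arg0_dev.copy_from_host(arg0_host.ctypes.data_as(ctypes.c_void_p),
                        arg0_host.nbytes)

code = r'''
    struct float3_t { float x, y, z; };
    extern "C" __global__
    void affect(float3_t* const __restrict__ output,
                const void* arg0, const unsigned long long n) {
        const float3_t value = *reinterpret_cast<const float3_t*>(arg0);
        unsigned long long i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            output[i] = value;
    }
'''
kernel = cp.RawKernel(code, 'affect')

n = 512
d_output = cp.zeros(n * 3, dtype=cp.float32)   # backing storage for n float3_t
# The MemoryPointer is passed where the kernel expects a void*.
kernel((n // 256,), (256,), (d_output, arg0_dev, np.uint64(n)))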

I was not aware of MemoryPointer; this indeed makes it possible with the current version of the library and opens the way for general structured array support in CuPy. In order for this to work, CuPy memory allocators will have to support custom alignment constraints when allocating arrays of custom types.

Yes, I think your ongoing PR #4829 opens this door! 🙂

I am however not aware of host-side __constant__ memory management in cupy (cudaGetSymbolAddress, cudaGetSymbolSize, cudaMemcpyToSymbol, cudaMemcpyFromSymbol).

I am not sure why, but most likely you don’t need them. Suppose you declare __constant__ memory in cupy.RawModule; you can then use get_global() to retrieve the pointer: https://github.com/cupy/cupy/blob/cd2550f188be264086ff0d0420d706f6e0460f5a/cupy/core/raw.pyx#L499 And that is actually all you need to make it work. I think in this regard CUDA has some redundant API design; presumably those calls are residuals from older CUDA.
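For the record, a hedged sketch of that pattern (the coeffs symbol and the apply kernel are illustrative):

import ctypes
import numpy as np
import cupy as cp

code = r'''
    __constant__ float coeffs[4];
    extern "C" __global__
    void apply(float* out, const unsigned long long n) {
        unsigned long long i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] *= coeffs[i % 4];
    }
'''
module = cp.RawModule(code=code)

# get_global() returns a MemoryPointer to the __constant__ symbol, so a
# plain host->device copy takes the place of cudaMemcpyToSymbol.
host_coeffs = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)
coeffs_ptr = module.get_global('coeffs')
coeffs_ptr.copy_from_host(host_coeffs.ctypes.data_as(ctypes.c_void_p),
                          host_coeffs.nbytes)

out = cp.ones(1024, dtype=cp.float32)
module.get_function('apply')((4,), (256,), (out, np.uint64(1024)))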

In PyOpenCL, there is a dictionary that tracks device-side type names and the corresponding numpy dtypes. At module initialization, this dictionary is populated with the scalar and vector builtins. A user can register a new type with get_or_register_dtype(typename, dtype). The most interesting part happens in match_dtype_to_c_struct, where you provide a typename and a numpy dtype as input. It then generates the corresponding C structure declaration (OpenCL code) and calls a kernel to determine the device-side struct memory layout (see cupy/tests/cupy_tests/core_tests/test_function.py::test_custom_user_struct in my PR). As output you get the generated struct code as well as a device-compatible numpy dtype.

I think that such a type dictionary should be device dependent, even for CUDA. I am not sure about this, but I imagine that devices with different compute capabilities may impose different structure memory layouts. If you point me to where I could implement this in CuPy, I may have time to implement such a feature later this month.

Thanks for the pointers. I took a closer look at the PyOpenCL utilities and I think they are the right tools to add! 👍 But supporting this may require some thought (we likely need to determine the scope of work first). I am a bit occupied lately, so please allow me to revisit this at a later time… The core devs might also shed some light here?
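For context, a hedged sketch of how those PyOpenCL utilities are typically used (requires an available OpenCL platform; the particle type name and fields are illustrative):

import numpy as np
import pyopencl as cl
import pyopencl.tools

ctx = cl.create_some_context()
device = ctx.devices[0]

# Candidate host-side layout for a hypothetical device struct "particle".
particle = np.dtype([('x', np.float32), ('y', np.float32),
                     ('z', np.float32), ('mass', np.float32)])

# match_dtype_to_c_struct compiles a probe kernel to query the device-side
# offsets; it returns the adjusted dtype and the OpenCL struct declaration.
particle, c_decl = cl.tools.match_dtype_to_c_struct(device, 'particle', particle)
particle = cl.tools.get_or_register_dtype('particle', particle)
print(c_decl)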

0 reactions
keckj commented, Mar 27, 2021

this is not needed. We really just need to pass a void*, and then cast it to whatever is needed:

I just meant that it is easier to introduce the underlying type of the global array directly in the arguments of the function; there is no need to define an additional pointer variable and manually cast the input pointer later on. Or maybe I am missing something.

I am not sure why, but most likely you don’t need them. Suppose you declare __constant__ memory in cupy.RawModule; you can then use get_global() to retrieve the pointer:

Yes, you made me realize that you can cudaMemcpy directly to the symbol if you know its size and obtain the pointer with get_global (https://stackoverflow.com/questions/15984913/cudamemcpytosymbol-vs-cudamemcpy-why-is-it-still-around-cudamemcpytosymbol/15985093).

Yes, I think your ongoing PR #4829 opens this door! 🙂

When do you think this will get merged?

But supporting this may require some thought (we likely need to determine the scope of work first). I am a bit occupied lately, so please allow me to revisit this at a later time… The core devs might also shed some light here?

Basically you need to implement a global mapping that tracks (dtype, C type) pairs, populated with the scalar and vector types at module initialization, a CUDA struct code generator working from a numpy structured dtype, and a kernel code generator to query member offsets; a rough sketch follows. Of course this requires some planning, so let's get a core dev's point of view on the subject.
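To make that roadmap slightly more concrete, here is a hedged sketch (names and helpers are hypothetical, not CuPy API) of the first two pieces, the (dtype, C type) registry and a CUDA struct generator driven by a numpy structured dtype; the offset-probing kernel generator is left out:

import numpy as np

# Hypothetical registry mapping device-side type names to numpy dtypes,
# pre-populated with scalar builtins at import time.
_DTYPE_REGISTRY = {
    'float': np.dtype(np.float32),
    'double': np.dtype(np.float64),
    'int': np.dtype(np.int32),
}

_C_SCALARS = {np.dtype(np.float32): 'float',
              np.dtype(np.float64): 'double',
              np.dtype(np.int32): 'int',
              np.dtype(np.int64): 'long long'}

def register_dtype(typename, dtype):
    """Register a user-defined struct type (hypothetical helper)."""
    _DTYPE_REGISTRY[typename] = np.dtype(dtype)

def dtype_to_cuda_struct(typename, dtype):
    """Generate a CUDA struct declaration following the dtype's field order.

    The offsets the compiler picks still have to be verified against the
    numpy dtype by an offset-probing kernel (not shown), as PyOpenCL does.
    """
    dtype = np.dtype(dtype)
    lines = ['struct %s {' % typename]
    for name in dtype.names:
        field_dtype = dtype.fields[name][0]
        lines.append('    %s %s;' % (_C_SCALARS[field_dtype], name))
    lines.append('};')
    return '\n'.join(lines)

float3_like = np.dtype([('x', np.float32), ('y', np.float32), ('z', np.float32)])
register_dtype('float3_like', float3_like)
print(dtype_to_cuda_struct('float3_like', float3_like))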
