documenting cupy.cuda.function.Module?

I was working with an Nvidia engineer, @laytonjb, to help a group of scientists migrate their Python codebase to GPU, and he suggested using CuPy. After some experiments I realized that CuPy provides almost the same functionality as pycuda.driver.module_from_file, namely loading a precompiled cubin (CUDA binary) and grabbing the kernels in it. We wonder why this great feature is not documented at all (I hope we didn’t miss anything!). IMHO this is a huge attraction for PyCUDA users (like me), and for anyone who needs the flexibility of occasionally working with low-level CUDA kernels. Several issues related to JIT compilation (such as #1258, #1398, and more recently #1655) could have been less urgent if this had been documented in the first place.

For people who are looking for this feature, below are the steps that worked perfectly for us. Suppose we have a file named cupy_mod.cu defined as follows:

extern "C"{ //avoid C++ name mangling 
//C=A*B, so Ay=Bx
__global__ void mat_mul(double * A, double * B, double * C, int Ax, int Bx, int By) {
   /* implementation goes here */
   }

/* other kernels */
}

Then the steps to take are:

1. Compile the .cu file to a .cubin (CUDA binary) with nvcc -arch=sm_XX -cubin -o cupy_mod.cubin cupy_mod.cu
2. Load it in Python:

import cupy as cp

# create a Module object in python
mod = cp.cuda.function.Module()

# load the cubin
mod.load_file("/path/to/cupy_mod.cubin")

# fetch the kernel to make it a Python function object
mat_mul_cp = mod.get_function("mat_mul")

# declare A, B, C as 2D cupy arrays of dtype cp.float64
# be sure they are C contiguous arrays!
# ...omitted...

# call the function with a tuple of grid size, a tuple of block size, and a tuple of all arguments required by the kernel
# unused grid and block dimensions are set to 1, since a CUDA launch dimension cannot be 0
# if the kernel requires shared memory, append `shared_mem=n_bytes` to the function call
mat_mul_cp(
    ((A.shape[1] + 128 - 1) // 128, 1, 1),  # grid size
    (128, 1, 1),                            # block size
    (A, B, C, cp.int32(A.shape[0]), cp.int32(B.shape[0]), cp.int32(B.shape[1])),
)
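The two mechanical parts of the recipe above, building the nvcc command line and working out the launch dimensions, can be sketched as small pure-Python helpers. This is only an illustration: the helper names nvcc_cubin_command and launch_config_1d are hypothetical, and sm_70 is an assumed placeholder for the sm_XX value that matches your GPU.

```python
import shlex


def nvcc_cubin_command(source, output, arch):
    """Build the nvcc command from the compile step as an argument list.

    `arch` is the sm_XX string for the target GPU (caller-supplied).
    """
    return ["nvcc", f"-arch={arch}", "-cubin", "-o", output, source]


def launch_config_1d(n_items, threads_per_block=128):
    """Return (grid, block) 3-tuples for a 1D launch covering n_items.

    The block count is rounded up so the last, partially filled block
    is still launched; unused dimensions are set to 1.
    """
    if n_items <= 0 or threads_per_block <= 0:
        raise ValueError("n_items and threads_per_block must be positive")
    n_blocks = (n_items + threads_per_block - 1) // threads_per_block
    return (n_blocks, 1, 1), (threads_per_block, 1, 1)


if __name__ == "__main__":
    # "sm_70" is an assumed example architecture, not taken from the issue.
    cmd = nvcc_cubin_command("cupy_mod.cu", "cupy_mod.cubin", "sm_70")
    print(shlex.join(cmd))
    print(launch_config_1d(1000))
```

The two tuples returned by launch_config_1d can be passed as the first two arguments of the kernel call, before the tuple of kernel arguments.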

Issue Analytics

Created 5 years ago
Reactions:1
Comments:11 (10 by maintainers)

Top GitHub Comments

1 reaction

leofang commented, May 28, 2019

It’s in #1889 but is not merged to master yet.

1 reaction

leofang commented, Dec 4, 2018

@kmaehashi ok I’ll try
