Explicitly offloading nogil code to device/GPU
Allow marking nogil code to be run on a device/GPU.
There is also a discussion on the mailing-list: https://mail.python.org/pipermail/cython-devel/2020-January/005262.html.
Older CEPs suggest doing similar things by
- either automatically determining regions for offload using OpenCL (https://github.com/cython/cython/wiki/enhancements-opencl), which is relatively tricky and introduces a new tool,
- or extending the Cython language fairly extensively (https://github.com/cython/cython/wiki/enchancements-metadefintions).
A first and already very powerful step would be to explicitly mark code that should be offloaded, minimizing language extensions and not requiring extra tools.
As a start, we could consider only parallel devices, such as GPUs, and use OpenMP target, since Cython already uses OpenMP for parallelism.
Let’s consider a simple example for computing pairwise distances between vectors in parallel:
import numpy as np
from cython.parallel import prange, parallel
from libc.math cimport sqrt

def pairwise_distance_host(double[:, ::1] X):
    cdef int M = X.shape[0]
    cdef int N = X.shape[1]
    cdef double tmp, d
    cdef double[:, ::1] D = np.empty((M, M), dtype=np.float64)
    cdef int i, j, k
    with nogil:
        for i in prange(M):
            for j in prange(M):
                d = 0.0
                for k in range(N):
                    tmp = X[i, k] - X[j, k]
                    d = d + tmp * tmp  # '+=' would make Cython interpret d as a reduction variable!
                D[i, j] = sqrt(d)
    return np.asarray(D)
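For reference, the nested loops above compute plain pairwise Euclidean distances. A minimal NumPy sketch (the name `pairwise_distance_numpy` is ours, purely for host-side checking) that produces the same matrix:

```python
import numpy as np

def pairwise_distance_numpy(X):
    """Host-side reference for pairwise Euclidean distances,
    using broadcasting instead of explicit loops."""
    diff = X[:, None, :] - X[None, :, :]        # shape (M, M, N)
    return np.sqrt((diff * diff).sum(axis=-1))  # shape (M, M)
```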
The parallel region is implicitly defined by the first, outermost prange. For offloading, we could demand that the parallel region be defined explicitly:
...
with nogil, parallel():
    for i in prange(M):
        ...
Now all we need is a marker that the parallel region should be offloaded to a device:
...
with nogil, parallel(device={}):
    for i in prange(M):
        ...
Cython should take care of a safe way to define data mappings: transferring the necessary data to the device and from device to the host:
- by default arrays are sent from host to device when entering the parallel region and from the device to host when exiting
- read-only data is only sent from host to device but not from device to host
- ideally, write-only data will not be sent from host to device but only from device to host
- ideally, we can also detect that a variable is not used outside the parallel region so that we do not need to transfer any data (only allocate and deallocate)
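These default rules could be sketched as a small inference function. The following Python sketch (names and structure are ours, not part of the prototype) maps how a variable is used inside and outside the parallel region to an OpenMP-style map-type:

```python
def infer_map_type(is_read, is_written, used_outside_region):
    """Infer a default map-type from variable usage:
    device-only temporaries need no transfer, read-only data goes
    host->device, write-only data goes device->host, everything
    else is transferred both ways."""
    if not used_outside_region:
        return "alloc"   # only allocate/deallocate on the device
    if is_read and not is_written:
        return "to"      # host -> device only
    if is_written and not is_read:
        return "from"    # device -> host only
    return "tofrom"      # default: transfer both ways
```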
Because complex indexing can make it impossible to correctly determine the best mapping, and because data movement is often the biggest performance bottleneck, we also need a way for experts to optimize the data movement. For that, device should accept a dictionary mapping variable names to a map-type (as borrowed from OpenMP target map clauses):
- "to" means host to device
- "from" means device to host
- "tofrom" means host to device and device to host
- "alloc" means no data transfer at all
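To make the correspondence to OpenMP concrete, such a dictionary would essentially translate into map clauses on the target pragma. A hypothetical helper (ours, purely illustrative of the intended lowering):

```python
def target_pragma(device_map):
    """Render an OpenMP 'target' pragma with one map clause per entry,
    e.g. {'X': 'to', 'D': 'from'} ->
    '#pragma omp target map(to: X) map(from: D)'."""
    clauses = " ".join(
        "map(%s: %s)" % (map_type, var) for var, map_type in device_map.items()
    )
    return ("#pragma omp target " + clauses).rstrip()
```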
In the above example, the input array/memview is read-only on the device, so we could indicate it like this:
...
with nogil, parallel(device={X:'to'}):
    for i in prange(M):
        ...
Map-values provided in device override whatever Cython would automatically infer.
Another common challenge in offloading is that computation might go back and forth between host and GPU. In such cases it is often necessary to keep data on the GPU between different GPU regions even if a host section runs in between. As an example, let’s take the above code and block the computation so that only a single row of the output array is computed at once. Note that this will be needed anyway once the input array becomes large, since the output size grows quadratically and might simply not fit on the GPU.
def pairwise_distance_row(double[:, ::1] X):
    cdef int M = X.shape[0]
    cdef int N = X.shape[1]
    cdef double tmp, d
    cdef double[:, ::1] D = np.empty((M, M), dtype=np.float64)
    cdef double[::1] Dslice
    cdef int i, j, k
    with nogil:
        for i in range(M):
            Dslice = D[i, :]
            with parallel(device={Dslice:'from', X:'to'}):
                for j in prange(M):
                    d = 0.0
                    for k in range(N):
                        tmp = X[i, k] - X[j, k]
                        d = d + tmp * tmp  # '+=' would make Cython interpret d as a reduction variable!
                    Dslice[j] = sqrt(d)
    return np.asarray(D)
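The row-blocked variant computes exactly the same matrix, one row per outer iteration. In host-side NumPy terms (sketch, our naming), which can serve as a correctness check:

```python
import numpy as np

def pairwise_distance_rows(X):
    """Row-blocked pairwise distances: one output row per iteration,
    mirroring the loop structure of pairwise_distance_row above."""
    M = X.shape[0]
    D = np.empty((M, M))
    for i in range(M):
        diff = X - X[i]                          # broadcast row i against all rows
        D[i, :] = np.sqrt((diff * diff).sum(axis=1))
    return D
```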
Even though we only transfer slices of D in Dslice from device to host, the entire input array X will be sent from host to device in every iteration of the outermost loop. The suggested solution adds a data context (to be used with with) defining the lifetime of variables on the device. Let’s simply reuse the keyword device and let it accept the same mappings:
...
with nogil, device({X:'to'}):
    for i in range(M):
        ...
Since this is an expert tool, we might not want or need to infer any map-type here and can leave it to the programmer.
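The intended semantics resemble OpenMP's target data environment: the enclosing device(...) context transfers data once, and inner parallel regions find the data already present instead of re-transferring it. A toy Python model of that behavior (entirely illustrative, not the prototype's implementation):

```python
class DeviceData:
    """Toy model of a device data environment: variables mapped by an
    enclosing context are not re-transferred by inner regions."""
    def __init__(self):
        self.present = set()   # names currently mapped on the device
        self.transfers = 0     # host->device transfers performed

    def map_to_device(self, name):
        # A transfer happens only if the variable is not already mapped.
        if name not in self.present:
            self.present.add(name)
            self.transfers += 1

env = DeviceData()
env.map_to_device("X")        # outer "with device({X: 'to'})"
for i in range(10):           # host loop containing inner parallel regions
    env.map_to_device("X")    # inner regions reuse the existing mapping
```

With the outer context, X is transferred once instead of once per iteration.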
Calling functions in a device block
The OpenMP compiler will try to inline a function that appears in a target/device section/block and will usually complain if that’s not possible. For such cases Cython provides the decorator @cython.device to explicitly make functions available on the device:
@cython.device
cdef double _dist(double[:] v1, double[:] v2) nogil:
    cdef double d = 0.0
    cdef double tmp
    cdef int k
    for k in range(min(v1.shape[0], v2.shape[0])):
        tmp = v1[k] - v2[k]
        d += tmp * tmp  # no prange here, so '+=' is a plain accumulation
    return sqrt(d)
def pairwise_distance_target_row_context_annotated_func(double[:, ::1] X):
    cdef int M = X.shape[0]
    cdef double tmp, d
    cdef double[:, ::1] D = np.empty((M, M), dtype=np.float64)
    cdef double[::1] Dslice
    cdef int i, j, k
    with device({X:'to'}):
        with nogil:
            for i in range(M):
                Dslice = D[i, :]
                with parallel(device={Dslice:'from'}):
                    for j in prange(M):
                        Dslice[j] = _dist(X[i], X[j])
    return np.asarray(D)
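Semantically this still computes plain pairwise distances, now via a per-pair helper. A host-side NumPy equivalent (our naming) that mirrors the helper structure and can be used to validate results:

```python
import numpy as np

def _dist_np(v1, v2):
    """Host-side equivalent of the @cython.device _dist helper."""
    d = v1 - v2
    return np.sqrt(np.dot(d, d))

def pairwise_distance_func(X):
    """Pairwise distances built from the per-pair helper,
    mirroring the structure of the annotated-function version."""
    M = X.shape[0]
    D = np.empty((M, M))
    for i in range(M):
        for j in range(M):
            D[i, j] = _dist_np(X[i], X[j])
    return D
```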
A first prototype is implemented here: https://github.com/fschlimb/cython/tree/offload
Limitations, open questions, etc.
- In most cases the generated code does not work if @boundscheck and @wraparound are not set to False.
- OpenMP does not allow mapping overlapping memory regions. We need to at least check that two memviews do not overlap.
- C-pointers are not properly checked.
- Only C-contiguous memviews are supported.
- need support for setup.py/distutils
- need tests
- need documentation (syntax, semantics, and how to set up the offload compiler)
- error reporting back to the host is disabled; properly mapping the related variables could allow useful error reporting
- string support has not been looked at yet
Top GitHub Comments
I also added the keyword simd for prange, which will add it to the OpenMP pragma (without any further check).
@jeremiedbb Sorry. The branch is now back online.