
Explicitly offloading nogil code to device/GPU


Allow marking nogil code to be run on a device/GPU.

There is also a discussion on the mailing-list: https://mail.python.org/pipermail/cython-devel/2020-January/005262.html.

Older CEPs have suggested similar approaches.

A first, and already very powerful, step would be to explicitly mark code that should be offloaded, while minimizing language extensions and not requiring extra tools.

As a start, we could consider only parallel devices such as GPUs, and use OpenMP target, since Cython already uses OpenMP for parallelism.

Let’s consider a simple example for computing pairwise distances between vectors in parallel:

import numpy as np
from cython.parallel import prange
from libc.math cimport sqrt

def pairwise_distance_host(double[:, ::1] X):
    cdef int M = X.shape[0]
    cdef int N = X.shape[1]
    cdef double tmp, d
    cdef double[:, ::1] D = np.empty((M, M), dtype=np.float64)
    cdef int i, j, k
    with nogil:
        for i in prange(M):
            for j in prange(M):
                d = 0.0
                for k in range(N):
                    tmp = X[i, k] - X[j, k]
                    d = d + tmp * tmp  # Cython would interpret '+=' as a reduction variable!
                D[i, j] = sqrt(d)
    return np.asarray(D)
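For reference, what the loop above computes can be checked against a plain NumPy version on the host (a sketch for verification only, not part of the proposal):

```python
import numpy as np

def pairwise_distance_numpy(X):
    # D[i, j] = Euclidean distance between rows i and j of X
    diff = X[:, None, :] - X[None, :, :]        # shape (M, M, N)
    return np.sqrt((diff * diff).sum(axis=-1))  # shape (M, M)

X = np.array([[0.0, 0.0], [3.0, 4.0]])
D = pairwise_distance_numpy(X)
# D[0, 1] == 5.0 (the 3-4-5 triangle)
```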

The parallel region is implicitly defined by the first, outermost prange. For offloading, we could demand that the parallel region be defined explicitly:

...
    with nogil, parallel():
        for i in prange(M):
...

Now all we need is a marker that the parallel region should be offloaded to a device:

...
    with nogil, parallel(device={}):
        for i in prange(M):
...

Cython should take care of defining data mappings in a safe way, i.e. transferring the necessary data from host to device and back from device to host:

  • by default arrays are sent from host to device when entering the parallel region and from the device to host when exiting
  • read-only data is only sent from host to device but not from device to host
  • ideally, write-only data will not be sent from host to device but only from device to host
  • ideally, we can also detect that a variable is not used outside the parallel region so that we do not need to transfer any data (only allocate and deallocate)
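The default rules above can be modeled in plain Python, using NumPy copies to stand in for host-to-device and device-to-host transfers. This is a hypothetical toy model; the function name and signature are illustrative, not the proposed Cython API:

```python
import numpy as np

def run_offloaded(kernel, arrays, read_only=(), write_only=()):
    # Entering the region: copy everything to the "device", except
    # write-only data, which is only allocated there.
    device = {name: (np.empty_like(a) if name in write_only else a.copy())
              for name, a in arrays.items()}
    kernel(device)  # the kernel touches only the device-side copies
    # Exiting the region: copy everything back, except read-only data.
    for name, a in arrays.items():
        if name not in read_only:
            a[...] = device[name]

def kernel(d):
    d["Y"][:] = d["X"] * 2.0  # write-only output computed from read-only input

X = np.arange(4.0)
Y = np.zeros(4)
run_offloaded(kernel, {"X": X, "Y": Y}, read_only={"X"}, write_only={"Y"})
# Y is now [0., 2., 4., 6.]; X is never copied back
```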

Because complex indexing can make it impossible to correctly determine the best mapping, and because data movement is often the biggest performance bottleneck, we also need a way for experts to optimize the data movement. For that, device should accept a dictionary mapping variable names to a map-type (borrowed from OpenMP target map clauses):

  • "to" means host to device
  • "from" means device to host
  • "tofrom" means host to device and device to host
  • "alloc" means no data transfer at all

In the above example, the input array/memview is read-only on the device, so we could indicate it like this:

...
    with nogil, parallel(device={X:'to'}):
        for i in prange(M):
...

Map-values provided in device override whatever Cython would automatically infer.

Another common challenge in offloading is that computation might go back and forth between host and GPU. In such cases it is often necessary to keep data on the GPU between different GPU regions, even if a host section runs in between. As an example, let’s take the above code and block the computation by computing only a single row of the output array at a time. Note that this will be needed anyway once the input array becomes large, since the output size grows quadratically with the number of input vectors and might simply not fit on the GPU.

def pairwise_distance_row(double[:, ::1] X):
    cdef int M = X.shape[0]
    cdef int N = X.shape[1]
    cdef double tmp, d
    cdef double[:, ::1] D = np.empty((M, M), dtype=np.float64)
    cdef double[::1] Dslice
    cdef int i, j, k
    with nogil:
        for i in range(M):
            Dslice = D[i, :]
            with parallel(device={Dslice: 'from', X: 'to'}):
                for j in prange(M):
                    d = 0.0
                    for k in range(N):
                        tmp = X[i, k] - X[j, k]
                        d = d + tmp * tmp  # Cython would interpret '+=' as a reduction variable!
                    Dslice[j] = sqrt(d)
    return np.asarray(D)

Even though we only transfer slices of D in Dslice from device to host, the entire input array X will be sent from host to device in every iteration of the outermost loop. The suggested solution adds a data context (to be used in a with statement) defining the lifetime of variables on the device. Let’s simply reuse the keyword device and let it accept the same mappings:

...
    with nogil, device({X:'to'}):
        for i in range(M):
...

Since this is an expert tool, we might not want or need to infer any map-type here and can leave it entirely to the programmer.
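The transfer-count argument behind the device context can be illustrated with a toy model in plain Python (the FakeDevice class is purely illustrative): without a persistent mapping, X is re-sent on every outer iteration; with one, it is sent exactly once.

```python
import numpy as np

class FakeDevice:
    """Toy stand-in for a device that tracks host-to-device transfers."""
    def __init__(self):
        self.transfers = 0
        self.resident = {}
    def send(self, name, arr):
        if name not in self.resident:      # already resident: no new transfer
            self.resident[name] = arr.copy()
            self.transfers += 1
        return self.resident[name]

M = 8
X = np.random.rand(M, 3)

# Without a surrounding device context: the mapping is dropped when each
# parallel region ends, so every iteration re-sends X.
naive = FakeDevice()
for i in range(M):
    naive.resident.clear()                 # region ends, mapping is dropped
    naive.send("X", X)
# naive.transfers == M

# With a device({X: 'to'}) context around the loop: X stays resident.
ctx = FakeDevice()
for i in range(M):
    ctx.send("X", X)
# ctx.transfers == 1
```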

Calling functions in a device block

The OpenMP compiler will try to inline a function that appears in a target/device block and will usually complain if that’s not possible. For such cases Cython provides the decorator @cython.device to explicitly make functions available on the device:

    @cython.device
    cdef double _dist(double[:] v1, double[:] v2) nogil:
        cdef double d = 0.0
        cdef double tmp
        cdef int k
        for k in range(min(v1.shape[0], v2.shape[0])):
            tmp = v1[k] - v2[k]
            d += tmp * tmp  # no prange here, so '+=' is a plain accumulation
        return sqrt(d)

    def pairwise_distance_target_row_context_annotated_func(double[:, ::1] X):
        cdef int M = X.shape[0]
        cdef double[:, ::1] D = np.empty((M, M), dtype=np.float64)
        cdef double[::1] Dslice
        cdef int i, j
        with device({X: 'to'}):
            with nogil:
                for i in range(M):
                    Dslice = D[i, :]
                    with parallel(device={Dslice: 'from'}):
                        for j in prange(M):
                            Dslice[j] = _dist(X[i], X[j])
        return np.asarray(D)
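The factoring of the kernel into a per-pair distance function can be sanity-checked with a plain-Python analogue (a sketch; dist is a hypothetical stand-in for _dist, not device code):

```python
import numpy as np

def dist(v1, v2):
    # Mirrors _dist above: iterate over the shorter of the two vectors.
    d = 0.0
    for k in range(min(len(v1), len(v2))):
        tmp = v1[k] - v2[k]
        d += tmp * tmp
    return d ** 0.5

X = np.array([[1.0, 2.0], [4.0, 6.0]])
D = np.array([[dist(X[i], X[j]) for j in range(2)] for i in range(2)])
# D[0, 1] == 5.0 (differences 3 and 4)
```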

A first prototype is implemented here: https://github.com/fschlimb/cython/tree/offload

Limitations and open questions

  • In most cases the generated code only works if @boundscheck and @wraparound are set to False.
  • OpenMP does not allow mapping overlapping memory regions; we need to at least check that two memviews do not overlap.
  • C pointers are not properly checked.
  • Only C-contiguous memviews are supported.
  • Support for setup.py/distutils is needed.
  • Tests are needed.
  • Documentation is needed (syntax, semantics, and how to set up an offload compiler).
  • Error reporting back to the host is disabled; properly mapping the related variables could enable useful error reporting.
  • String support has not been looked at yet.
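For the overlap check mentioned above, the host side could lean on an existing conservative memory-overlap test; NumPy already exposes one for arrays (an illustration of the idea, not the proposed Cython implementation):

```python
import numpy as np

a = np.zeros(10)
b = a[2:6]        # a view into a: overlapping memory
c = np.zeros(10)  # an independent buffer: no overlap

overlap_ab = np.shares_memory(a, b)  # True: mapping both would be invalid
overlap_ac = np.shares_memory(a, c)  # False: safe to map both
```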

@ogrisel @GaelVaroquaux @oleksandr-pavlyk @DrTodd13


Top GitHub Comments

fschlimb commented, Mar 13, 2020:

I also added the keyword simd for prange, which will add it to the OpenMP pragma (without any further check).

fschlimb commented, Mar 13, 2020:

@jeremiedbb Sorry. The branch is now back online.
