Explicitly offloading nogil code to device/GPU
Allow marking nogil code to be run on a device/GPU.
There is also a discussion on the mailing-list: https://mail.python.org/pipermail/cython-devel/2020-January/005262.html.
Older CEPs suggest doing similar things by
- either automatically determining regions for offload using OpenCL (https://github.com/cython/cython/wiki/enhancements-opencl), which is relatively tricky and introduces a new tool,
- or extending the Cython language fairly extensively (https://github.com/cython/cython/wiki/enchancements-metadefintions).
A first and already very powerful step would be to explicitly mark code that should be offloaded, minimizing language extensions and not requiring extra tools.
As a start, we could consider only parallel devices, such as GPUs, and use OpenMP target, since Cython already uses OpenMP for parallelism.
Let’s consider a simple example for computing pairwise distances between vectors in parallel:
import numpy as np
from cython.parallel import prange, parallel
from libc.math cimport sqrt

def pairwise_distance_host(double[:, ::1] X):
    cdef int M = X.shape[0]
    cdef int N = X.shape[1]
    cdef double tmp, d
    cdef double[:, ::1] D = np.empty((M, M), dtype=np.float64)
    cdef int i, j, k
    with nogil:
        for i in prange(M):
            for j in prange(M):
                d = 0.0
                for k in range(N):
                    tmp = X[i, k] - X[j, k]
                    d = d + tmp * tmp  # '+=' would make Cython interpret d as a reduction variable!
                D[i, j] = sqrt(d)
    return np.asarray(D)
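For reference, the nested loops above compute plain pairwise Euclidean distances. A minimal NumPy sketch (the name `pairwise_distance_numpy` is ours, purely for host-side checking) that produces the same matrix:

```python
import numpy as np

def pairwise_distance_numpy(X):
    """Host-side reference for pairwise Euclidean distances,
    using broadcasting instead of explicit loops."""
    diff = X[:, None, :] - X[None, :, :]        # shape (M, M, N)
    return np.sqrt((diff * diff).sum(axis=-1))  # shape (M, M)
```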
The parallel region is implicitly defined by the first, outermost prange. For offloading, we could demand that the parallel region be defined explicitly:
...
with nogil, parallel():
    for i in prange(M):
        ...
Now all we need is a marker that the parallel region should be offloaded to a device:
...
with nogil, parallel(device={}):
    for i in prange(M):
        ...
Cython should take care of a safe way to define data mappings: transferring the necessary data to the device and from device to the host:
- by default arrays are sent from host to device when entering the parallel region and from the device to host when exiting
- read-only data is only sent from host to device but not from device to host
- ideally, write-only data will not be sent from host to device but only from device to host
- ideally, we can also detect that a variable is not used outside the parallel region so that we do not need to transfer any data (only allocate and deallocate)
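These default rules could be sketched as a small inference function. The following Python sketch (names and structure are ours, not part of the prototype) maps how a variable is used inside and outside the parallel region to an OpenMP-style map-type:

```python
def infer_map_type(is_read, is_written, used_outside_region):
    """Infer a default map-type from variable usage:
    device-only temporaries need no transfer, read-only data goes
    host->device, write-only data goes device->host, everything
    else is transferred both ways."""
    if not used_outside_region:
        return "alloc"   # only allocate/deallocate on the device
    if is_read and not is_written:
        return "to"      # host -> device only
    if is_written and not is_read:
        return "from"    # device -> host only
    return "tofrom"      # default: transfer both ways
```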
Because complex indexing can make it impossible to correctly determine the best mapping, and because data movement is often the biggest performance bottleneck, we also need a way for experts to optimize the data movement. For that, device should accept a dictionary mapping variable names to a map-type (as borrowed from OpenMP target map clauses):
- "to" means host to device
- "from" means device to host
- "tofrom" means host to device and device to host
- "alloc" means no data transfer at all
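To make the correspondence to OpenMP concrete, such a dictionary would essentially translate into map clauses on the target pragma. A hypothetical helper (ours, purely illustrative of the intended lowering):

```python
def target_pragma(device_map):
    """Render an OpenMP 'target' pragma with one map clause per entry,
    e.g. {'X': 'to', 'D': 'from'} ->
    '#pragma omp target map(to: X) map(from: D)'."""
    clauses = " ".join(
        "map(%s: %s)" % (map_type, var) for var, map_type in device_map.items()
    )
    return ("#pragma omp target " + clauses).rstrip()
```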
In the above example, the input array/memview is read-only on the device, so we could indicate it like this:
...
with nogil, parallel(device={X:'to'}):
    for i in prange(M):
        ...
Map-values provided in device override whatever Cython would automatically infer.
Another common challenge in offloading is that computation might go back and forth between host and GPU. In such cases it is often necessary to keep data on the GPU between different GPU regions even if a host section runs in between. As an example, let’s take the above code and block the computation so that only a single row of the output array is computed at once. Note that this will be needed anyway once the input array becomes large, since the output size grows quadratically and might simply not fit on the GPU.
def pairwise_distance_row(double[:, ::1] X):
    cdef int M = X.shape[0]
    cdef int N = X.shape[1]
    cdef double tmp, d
    cdef double[:, ::1] D = np.empty((M, M), dtype=np.float64)
    cdef double[::1] Dslice
    cdef int i, j, k
    with nogil:
        for i in range(M):
            Dslice = D[i, :]
            with parallel(device={Dslice:'from', X:'to'}):
                for j in prange(M):
                    d = 0.0
                    for k in range(N):
                        tmp = X[i, k] - X[j, k]
                        d = d + tmp * tmp  # '+=' would make Cython interpret d as a reduction variable!
                    Dslice[j] = sqrt(d)
    return np.asarray(D)
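The row-blocked variant computes exactly the same matrix, one row per outer iteration. In host-side NumPy terms (sketch, our naming), which can serve as a correctness check:

```python
import numpy as np

def pairwise_distance_rows(X):
    """Row-blocked pairwise distances: one output row per iteration,
    mirroring the loop structure of pairwise_distance_row above."""
    M = X.shape[0]
    D = np.empty((M, M))
    for i in range(M):
        diff = X - X[i]                          # broadcast row i against all rows
        D[i, :] = np.sqrt((diff * diff).sum(axis=1))
    return D
```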
Even though we only transfer slices of D in Dslice from device to host, the entire input array X will be sent from host to device in every iteration of the outermost loop. The suggested solution adds a data context (to be used with with) defining the lifetime of variables on the device. Let’s simply reuse the keyword device and let it accept the same mappings:
...
with nogil, device({X:'to'}):
    for i in range(M):
        ...
Since this is an expert tool, we might not want or need to infer any map-type here and can leave it to the programmer.
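The intended semantics resemble OpenMP's target data environment: the enclosing device(...) context transfers data once, and inner parallel regions find the data already present instead of re-transferring it. A toy Python model of that behavior (entirely illustrative, not the prototype's implementation):

```python
class DeviceData:
    """Toy model of a device data environment: variables mapped by an
    enclosing context are not re-transferred by inner regions."""
    def __init__(self):
        self.present = set()   # names currently mapped on the device
        self.transfers = 0     # host->device transfers performed

    def map_to_device(self, name):
        # A transfer happens only if the variable is not already mapped.
        if name not in self.present:
            self.present.add(name)
            self.transfers += 1

env = DeviceData()
env.map_to_device("X")        # outer "with device({X: 'to'})"
for i in range(10):           # host loop containing inner parallel regions
    env.map_to_device("X")    # inner regions reuse the existing mapping
```

With the outer context, X is transferred once instead of once per iteration.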
Calling functions in a device block
The OpenMP compiler will try to inline a function that appears in a target/device section/block and will usually complain if that’s not possible. For such cases Cython provides the decorator @cython.device to explicitly make functions available on the device:
@cython.device
cdef double _dist(double[:] v1, double[:] v2) nogil:
    cdef double d = 0.0
    cdef double tmp
    cdef int k
    for k in range(min(v1.shape[0], v2.shape[0])):
        tmp = v1[k] - v2[k]
        d += tmp * tmp  # no prange here, so '+=' is a plain accumulation
    return sqrt(d)
def pairwise_distance_target_row_context_annotated_func(double[:, ::1] X):
    cdef int M = X.shape[0]
    cdef double tmp, d
    cdef double[:, ::1] D = np.empty((M, M), dtype=np.float64)
    cdef double[::1] Dslice
    cdef int i, j, k
    with device({X:'to'}):
        with nogil:
            for i in range(M):
                Dslice = D[i, :]
                with parallel(device={Dslice:'from'}):
                    for j in prange(M):
                        Dslice[j] = _dist(X[i], X[j])
    return np.asarray(D)
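Semantically this still computes plain pairwise distances, now via a per-pair helper. A host-side NumPy equivalent (our naming) that mirrors the helper structure and can be used to validate results:

```python
import numpy as np

def _dist_np(v1, v2):
    """Host-side equivalent of the @cython.device _dist helper."""
    d = v1 - v2
    return np.sqrt(np.dot(d, d))

def pairwise_distance_func(X):
    """Pairwise distances built from the per-pair helper,
    mirroring the structure of the annotated-function version."""
    M = X.shape[0]
    D = np.empty((M, M))
    for i in range(M):
        for j in range(M):
            D[i, j] = _dist_np(X[i], X[j])
    return D
```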
A first prototype is implemented here: https://github.com/fschlimb/cython/tree/offload
Limitations, open questions, etc.
- In most cases the generated code does not work if @boundscheck and @wraparound are not set to False.
- OpenMP does not allow mapping overlapping memory regions. We need to at least check that two memviews do not overlap.
- C-pointers are not properly checked.
- Only C-contiguous memviews are supported.
- need support for setup.py/distutils
- need tests
- need documentation (syntax, semantics, and how to set up the offload compiler)
- error reporting back to the host is disabled; properly mapping the related variables could allow useful error reporting
- string support has not been looked at yet
Top GitHub Comments
I also added the keyword simd for prange, which will add it to the OpenMP pragma (without any further check).
@jeremiedbb Sorry. The branch is now back online.