Respecting the GPU Stream
Custom calls on GPU are passed a stream object in order to perform operations asynchronously. Essentially, they are supposed to enqueue their operations onto that stream, which acts as an ordered buffer of pending work.
So we should assume that previous operations have also only been enqueued on the stream and are not necessarily completed.
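For illustration, here is a minimal sketch (plain CUDA C with hypothetical names, not mpi4jax code) of what "enqueueing" means: both the kernel launch and the async copy return immediately, and completion is only guaranteed after an explicit synchronization:

#include <cuda_runtime.h>

__global__ void scale(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

void enqueue_example(float* dev, float* host, int n, cudaStream_t stream) {
    // both calls merely enqueue work on `stream` and return at once
    scale<<<(n + 255) / 256, 256, 0, stream>>>(dev, n);
    cudaMemcpyAsync(host, dev, n * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);
    // neither operation is guaranteed to have finished here...
    cudaStreamSynchronize(stream);  // ...but both have once this returns
}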
But our GPU operations do not respect this: they execute synchronously (immediately), without checking that previously enqueued operations on the stream have completed. I don’t know whether that is safe or not…
I believe we should investigate whether we should change the code of MPI_Allreduce and similar to be more like:
cdef void mpi_allreduce(cudaStream_t* stream, void** buffers,
                        const char* opaque, size_t opaque_len) nogil except *:
    if COPY_TO_HOST:
        # enqueue the device-to-host copy on the stream XLA gave us, then
        # block until it (and all previously enqueued work) has completed
        checked_cuda_memcpyAsync(in_buf, data, count, cudaMemcpyDeviceToHost, stream)
        cudaStreamSynchronize(stream)

    mpi_xla_bridge.mpi_allreduce(in_buf, out_buf, nitems, dtype, op, comm, token)

    if COPY_TO_HOST:
        # copy the result back to the device
        checked_cuda_memcpy(out_data, out_buf, count, cudaMemcpyHostToDevice)
where the latter operation could be async, but then we would need a callback (e.g. a CUDA event) to deallocate the CPU buffer once the memcpy has completed.
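As a rough sketch of that idea (a hypothetical helper, not part of mpi4jax; one could also record a cudaEvent and poll it, but a host-function callback is more direct):

#include <cuda_runtime.h>
#include <stdlib.h>

// Runs on a CUDA runtime thread once all work enqueued on the stream
// before it has completed. Host callbacks must not call CUDA APIs, so
// the staging buffer is assumed to come from plain malloc().
static void free_host_buffer(void* user_data) {
    free(user_data);
}

// Hypothetical helper: enqueue the host-to-device write-back, then
// schedule deallocation of the CPU staging buffer `out_buf`.
void copy_back_async(void* out_data, void* out_buf, size_t nbytes,
                     cudaStream_t stream) {
    cudaMemcpyAsync(out_data, out_buf, nbytes,
                    cudaMemcpyHostToDevice, stream);
    cudaLaunchHostFunc(stream, free_host_buffer, out_buf);
}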
Also see https://on-demand.gputechconf.com/gtc/2014/presentations/S4236-multi-gpu-programming-mpi.pdf
However, I’m not sure this is really needed; I have never noticed data corruption, so…
I wouldn’t bother with async. This just makes everything more complicated for little to no performance gain, since the MPI operations are synchronous anyway. If you’re worried about this, let’s just move the synchronization up a few lines.
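Concretely, that simpler approach would look something like this (a sketch in plain CUDA C with illustrative buffer names and a hard-coded dtype/op; the real custom call would take these from its opaque data):

#include <cuda_runtime.h>
#include <mpi.h>

void allreduce_sync(const void* in_dev, void* out_dev, void* host_in,
                    void* host_out, int nitems, size_t nbytes,
                    MPI_Comm comm, cudaStream_t stream) {
    // wait once, up front, for everything XLA has already enqueued
    cudaStreamSynchronize(stream);
    // from here on, plain synchronous calls are safe
    cudaMemcpy(host_in, in_dev, nbytes, cudaMemcpyDeviceToHost);
    MPI_Allreduce(host_in, host_out, nitems, MPI_FLOAT, MPI_SUM, comm);
    cudaMemcpy(out_dev, host_out, nbytes, cudaMemcpyHostToDevice);
}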
Not this particular one, but I’ve seen many crashes (usually because I messed up some shapes or didn’t pass the correct tokens).