
Respecting GPU Stream


Custom calls on GPU are passed a stream object so that they can perform their operations asynchronously. Essentially, they are supposed to enqueue their work onto that stream.

So we should assume that previous operations were also enqueued onto the stream and are not necessarily completed yet.

But our GPU operations do not respect this: they execute synchronously (immediately), without checking that all previously enqueued work on the stream has finished. I don’t know whether that is safe or not…
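
To make the concern concrete, here is a minimal, self-contained CUDA sketch (not mpi4jax code; all names are hypothetical, error checking omitted) of the pattern a stream-respecting custom call has to follow: the producer of the input buffer is merely enqueued on the stream, so a host-side consumer must either order its copy on the same stream or synchronize the stream before reading the data.

// Hypothetical illustration of stream ordering, not project code.
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void produce(float* buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = 42.0f;  // stands in for the XLA op producing our input
}

int main(void) {
    const int n = 1 << 20;
    float* d_buf;
    float h_first = -1.0f;
    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
    cudaMalloc(&d_buf, n * sizeof(float));

    // The "previous operation": enqueued on the stream, not necessarily finished
    // by the time a custom call starts executing on the host.
    produce<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n);

    // Respecting the stream: enqueue the copy on the *same* stream, so stream
    // order guarantees it runs after the producer kernel ...
    cudaMemcpyAsync(&h_first, d_buf, sizeof(float), cudaMemcpyDeviceToHost, stream);
    // ... then block the host until the stream has drained before using the data
    // on the CPU (e.g. handing it to MPI).
    cudaStreamSynchronize(stream);

    printf("first element = %f\n", h_first);  // guaranteed to be 42.0 here

    cudaFree(d_buf);
    cudaStreamDestroy(stream);
    return 0;
}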

I believe we should investigate whether to change the code of MPI_Allreduce and similar operations to be more like:


cdef void mpi_allreduce(cudaStream_t* stream, void** buffers,
                        const char* opaque, size_t opaque_len) nogil except *:

    if COPY_TO_HOST:
        # enqueue the device-to-host copy on the same stream, so it is ordered
        # after whatever produced `data`
        checked_cuda_memcpyAsync(in_buf, data, count, cudaMemcpyDeviceToHost, stream)

    # block the host until the stream has drained before the (blocking) MPI call
    cudaStreamSynchronize(stream)
    mpi_xla_bridge.mpi_allreduce(in_buf, out_buf, nitems, dtype, op, comm, token)

    if COPY_TO_HOST:
        # copy the result back to the device (synchronously for now)
        checked_cuda_memcpy(out_data, out_buf, count, cudaMemcpyHostToDevice)

where the latter copy could also be made asynchronous, but then we would need a callback (e.g. triggered via a CUDA event or a host function enqueued on the stream) to deallocate the CPU buffer once the memcpy has completed.
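
One possible shape for that deferred deallocation, sketched with cudaLaunchHostFunc; everything except the CUDA runtime calls is hypothetical and not the project's implementation:

// Sketch: copy the reduced result back to the device without blocking the host,
// and free the host staging buffer only once that copy has actually finished.
#include <cuda_runtime.h>
#include <stdlib.h>

static void free_staging_buffer(void* user_data) {
    // Runs on a CUDA-internal thread after all preceding work on the stream
    // (including the copy enqueued below) has completed. CUDA API calls are not
    // allowed inside host functions, so the staging buffer here is plain
    // malloc'd memory rather than pinned memory.
    free(user_data);
}

static cudaError_t copy_back_async(void* out_device, void* host_staging,
                                   size_t count, cudaStream_t stream) {
    cudaError_t err = cudaMemcpyAsync(out_device, host_staging, count,
                                      cudaMemcpyHostToDevice, stream);
    if (err != cudaSuccess) return err;
    // Deallocation is deferred until the memcpy above has completed in stream order.
    return cudaLaunchHostFunc(stream, free_staging_buffer, host_staging);
}

Note that with pageable memory the copy may not actually overlap with host work; pinned memory would overlap better, but then the buffer would have to be released through some other mechanism, since cudaFreeHost cannot be called from inside a host function.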

Also see https://on-demand.gputechconf.com/gtc/2014/presentations/S4236-multi-gpu-programming-mpi.pdf

However, I’m not sure this is really needed; I’ve never noticed any data corruption so far…

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 5

Top GitHub Comments

1 reaction
dionhaefner commented, Mar 15, 2021

I wouldn’t bother with async. This just makes everything more complicated for little to no performance gain, since the MPI operations are synchronous anyway. If you’re worried about this, let’s just move the synchronization up a few lines.
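
For illustration, a rough sketch of what "moving the synchronization up" could look like, written as standalone C/CUDA with MPI rather than the project's Cython bridge (names and buffer handling are hypothetical):

// Fully synchronous variant: drain the stream first, then use blocking calls.
#include <cuda_runtime.h>
#include <mpi.h>

static void allreduce_custom_call(cudaStream_t stream,
                                  const void* in_device, void* out_device,
                                  void* host_in, void* host_out,
                                  size_t count_bytes, int nitems,
                                  MPI_Datatype dtype, MPI_Op op, MPI_Comm comm) {
    // Synchronization moved up: wait for everything previously enqueued on the
    // stream before touching the input buffer at all.
    cudaStreamSynchronize(stream);

    // From here on, plain blocking calls are fine -- MPI_Allreduce blocks anyway.
    cudaMemcpy(host_in, in_device, count_bytes, cudaMemcpyDeviceToHost);
    MPI_Allreduce(host_in, host_out, nitems, dtype, op, comm);
    cudaMemcpy(out_device, host_out, count_bytes, cudaMemcpyHostToDevice);
}

Since MPI_Allreduce blocks anyway, the blocking copies cost little extra; the only requirement is that the stream is drained before the input buffer is read.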

0 reactions
dionhaefner commented, Mar 15, 2021

Not this particular one, but I’ve seen many crashes (usually because I messed up some shapes or didn’t pass the correct tokens).
