Respecting the GPU Stream
Custom calls on GPU are passed a stream object in order to perform operations asynchronously. Essentially, they are supposed to enqueue their operations onto that stream, which acts as an ordered buffer of pending work.
So we should assume that previous operations have also only been enqueued on the stream and are not necessarily completed.
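For illustration, here is a minimal sketch (plain CUDA C with hypothetical names, not mpi4jax code) of what "enqueueing" means: both the kernel launch and the async copy return immediately, and completion is only guaranteed after an explicit synchronization:

#include <cuda_runtime.h>

__global__ void scale(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

void enqueue_example(float* dev, float* host, int n, cudaStream_t stream) {
    // both calls merely enqueue work on `stream` and return at once
    scale<<<(n + 255) / 256, 256, 0, stream>>>(dev, n);
    cudaMemcpyAsync(host, dev, n * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);
    // neither operation is guaranteed to have finished here...
    cudaStreamSynchronize(stream);  // ...but both have once this returns
}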
But our GPU operations do not respect this: they execute synchronously (immediately), without checking that previously enqueued operations on the stream have completed. I don’t know whether that is safe or not…
I believe we should investigate whether we should change the code of MPI_Allreduce and similar to be more like:
cdef void mpi_allreduce(cudaStream_t* stream, void** buffers,
                        const char* opaque, size_t opaque_len) nogil except *:
    if COPY_TO_HOST:
        # enqueue the device-to-host copy on the stream XLA gave us, then
        # block until it (and all previously enqueued work) has completed
        checked_cuda_memcpyAsync(in_buf, data, count, cudaMemcpyDeviceToHost, stream)
        cudaStreamSynchronize(stream)

    mpi_xla_bridge.mpi_allreduce(in_buf, out_buf, nitems, dtype, op, comm, token)

    if COPY_TO_HOST:
        # copy the result back to the device
        checked_cuda_memcpy(out_data, out_buf, count, cudaMemcpyHostToDevice)
where the latter operation could be async, but then we would need a callback (e.g. a CUDA event) to deallocate the CPU buffer once the memcpy has completed.
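As a rough sketch of that idea (a hypothetical helper, not part of mpi4jax; one could also record a cudaEvent and poll it, but a host-function callback is more direct):

#include <cuda_runtime.h>
#include <stdlib.h>

// Runs on a CUDA runtime thread once all work enqueued on the stream
// before it has completed. Host callbacks must not call CUDA APIs, so
// the staging buffer is assumed to come from plain malloc().
static void free_host_buffer(void* user_data) {
    free(user_data);
}

// Hypothetical helper: enqueue the host-to-device write-back, then
// schedule deallocation of the CPU staging buffer `out_buf`.
void copy_back_async(void* out_data, void* out_buf, size_t nbytes,
                     cudaStream_t stream) {
    cudaMemcpyAsync(out_data, out_buf, nbytes,
                    cudaMemcpyHostToDevice, stream);
    cudaLaunchHostFunc(stream, free_host_buffer, out_buf);
}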
Also see https://on-demand.gputechconf.com/gtc/2014/presentations/S4236-multi-gpu-programming-mpi.pdf
However, I’m not sure this is really needed; I have never noticed data corruption, so…
I wouldn’t bother with async. This just makes everything more complicated for little to no performance gain, since the MPI operations are synchronous anyway. If you’re worried about this, let’s just move the synchronization up a few lines.
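Concretely, that simpler approach would look something like this (a sketch in plain CUDA C with illustrative buffer names and a hard-coded dtype/op; the real custom call would take these from its opaque data):

#include <cuda_runtime.h>
#include <mpi.h>

void allreduce_sync(const void* in_dev, void* out_dev, void* host_in,
                    void* host_out, int nitems, size_t nbytes,
                    MPI_Comm comm, cudaStream_t stream) {
    // wait once, up front, for everything XLA has already enqueued
    cudaStreamSynchronize(stream);
    // from here on, plain synchronous calls are safe
    cudaMemcpy(host_in, in_dev, nbytes, cudaMemcpyDeviceToHost);
    MPI_Allreduce(host_in, host_out, nitems, MPI_FLOAT, MPI_SUM, comm);
    cudaMemcpy(out_dev, host_out, nbytes, cudaMemcpyHostToDevice);
}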
Not this particular one, but I’ve seen many crashes (usually because I messed up some shapes or didn’t pass the correct tokens).