Non-blocking operations on scheduler
Hi, this is just a general discussion of the scheduler’s behavior. I noticed that in `initialize`:
```python
async def run_scheduler():
    async with Scheduler(
        interface=interface,
        protocol=protocol,
        dashboard=dashboard,
        dashboard_address=dashboard_address,
    ) as scheduler:
        comm.bcast(scheduler.address, root=0)
        comm.Barrier()
        await scheduler.finished()
```
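For context, this coroutine lives in dask-mpi’s `initialize`, which is typically driven by a script like the following sketch (the `mpirun` invocation and script name are illustrative):

```python
# Run under MPI, e.g.: mpirun -np 4 python script.py
# Rank 0 runs the scheduler, rank 1 runs this client code,
# and the remaining ranks become workers.
from dask_mpi import initialize
from distributed import Client

initialize()

client = Client()  # connects to the scheduler address broadcast over MPI
print(client.submit(sum, [1, 2, 3]).result())
```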
The scheduler starts with only one thread, and since it has to maintain communication with all workers and the client, I’m curious whether, once execution reaches an `await`, the scheduler simply blocks there and can only move on to other communications after that `await` completes. If so, this seems like a huge communication overhead.
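As far as I understand asyncio, an `await` suspends only the awaiting coroutine, not the whole thread; a single-threaded event loop is free to run other ready coroutines in the meantime. A minimal, self-contained sketch (all names here are illustrative, not Distributed APIs):

```python
import asyncio

async def talk_to(peer: str, delay: float) -> None:
    # The await suspends only this coroutine; the single-threaded
    # event loop switches to other pending coroutines meanwhile.
    print(f"start send to {peer}")
    await asyncio.sleep(delay)  # stands in for an awaitable network send
    print(f"finished send to {peer}")

async def main() -> None:
    # All three "sends" are in flight concurrently on one thread.
    await asyncio.gather(
        talk_to("worker-0", 0.3),
        talk_to("worker-1", 0.2),
        talk_to("worker-2", 0.1),
    )

asyncio.run(main())
```

The question is whether the scheduler’s comm layer actually exploits this, or whether its sends end up serialized in practice.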
For a concrete example from the comm layer, consider `write` in `distributed.comm.ucx`:
```python
@log_errors
async def write(
    self,
    msg: dict,
    serializers: Collection[str] | None = None,
    on_error: str = "message",
) -> int:
    if self.closed():
        raise CommClosedError("Endpoint is closed -- unable to send message")
    try:
        if serializers is None:
            serializers = ("cuda", "dask", "pickle", "error")
        # msg can also be a list of dicts when sending batched messages
        logging.info("send msg={}".format(msg))
        frames = await to_frames(
            msg,
            serializers=serializers,
            on_error=on_error,
            allow_offload=self.allow_offload,
        )
        nframes = len(frames)
        cuda_frames = tuple(hasattr(f, "__cuda_array_interface__") for f in frames)
        sizes = tuple(nbytes(f) for f in frames)
        cuda_send_frames, send_frames = zip(
            *(
                (is_cuda, each_frame)
                for is_cuda, each_frame in zip(cuda_frames, frames)
                if nbytes(each_frame) > 0
            )
        )

        # Send metadata:
        # send close flag and number of frames (_Bool, int64)
        await self.ep.send(struct.pack("?Q", False, nframes))
        # Send which frames are CUDA (bool) and
        # how large each frame is (uint64)
        await self.ep.send(
            struct.pack(nframes * "?" + nframes * "Q", *cuda_frames, *sizes)
        )

        # Send frames.
        # It is necessary to synchronize the default stream before sending.
        # We synchronize the default stream because UCX is not
        # stream-ordered and syncing the default stream will wait for other
        # non-blocking CUDA streams. Note this is only sufficient if the memory
        # being sent is not currently in use on non-blocking CUDA streams.
        if any(cuda_send_frames):
            synchronize_stream(0)

        for each_frame in send_frames:
            await self.ep.send(each_frame)
        return sum(sizes)
    except ucp.exceptions.UCXBaseException:
        self.abort()
        raise CommClosedError("While writing, the connection was closed")
```
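As an aside on the framing above: the header is two plain `struct` packs, which a receiver unpacks symmetrically. A simplified, standalone round-trip of just that header (no real UCX endpoint involved; the values are made up):

```python
import struct

# Pack the header exactly as write() does: a close flag plus the frame
# count, then one bool per frame (is it CUDA?) and one uint64 per frame
# (its size in bytes).
nframes = 2
cuda_frames = (False, True)
sizes = (128, 4096)

header1 = struct.pack("?Q", False, nframes)
header2 = struct.pack(nframes * "?" + nframes * "Q", *cuda_frames, *sizes)

# The receiving side unpacks the same layout in the same order.
closed, n = struct.unpack("?Q", header1)
flags_and_sizes = struct.unpack(n * "?" + n * "Q", header2)
cuda_flags, frame_sizes = flags_and_sizes[:n], flags_and_sizes[n:]
assert (closed, n) == (False, 2)
assert cuda_flags == cuda_frames and frame_sizes == sizes
```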
There are several `await self.ep.send(...)` calls in `write`. Consider, for example, a send from the scheduler to a worker (dask/dask-mpi#1): even though all the other workers can perform their computation in parallel, they still have to wait sequentially for their communication with the scheduler. In cases where communication is heavier than computation, the overhead will be significant.
I’m wondering if there is any way to perform non-blocking send/recv by giving the scheduler more threads.
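For what it’s worth, the overlap being asked about does not necessarily require extra threads: on a single event loop, per-worker sends can be issued concurrently instead of one after another. A hedged sketch, where `send_to_worker` is a hypothetical stand-in for a comm’s `write`:

```python
import asyncio

async def send_to_worker(addr: str, msg: bytes) -> None:
    # Hypothetical stand-in for comm.write(); the await yields the
    # event loop while the transfer makes progress.
    await asyncio.sleep(0)

async def broadcast(addrs: list[str], msg: bytes) -> None:
    # Sequential version: total latency is the sum of all sends.
    #   for addr in addrs:
    #       await send_to_worker(addr, msg)
    # Concurrent version: sends overlap, bounded by the slowest one.
    await asyncio.gather(*(send_to_worker(a, msg) for a in addrs))
```

Whether the GIL or the comm backend then becomes the bottleneck is a separate question.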
Top GitHub Comments
Agreed, transferring this issue to distributed.
This seems to be a question more for the Distributed community. Dask-MPI is just the tool for launching a Dask cluster.