How to use the UCX protocol for communication between workers and schedulers

It seems that dask.distributed now supports the UCX protocol for communication between workers and schedulers, which appears to have large advantages over TCP on systems equipped with InfiniBand. How can I use that with jobqueue? It does not seem hard, because jobqueue is based on dask.distributed. If I add the --protocol ucx option to the scheduler and worker commands, would that be OK?

Issue Analytics

Created 4 years ago
Comments: 14 (13 by maintainers)

Top GitHub Comments

andersy005 commented, May 20, 2021

I did. Things worked, but weren’t yet any faster. The Dask + UCX team within RAPIDS (which @quasiben leads) is working on profiling and performance now, so hopefully we’ll see some larger speedups soon.

As @quasiben states above, I did this just by adding the protocol="ucx://" keyword to the FooCluster classes.

@mrocklin, could you point me to the setup you used on Cheyenne/Casper? I’ve been trying to launch a Dask cluster with the UCX protocol for communication, and all my attempts have failed.

Running the following

cluster = PBSCluster(protocol="ucx://", env_extra=["export UCX_TLS=tcp,sockcm", 
                                                    "export UCX_SOCKADDR_TLS_PRIORITY=sockcm", 
                                                    'export UCXPY_IFNAME="ib0"'])

Results in a timeout error.

RuntimeError                              Traceback (most recent call last)
<ipython-input-4-5ed5e6609592> in <module>
----> 1 cluster = PBSCluster(protocol="ucx://", env_extra=["export UCX_TLS=tcp,sockcm", 
      2                                                     "export UCX_SOCKADDR_TLS_PRIORITY=sockcm",
      3                                                     'export UCXPY_IFNAME="ib0"'])
      4 client = Client(cluster)
      5 cluster

/glade/work/abanihi/opt/miniconda/envs/dask-gpu/lib/python3.8/site-packages/dask_jobqueue/core.py in __init__(self, n_workers, job_cls, loop, security, silence_logs, name, asynchronous, dashboard_address, host, scheduler_options, interface, protocol, config_name, **job_kwargs)
    528         self._dummy_job  # trigger property to ensure that the job is valid
    529 
--> 530         super().__init__(
    531             scheduler=scheduler,
    532             worker=worker,

/glade/work/abanihi/opt/miniconda/envs/dask-gpu/lib/python3.8/site-packages/distributed/deploy/spec.py in __init__(self, workers, scheduler, worker, asynchronous, loop, security, silence_logs, name, shutdown_on_close)
    280         if not self.asynchronous:
    281             self._loop_runner.start()
--> 282             self.sync(self._start)
    283             self.sync(self._correct_state)
    284 

/glade/work/abanihi/opt/miniconda/envs/dask-gpu/lib/python3.8/site-packages/distributed/deploy/cluster.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
    186             return future
    187         else:
--> 188             return sync(self.loop, func, *args, **kwargs)
    189 
    190     def _log(self, log):

/glade/work/abanihi/opt/miniconda/envs/dask-gpu/lib/python3.8/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
    352     if error[0]:
    353         typ, exc, tb = error[0]
--> 354         raise exc.with_traceback(tb)
    355     else:
    356         return result[0]

/glade/work/abanihi/opt/miniconda/envs/dask-gpu/lib/python3.8/site-packages/distributed/utils.py in f()
    335             if callback_timeout is not None:
    336                 future = asyncio.wait_for(future, callback_timeout)
--> 337             result[0] = yield future
    338         except Exception as exc:
    339             error[0] = sys.exc_info()

/glade/work/abanihi/opt/miniconda/envs/dask-gpu/lib/python3.8/site-packages/tornado/gen.py in run(self)
    760 
    761                     try:
--> 762                         value = future.result()
    763                     except Exception:
    764                         exc_info = sys.exc_info()

/glade/work/abanihi/opt/miniconda/envs/dask-gpu/lib/python3.8/site-packages/distributed/deploy/spec.py in _start(self)
    319             self.status = Status.failed
    320             await self._close()
--> 321             raise RuntimeError(f"Cluster failed to start. {str(e)}") from e
    322 
    323     def _correct_state(self):

RuntimeError: Cluster failed to start. Timed out trying to connect to ucx://10.12.206.47:49153 after 10 s

I tried launching the scheduler from the command line, and I ran into a different error:

(dask-gpu) bash-4.2$ dask-scheduler --protocol ucx
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - Clear task state
[1621471211.546747] [casper-login1:254743:0]    ucp_context.c:735  UCX  WARN  network device 'mlx5_0:1' is not available, please use one or more of: 'ext'(tcp), 'ib0'(tcp), 'mgt'(tcp)
[1621471211.546766] [casper-login1:254743:0]    ucp_context.c:1071 UCX  ERROR no usable transports/devices (asked tcp,sockcm on network:mlx5_0:1 )
Traceback (most recent call last):
  File "/glade/work/abanihi/opt/miniconda/envs/dask-gpu/lib/python3.8/site-packages/distributed/cli/dask_scheduler.py", line 208, in main
    loop.run_sync(run)
  File "/glade/work/abanihi/opt/miniconda/envs/dask-gpu/lib/python3.8/site-packages/tornado/ioloop.py", line 530, in run_sync
    return future_cell[0].result()
  File "/glade/work/abanihi/opt/miniconda/envs/dask-gpu/lib/python3.8/site-packages/distributed/cli/dask_scheduler.py", line 204, in run
    await scheduler
  File "/glade/work/abanihi/opt/miniconda/envs/dask-gpu/lib/python3.8/site-packages/distributed/core.py", line 285, in _
    await self.start()
  File "/glade/work/abanihi/opt/miniconda/envs/dask-gpu/lib/python3.8/site-packages/distributed/scheduler.py", line 3678, in start
    await self.listen(
  File "/glade/work/abanihi/opt/miniconda/envs/dask-gpu/lib/python3.8/site-packages/distributed/core.py", line 400, in listen
    listener = await listen(
  File "/glade/work/abanihi/opt/miniconda/envs/dask-gpu/lib/python3.8/site-packages/distributed/comm/core.py", line 208, in _
    await self.start()
  File "/glade/work/abanihi/opt/miniconda/envs/dask-gpu/lib/python3.8/site-packages/distributed/comm/ucx.py", line 404, in start
    init_once()
  File "/glade/work/abanihi/opt/miniconda/envs/dask-gpu/lib/python3.8/site-packages/distributed/comm/ucx.py", line 76, in init_once
    ucp.init(options=ucx_config, env_takes_precedence=True)
  File "/glade/work/abanihi/opt/miniconda/envs/dask-gpu/lib/python3.8/site-packages/ucp/core.py", line 766, in init
    _ctx = ApplicationContext(options, blocking_progress_mode=blocking_progress_mode)
  File "/glade/work/abanihi/opt/miniconda/envs/dask-gpu/lib/python3.8/site-packages/ucp/core.py", line 234, in __init__
    self.context = ucx_api.UCXContext(config_dict)
  File "ucp/_libs/ucx_api.pyx", line 295, in ucp._libs.ucx_api.UCXContext.__init__
  File "ucp/_libs/ucx_api.pyx", line 107, in ucp._libs.ucx_api.assert_ucs_status
ucp.exceptions.UCXError: No such device

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/glade/work/abanihi/opt/miniconda/envs/dask-gpu/bin/dask-scheduler", line 11, in <module>
    sys.exit(go())
  File "/glade/work/abanihi/opt/miniconda/envs/dask-gpu/lib/python3.8/site-packages/distributed/cli/dask_scheduler.py", line 217, in go
    main()
  File "/glade/work/abanihi/opt/miniconda/envs/dask-gpu/lib/python3.8/site-packages/click/core.py", line 1134, in __call__
    return self.main(*args, **kwargs)
  File "/glade/work/abanihi/opt/miniconda/envs/dask-gpu/lib/python3.8/site-packages/click/core.py", line 1059, in main
    rv = self.invoke(ctx)
  File "/glade/work/abanihi/opt/miniconda/envs/dask-gpu/lib/python3.8/site-packages/click/core.py", line 1401, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/glade/work/abanihi/opt/miniconda/envs/dask-gpu/lib/python3.8/site-packages/click/core.py", line 767, in invoke
    return __callback(*args, **kwargs)
  File "/glade/work/abanihi/opt/miniconda/envs/dask-gpu/lib/python3.8/site-packages/distributed/cli/dask_scheduler.py", line 212, in main
    logger.info("End scheduler at %r", scheduler.address)
  File "/glade/work/abanihi/opt/miniconda/envs/dask-gpu/lib/python3.8/site-packages/distributed/core.py", line 359, in address
    raise ValueError("cannot get address of non-running Server")
ValueError: cannot get address of non-running Server
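
The UCX warning in the log above names the devices it can use on that node ('ext', 'ib0', 'mgt') while the requested device 'mlx5_0:1' is unavailable. The following is a minimal sketch, assuming that pinning UCX to one of the listed devices through the UCX_NET_DEVICES environment variable is appropriate here; the thread does not confirm this, and a login node may expose different devices than a compute node. The helper names available_devices and ucx_exports are hypothetical and exist only for illustration.

```python
import re
from typing import List

# The UCX warning copied from the scheduler log above.
UCX_WARNING = (
    "[1621471211.546747] [casper-login1:254743:0]    ucp_context.c:735  UCX  WARN  "
    "network device 'mlx5_0:1' is not available, please use one or more of: "
    "'ext'(tcp), 'ib0'(tcp), 'mgt'(tcp)"
)


def available_devices(warning: str) -> List[str]:
    """Return the device names UCX lists after 'one or more of:'."""
    _, _, tail = warning.partition("one or more of:")
    # Each entry looks like 'ib0'(tcp); capture only the quoted name.
    return re.findall(r"'([^']+)'\(\w+\)", tail)


def ucx_exports(device: str, tls: str = "tcp,sockcm") -> List[str]:
    """Build shell export lines in the same style as the env_extra list above."""
    return [
        f"export UCX_TLS={tls}",
        # Assumption: restricting UCX to one listed device avoids the
        # "No such device" error; this was not verified in the thread.
        f"export UCX_NET_DEVICES={device}",
    ]


devices = available_devices(UCX_WARNING)
print(devices)
print(ucx_exports("ib0"))
```

The resulting list could be passed as env_extra alongside protocol="ucx://", as in the PBSCluster call earlier in this comment. Whether that resolves the timeout is untested.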

Am I making a trivial error, or do I need to do some extra setup for things to work properly?

CCing @quasiben in case he has some suggestions, too.

ocaisa commented, May 28, 2021

I should add here that we also tested this a few months ago and found it to give no performance benefit (at least in our use case). We also found that it kills resilience, though this may have since changed.