SSHCluster expects conda environment to be at the same path on all systems
What happened:
When using SSHCluster on machines where the conda environments are installed at different paths, the cluster fails to start.
What you expected to happen: The correct conda environment should be activated.
It seems that SSHCluster tries to call the python executable directly, at the same absolute path that the current Python is running from, which only works if every machine mirrors that layout. It may be more robust to activate the conda environment with the same name as the current one and use the python it provides. It would also be good to be able to specify a conda environment in the kwargs, as sketched below.
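As a partial workaround today, the remote_python keyword (visible in the SSHCluster signature in the traceback below) can override the interpreter path used on each host. A minimal sketch, assuming the environment paths from the example below exist and that the installed distributed version accepts a list of per-host paths (newer releases do; older ones may only take a single string):

from dask.distributed import SSHCluster

# Sketch only: one interpreter path per host, in the same order as the
# hosts list. A single string would apply the same path everywhere,
# which is exactly the limitation described above.
cluster = SSHCluster(
    ["localhost", "HostB"],
    remote_python=[
        "/tmp/condaA/test/bin/python",  # scheduler on Host A (localhost)
        "/tmp/condaB/test/bin/python",  # worker on HostB
    ],
)

This sidesteps the path mismatch, but it still requires knowing each environment's absolute path up front, which is what a conda-environment kwarg would avoid.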
Minimal Complete Verifiable Example:
# On Host A
conda create -p /tmp/condaA/test python ipython dask
conda activate /tmp/condaA/test
# On Host B
conda create -p /tmp/condaB/test python ipython dask
conda activate /tmp/condaB/test
# On Host A
from dask.distributed import SSHCluster
cluster = SSHCluster(["localhost", "HostB"])
...
distributed.deploy.ssh - INFO - env: ‘/tmp/condaA/test/bin/python’: No such file or directory
...
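For comparison, the activate-on-the-remote-side behaviour proposed above could look roughly like this (a sketch of the proposal, not what SSHCluster currently does; the scheduler address is a placeholder, and distributed.cli.dask_worker is the default worker_module from the signature below):

# Sketch: resolve the environment on the remote host itself via conda run,
# by prefix (-p) or by name (-n), instead of reusing the local path.
ssh HostB conda run -p /tmp/condaB/test python -m distributed.cli.dask_worker tcp://HostA:8786

Because conda run resolves the environment on the remote machine, the local installation prefix never enters the launch command.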
Full Example Traceback
This traceback is from a real run, so the paths don't quite match the simplified example above.
distributed.deploy.ssh - INFO - distributed.scheduler - INFO - -----------------------------------------------
distributed.deploy.ssh - INFO - distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
distributed.deploy.ssh - INFO - distributed.scheduler - INFO - -----------------------------------------------
distributed.deploy.ssh - INFO - distributed.scheduler - INFO - Clear task state
distributed.deploy.ssh - INFO - distributed.scheduler - INFO - Scheduler at: tcp://10.51.100.15:8786
distributed.deploy.ssh - INFO - env: ‘/Users/jtomlinson/miniconda3/envs/coiledstream/bin/python’: No such file or directory
Task exception was never retrieved
future: <Task finished name='Task-45' coro=<_wrap_awaitable() done, defined at /Users/jtomlinson/miniconda3/envs/coiledstream/lib/python3.8/asyncio/tasks.py:677> exception=Exception('Worker failed to start')>
Traceback (most recent call last):
File "/Users/jtomlinson/miniconda3/envs/coiledstream/lib/python3.8/asyncio/tasks.py", line 684, in _wrap_awaitable
return (yield from awaitable.__await__())
File "/Users/jtomlinson/miniconda3/envs/coiledstream/lib/python3.8/site-packages/distributed/deploy/spec.py", line 50, in _
await self.start()
File "/Users/jtomlinson/miniconda3/envs/coiledstream/lib/python3.8/site-packages/distributed/deploy/ssh.py", line 129, in start
raise Exception("Worker failed to start")
Exception: Worker failed to start
distributed.deploy.ssh - INFO - env: ‘/Users/jtomlinson/miniconda3/envs/coiledstream/bin/python’: No such file or directory
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
<ipython-input-2-dbe5c7142de0> in <module>
----> 1 cluster = SSHCluster(["localhost", "10.51.0.32"], connect_options=[{}, {"username": "jacob"}])
~/miniconda3/envs/coiledstream/lib/python3.8/site-packages/distributed/deploy/ssh.py in SSHCluster(hosts, connect_options, worker_options, scheduler_options, worker_module, remote_python, **kwargs)
352 for i, host in enumerate(hosts[1:])
353 }
--> 354 return SpecCluster(workers, scheduler, name="SSHCluster", **kwargs)
~/miniconda3/envs/coiledstream/lib/python3.8/site-packages/distributed/deploy/spec.py in __init__(self, workers, scheduler, worker, asynchronous, loop, security, silence_logs, name)
255 self._loop_runner.start()
256 self.sync(self._start)
--> 257 self.sync(self._correct_state)
258
259 async def _start(self):
~/miniconda3/envs/coiledstream/lib/python3.8/site-packages/distributed/deploy/cluster.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
167 return future
168 else:
--> 169 return sync(self.loop, func, *args, **kwargs)
170
171 async def _get_logs(self, scheduler=True, workers=True):
~/miniconda3/envs/coiledstream/lib/python3.8/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
337 if error[0]:
338 typ, exc, tb = error[0]
--> 339 raise exc.with_traceback(tb)
340 else:
341 return result[0]
~/miniconda3/envs/coiledstream/lib/python3.8/site-packages/distributed/utils.py in f()
321 if callback_timeout is not None:
322 future = asyncio.wait_for(future, callback_timeout)
--> 323 result[0] = yield future
324 except Exception as exc:
325 error[0] = sys.exc_info()
~/miniconda3/envs/coiledstream/lib/python3.8/site-packages/tornado/gen.py in run(self)
733
734 try:
--> 735 value = future.result()
736 except Exception:
737 exc_info = sys.exc_info()
~/miniconda3/envs/coiledstream/lib/python3.8/site-packages/distributed/deploy/spec.py in _correct_state_internal(self)
333 for w in workers:
334 w._cluster = weakref.ref(self)
--> 335 await w # for tornado gen.coroutine support
336 self.workers.update(dict(zip(to_open, workers)))
337
~/miniconda3/envs/coiledstream/lib/python3.8/site-packages/distributed/deploy/spec.py in _()
48 async with self.lock:
49 if self.status == "created":
---> 50 await self.start()
51 assert self.status == "running"
52 return self
~/miniconda3/envs/coiledstream/lib/python3.8/site-packages/distributed/deploy/ssh.py in start(self)
127 line = await self.proc.stderr.readline()
128 if not line.strip():
--> 129 raise Exception("Worker failed to start")
130 logger.info(line.strip())
131 if "worker at" in line:
Exception: Worker failed to start
Environment:
- Dask version: 2.23.0
- Python version: 3.8.5
- Operating System: macOS 10.14 and Ubuntu 10.04
- Install method (conda, pip, source): conda
Top GitHub Comments
Yes please, that would be great!
@jacobtomlinson I can work on a PR for this if it is fine with you. Thanks!