dask-ssh fails if Python is installed in different paths across the workers
I tested distributed on a very simple office “cluster”: my laptop and an office server. Both run Ubuntu 14.04, but I installed Python differently on each. On my laptop I did a user install of miniconda, and on the server I installed anaconda as root. The corresponding Python paths are:
- 10.1.0.115 --> ‘/home/aguirre/miniconda2/bin/python’ (my laptop)
- 10.1.0.118 --> ‘/opt/anaconda2/bin/python’ (server)
If I manually launch `dask-worker` and `dask-scheduler`, everything works fine. But if I try `dask-ssh`, it does not work:
$ dask-ssh 10.1.0.{115,118}
---------------------------------------------------------------
Dask.distributed v1.11.0
Worker nodes:
0: 10.1.0.115
1: 10.1.0.118
scheduler node: 10.1.0.115:8786
---------------------------------------------------------------
[ scheduler 10.1.0.115:8786 ] : /home/aguirre/miniconda2/bin/python -m distributed.cli.dask_scheduler --port 8786
[ worker 10.1.0.115 ] : /home/aguirre/miniconda2/bin/python -m distributed.cli.dask_worker 10.1.0.115:8786 --host 10.1.0.115 --nthreads 0 --nprocs 1
[ worker 10.1.0.118 ] : /home/aguirre/miniconda2/bin/python -m distributed.cli.dask_worker 10.1.0.115:8786 --host 10.1.0.118 --nthreads 0 --nprocs 1
[ scheduler 10.1.0.115:8786 ] : distributed.scheduler - INFO - Scheduler at: 10.1.0.115:8786
[ scheduler 10.1.0.115:8786 ] : distributed.scheduler - INFO - http at: 10.1.0.115:9786
[ scheduler 10.1.0.115:8786 ] : distributed.scheduler - WARNING - Could not start Bokeh web UI
[ scheduler 10.1.0.115:8786 ] : Traceback (most recent call last):
[ scheduler 10.1.0.115:8786 ] : File "/home/aguirre/miniconda2/lib/python2.7/site-packages/distributed/cli/dask_scheduler.py", line 92, in main
[ scheduler 10.1.0.115:8786 ] : bokeh_proc = subprocess.Popen(args)
[ scheduler 10.1.0.115:8786 ] : File "/home/aguirre/miniconda2/lib/python2.7/subprocess.py", line 710, in __init__
[ scheduler 10.1.0.115:8786 ] : errread, errwrite)
[ scheduler 10.1.0.115:8786 ] : File "/home/aguirre/miniconda2/lib/python2.7/subprocess.py", line 1335, in _execute_child
[ scheduler 10.1.0.115:8786 ] : raise child_exception
[ scheduler 10.1.0.115:8786 ] : OSError: [Errno 2] No such file or directory
[ worker 10.1.0.118 ] : bash: /home/aguirre/miniconda2/bin/python: No such file or directory
[ worker 10.1.0.118 ] : remote process exited with exit status 127
As you can see, the worker on 10.1.0.118 tries to call Python at the wrong path (`/home/aguirre/miniconda2/bin/python`), which happens to be the path on the scheduler node (10.1.0.115).
I took a look at the code and I think the problem lies on line 189 of cluster.py. It builds the command to be launched on each worker using the Python path of the node where `dask-ssh` was launched. Just to check, I hard-coded the Python path of 10.1.0.118 on line 189 of cluster.py, and it correctly launches that worker! However, it now fails to launch a worker on 10.1.0.115, which is expected.
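To make the failure mode concrete, here is a minimal sketch of a per-host interpreter lookup. It is not the code in cluster.py: `REMOTE_PYTHON` and `worker_command` are hypothetical names, and the fallback to `sys.executable` is my assumption about how the launching node's path ends up in every command. The module name and `--host` flag are taken from the log above.

```python
import shlex
import sys

# Hypothetical per-host interpreter table, filled with the paths from this report.
REMOTE_PYTHON = {
    "10.1.0.115": "/home/aguirre/miniconda2/bin/python",
    "10.1.0.118": "/opt/anaconda2/bin/python",
}


def worker_command(scheduler_addr, worker_host, remote_python=None):
    """Build the shell command that would start a worker on worker_host.

    If the host has no entry in remote_python, fall back to the local
    interpreter (assumed here to mirror the behaviour described in this issue).
    """
    python = (remote_python or {}).get(worker_host, sys.executable)
    args = [
        python, "-m", "distributed.cli.dask_worker",
        scheduler_addr, "--host", worker_host,
    ]
    # Quote every argument so paths containing spaces survive the remote shell.
    return " ".join(shlex.quote(a) for a in args)


if __name__ == "__main__":
    for host in ("10.1.0.115", "10.1.0.118"):
        print(worker_command("10.1.0.115:8786", host, REMOTE_PYTHON))
```

With a table like this each host gets its own interpreter path. Without it, every host receives the path that is only valid on the launching machine, which matches the error seen on 10.1.0.118.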
BTW, I don’t think the exception raised by the scheduler (10.1.0.115) is related. It seems that it does not find bokeh in the PATH. However, when I launch the scheduler by itself, it does manage to launch the Bokeh web UI. But let’s handle one problem at a time and focus on the Python path part of my case.
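One possible explanation for the bokeh error, not confirmed anywhere in this thread, is that the SSH-launched process sees a different PATH than an interactive terminal does. The sketch below only illustrates that mechanism with the standard library's `shutil.which`. The directories and the fake `bokeh` file are temporary stand-ins, and it assumes a POSIX system.

```python
import os
import shutil
import tempfile


def resolve(executable, path):
    """Return the full path of executable under the given PATH string, or None."""
    return shutil.which(executable, path=path)


with tempfile.TemporaryDirectory() as conda_bin, \
        tempfile.TemporaryDirectory() as system_bin:
    # Stand-in for an executable that lives in a conda "bin" directory.
    fake_bokeh = os.path.join(conda_bin, "bokeh")
    with open(fake_bokeh, "w") as f:
        f.write("#!/bin/sh\n")
    os.chmod(fake_bokeh, 0o755)

    # PATH as an interactive shell might have it (conda directory included) ...
    interactive_path = os.pathsep.join([conda_bin, system_bin])
    # ... and PATH as a non-interactive session might have it (conda directory missing).
    ssh_path = system_bin

    found_interactive = resolve("bokeh", interactive_path)
    found_over_ssh = resolve("bokeh", ssh_path)

print("interactive:", found_interactive is not None)
print("over ssh:", found_over_ssh is not None)
```

If the second lookup fails on the real machine too, spawning `bokeh` by bare name would raise the same "No such file or directory" error shown in the traceback.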
I don’t have many clues about how this could be solved, but with some guidance, I’m willing to lend a hand!
Issue Analytics
- Created 7 years ago
- Comments:15 (9 by maintainers)
Top GitHub Comments
@felipeam86 I see two starting options:
- Document in `dask/docs/source/...rst` that `dask-ssh` is assuming similar environments, such as you might see on a system with a shared file system.
- Look into `paramiko` and learn how to create a connection that respects user environments. This probably involves some googling, some doc reading, and some experimentation on your own two-machine cluster setup. Then play with the implementation in `dask/cluster.py` to implement the changes that you needed in order to make things work well in experiments.
@hussainsultan @felipeam86 ‘conda run’ was removed without much notice in 4.0.10: https://github.com/conda/conda/issues/2682
There is talk of bringing it back, but I don’t see it in conda 4.1.11.
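As a starting point for the `paramiko` experiments suggested above, one idea is to stop sending an absolute interpreter path and instead let a login shell on each host resolve `python` from that user's own environment. This is a sketch under assumptions, not a tested fix: `login_shell_command` and `run_on_host` are hypothetical helpers, and whether `bash -l` actually puts conda on the PATH depends on where each machine sets it (for example `~/.profile` versus `~/.bashrc`).

```python
import shlex


def login_shell_command(module, *args):
    """Wrap `python -m module args...` so a login shell on the remote host resolves python."""
    inner = " ".join(shlex.quote(a) for a in ("python", "-m", module) + args)
    # bash -l reads the login startup files before running the command.
    return "bash -lc " + shlex.quote(inner)


def run_on_host(hostname, command, username=None):
    """Run command over SSH and return (stdout, stderr). Meant for short commands only."""
    import paramiko  # third-party; imported here so the helper above works without it

    client = paramiko.SSHClient()
    client.load_system_host_keys()
    client.connect(hostname, username=username)
    try:
        _stdin, stdout, stderr = client.exec_command(command)
        return stdout.read().decode(), stderr.read().decode()
    finally:
        client.close()


if __name__ == "__main__":
    print(login_shell_command(
        "distributed.cli.dask_worker", "10.1.0.115:8786", "--host", "10.1.0.118"))
```

A quick check on a two-machine setup would be to call `run_on_host(host, "bash -lc 'which python'")` for each host and confirm that each one reports its own interpreter.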