[docs] Issue on `deploying-on-slurm.rst`
Documentation Problem/Question/Comment
This SLURM example seems to assume that `/tmp` is a directory shared by all the nodes. I think this is not generally true.
While testing this example, I got errors of the kind “could not connect to IPC socket”, because the nodes did not share the same temp directory, which contains the communication sockets. To make it work, I had to set `--temp-dir` to a shared location.
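As a rough illustration of the workaround described above, the following Python sketch builds a `ray start` command line for the head node with `--temp-dir` pointing at a shared location. The helper name `ray_start_head_command` and the path `/shared/scratch/ray_tmp` are hypothetical and not taken from the original report. The report does not say on which nodes the flag was set, so whether worker nodes need it too is left open here; `ray start --help` for the installed version is the place to check.

```python
import os
import shlex


def ray_start_head_command(temp_dir):
    """Build a `ray start` command for the head node with an explicit temp dir.

    Hypothetical helper for illustration; it only assembles the argument
    list and does not launch Ray.
    """
    if not os.path.isabs(temp_dir):
        raise ValueError("temp_dir should be an absolute path")
    return ["ray", "start", "--head", f"--temp-dir={temp_dir}"]


# Assumed shared location; substitute a directory that every node can see.
shared_tmp = os.path.join("/shared/scratch", "ray_tmp")

# Print the command as it could appear in an sbatch script.
print(shlex.join(ray_start_head_command(shared_tmp)))
```

The printed line can be pasted into the SLURM batch script in place of the plain `ray start --head` call. Keeping the path short may also matter, since Unix domain socket paths have a fairly small length limit.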
(Created directly from the docs)
Issue Analytics
- State:
- Created 4 years ago
- Comments: 5 (2 by maintainers)
Top GitHub Comments
Yeah, we need to clean up these docs. `temp_dir` doesn’t need to be shared. In fact, nothing actually needs to be shared; files can be moved around using rsync.

Doesn’t the `temp_dir` contain the communication sockets? If that’s the case, the folder obviously needs to be shared.

Since I’m not using Ray myself (I was doing support for someone else), I don’t think I’m qualified to drill down on this and/or edit the docs. I’m withdrawing.