
[docs] Issue on `deploying-on-slurm.rst`


Documentation Problem/Question/Comment

This SLURM example seems to assume that /tmp is a directory shared by all the nodes. I think this is not generally true.

While testing this example, I got errors of the kind “could not connect to IPC socket”, because the nodes did not share the same temp directory, which contains the communication sockets. To make it work, I had to set `--temp-dir` to a shared location.

(Created directly from the docs)
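The workaround described above can be sketched as a small helper that builds a `ray start` command line with an explicit `--temp-dir`. This is only an illustration, not the command from the SLURM docs: the helper name `ray_start_command` and the path `/scratch/shared/ray_tmp` are hypothetical, and the `--port` and `--address` flags are the ones used by recent Ray releases, which may differ from the version discussed in this issue.

```python
import shlex


def ray_start_command(temp_dir, head=True, port=6379, head_address=None):
    """Build a `ray start` command line that sets --temp-dir explicitly.

    Hypothetical helper for illustration; check `ray start --help` for the
    flags supported by the installed Ray version.
    """
    cmd = ["ray", "start"]
    if head:
        cmd += ["--head", f"--port={port}"]
    else:
        if head_address is None:
            raise ValueError("worker nodes need head_address (host:port)")
        cmd += [f"--address={head_address}"]
    # Point Ray at the chosen temp directory instead of the default under /tmp.
    cmd.append(f"--temp-dir={temp_dir}")
    return cmd


# Assumed path for illustration only; use whatever location your cluster provides.
head_cmd = ray_start_command("/scratch/shared/ray_tmp")
worker_cmd = ray_start_command(
    "/scratch/shared/ray_tmp", head=False, head_address="node001:6379"
)
print(shlex.join(head_cmd))
print(shlex.join(worker_cmd))
```

In a SLURM batch script, the printed commands would typically be launched on each node with `srun`. Whether the directory really has to be shared is disputed in the comments below.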

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

richardliaw commented, Feb 11, 2020 (1 reaction)

Yeah, we need to clean up these docs.

temp_dir doesn’t need to be shared. In fact, nothing actually needs to be shared; files can be moved around using rsync.

lemairecarl commented, Feb 12, 2020 (0 reactions)

Doesn’t the temp_dir contain the communication sockets? If that’s the case, the folder obviously needs to be shared.

Since I’m not using Ray myself (I was doing support for someone else), I don’t think I’m qualified to drill down on this and/or edit the docs. I’m withdrawing.
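For readers unfamiliar with the term, here is a generic sketch of what "communication sockets in a temp directory" can mean. It creates a Unix domain socket as a file inside a temporary directory and connects to it from the same machine. It does not use Ray, and it does not reproduce Ray's actual socket layout, which is not shown in this thread. The directory prefix and file name are made up for the demo.

```python
import os
import socket
import stat
import tempfile

# A throwaway directory standing in for a temp dir (name is illustrative).
temp_dir = tempfile.mkdtemp(prefix="ipc_demo_")
sock_path = os.path.join(temp_dir, "demo.sock")

# Binding an AF_UNIX socket creates a socket file at the given path.
server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
server.bind(sock_path)
server.listen(1)

# The path now exists on disk and is reported as a socket, not a regular file.
is_socket = stat.S_ISSOCK(os.stat(sock_path).st_mode)

# A client on the same host connects by opening that path.
client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
client.connect(sock_path)
conn, _ = server.accept()
client.sendall(b"ping")
reply = conn.recv(4)

# Clean up the sockets and the socket file.
conn.close()
client.close()
server.close()
os.unlink(sock_path)
os.rmdir(temp_dir)

print(is_socket, reply)
```

Note that a Unix domain socket connects processes running on the same host. The sketch does not settle whether the directory must be shared between nodes in Ray's case.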
