[BUG] `Scontrol` Error when checkpointing / preemption on slurm

Hi,

For me, submitit works great when there is no need for checkpointing / preemption, but I get the following error when I need to checkpoint: FileNotFoundError: [Errno 2] No such file or directory: 'scontrol'

Specifically, I can reproduce this error by running docs/mnist.py. I ran the following three versions of the mnist example to understand the issue:

  • Running docs/mnist.py on slurm as is, I get the previous error. Full logs: stderr, stdout
  • If I ssh into a slurm node that I get allocated to and run docs/mnist.py on the local executor (cluster="local"), everything works as it should: so submitit + checkpointing works fine.
  • Running docs/mnist.py without preemption (removing timeout_min and job._interrupt()), everything works fine: so slurm + submitit works fine.

Also, scontrol seems to work fine on my login node, so I don’t understand why check_call(["scontrol", "requeue", jid]) does not work. That being said, scontrol does not work on the nodes I get allocated to (it only works from the login nodes). From my understanding, though, check_call(["scontrol", "requeue", jid]) is called from where I call submitit, so not having scontrol on the allocated nodes shouldn’t be an issue. Am I correct?
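One way to narrow this down is to check where `scontrol` resolves in each environment. A `FileNotFoundError` from `subprocess.check_call` means the executable could not be found on the `PATH` of the process making the call, wherever that process runs. The following is a minimal diagnostic sketch, not part of submitit; the helper names `locate` and `run_or_explain` are hypothetical, and `scontrol --version` is used only as a harmless command to try.

```python
import shutil
import subprocess
from typing import List, Optional


def locate(executable: str) -> Optional[str]:
    """Return the full path that PATH resolves `executable` to, or None."""
    return shutil.which(executable)


def run_or_explain(cmd: List[str]) -> str:
    """Run `cmd` and describe the outcome instead of raising."""
    try:
        subprocess.check_call(cmd)
    except FileNotFoundError:
        # Same error class as in the issue: the executable is not on PATH.
        return f"{cmd[0]!r} was not found on PATH in this process"
    except subprocess.CalledProcessError as err:
        return f"{cmd[0]!r} ran but exited with status {err.returncode}"
    return f"{cmd[0]!r} ran successfully"


if __name__ == "__main__":
    print("scontrol resolves to:", locate("scontrol"))
    print(run_or_explain(["scontrol", "--version"]))
```

Running this once on the login node and once inside a job on an allocated node should show whether the two environments see the same `PATH`. If the requeue call is made from inside the job (an assumption here, not something stated in this issue), the second result is the one that matters.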

Thank you!

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 5 (1 by maintainers)

Top GitHub Comments

YannDubs commented, Jul 7, 2021 (1 reaction)

I meant adding export PATH="$PATH:/opt/slurm/bin", e.g. in your ~/.bashrc.

YannDubs commented, Jan 25, 2021 (1 reaction)

Thanks Jeremey, I found a way around it.

In case someone has this issue in the future: I was able to solve it by adding slurm to my PATH (in my case, /opt/slurm/bin).
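For anyone who would rather apply the same idea from Python than from a shell startup file, here is a rough sketch. It assumes the Slurm binaries live in /opt/slurm/bin as in my case (yours may differ), and `ensure_on_path` is a hypothetical helper, not a submitit function. Whether changing `PATH` this way reaches the process that calls `scontrol` depends on where and when it runs, so treat it as an illustration rather than a confirmed fix.

```python
import os
from typing import MutableMapping


def ensure_on_path(directory: str,
                   env: MutableMapping[str, str] = os.environ) -> bool:
    """Append `directory` to env["PATH"] if missing; return True if it was added."""
    entries = [p for p in env.get("PATH", "").split(os.pathsep) if p]
    if directory in entries:
        return False  # already present, nothing to do
    entries.append(directory)
    env["PATH"] = os.pathsep.join(entries)
    return True


if __name__ == "__main__":
    # /opt/slurm/bin is the location from this issue; adjust for your cluster.
    added = ensure_on_path("/opt/slurm/bin")
    print("PATH updated" if added else "PATH already contained the directory")
```

Appending rather than prepending mirrors the `export PATH="$PATH:/opt/slurm/bin"` line, so anything already on `PATH` keeps its priority.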
