[BUG] `scontrol` error when checkpointing / preemption on slurm
Hi,
For me, submitit works great when there is no need for checkpointing / preemption, but I get the following error when I need to checkpoint:
`FileNotFoundError: [Errno 2] No such file or directory: 'scontrol'`
Specifically, I can reproduce this error by running `docs/mnist.py`. I ran the following three versions of the mnist example to understand the issue:
- Running `docs/mnist.py` on slurm as is, I get the previous error. Full logs: stderr, stdout
- If I ssh into a slurm node that I get allocated and run `docs/mnist.py` on the local executor (`cluster="local"`), everything works as it should, so submitit + checkpointing works fine.
- Running `docs/mnist.py` without preemption (removing `timeout_min` and `job._interrupt()`), everything works fine, so slurm + submitit works fine.
Also, `scontrol` seems to work fine on my login node, so I don’t understand why the `check_call(["scontrol", "requeue", jid])` does not work. That being said, `scontrol` does not work on the nodes I get allocated (it only works from the login nodes), but from my understanding `check_call(["scontrol", "requeue", jid])` is called from where I call submitit, and thus not having `scontrol` on the allocated nodes shouldn’t be an issue. Am I correct?
Thank you!
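One way to test that assumption is to check where `scontrol` resolves in the environment that actually runs the job. The sketch below is hypothetical: `locate_tool` is an illustrative helper, not part of submitit, and it relies only on `shutil.which` from the standard library. Running it on the login node and again inside a submitted job should show whether the two environments see the same `PATH`.

```python
import shutil
from typing import Optional


def locate_tool(name: str = "scontrol", path: Optional[str] = None) -> Optional[str]:
    """Return the full path of `name` if it is found on the search path, else None.

    With path=None, shutil.which searches the PATH of the current process,
    i.e. the same lookup a bare command name such as "scontrol" depends on.
    """
    return shutil.which(name, path=path)


if __name__ == "__main__":
    found = locate_tool()
    if found is None:
        # This is the situation in which calling a bare "scontrol"
        # fails with FileNotFoundError.
        print("scontrol is not on PATH in this environment")
    else:
        print(f"scontrol resolves to {found}")
```

If the check prints a path on the login node but not inside the job, the requeue call is most likely being executed in an environment where the Slurm binaries are not on `PATH`.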
Issue Analytics
- Created 3 years ago
- Comments: 5 (1 by maintainers)
Top GitHub Comments
I meant adding `export PATH="$PATH:/opt/slurm/bin"`, e.g. in your `~/.bashrc`.
Thanks Jeremey, I found a way around it.
In case someone has this issue in the future, I was able to solve it by adding slurm to my path (in my case `/opt/slurm/bin`).
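For anyone who would rather not edit `~/.bashrc`, the same idea can be sketched in Python by extending `PATH` at the start of the job, before anything tries to call `scontrol`. This is only a sketch under assumptions: `ensure_on_path` is a hypothetical helper rather than a submitit function, `/opt/slurm/bin` is the location mentioned in this thread and will likely differ on other clusters, and it only helps if the requeue call runs in the same process whose environment was modified.

```python
import os
from typing import MutableMapping


def ensure_on_path(directory: str, env: MutableMapping[str, str] = os.environ) -> str:
    """Append `directory` to env["PATH"] unless it is already listed.

    Empty PATH entries are dropped while rebuilding the value.
    """
    entries = [e for e in env.get("PATH", "").split(os.pathsep) if e]
    if directory not in entries:
        entries.append(directory)
    env["PATH"] = os.pathsep.join(entries)
    return env["PATH"]


# Assumed Slurm location, taken from this thread; adjust for your cluster.
SLURM_BIN = "/opt/slurm/bin"

if __name__ == "__main__":
    ensure_on_path(SLURM_BIN)
    print(os.environ["PATH"])
```

Calling `ensure_on_path(SLURM_BIN)` as the first statement of the function submitted to the executor would mirror the `export PATH=...` line above, without depending on which shell startup files the compute nodes read.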