Handling pre-emption on slurm with fairtask?
I’d like to run a large number of jobs on scavenge that can handle pre-emption. Do I need to modify any Hydra/fairtask config for this? Here’s a minimal working example of me trying to get `6_sweep` to restart when pre-empted that I’m having some trouble with:
We can add `import time; time.sleep(1e6)` to `experiment.py` and then run `./experiment.py -m`.
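For reference, the modified script looks roughly like the sketch below (the decorator arguments and config layout depend on the Hydra version and the demo; the only real change is the added sleep):

```python
#!/usr/bin/env python
# Rough sketch of the modified experiment.py; config_path here is illustrative.
# Hydra 1.x splits this into config_path (a directory) and config_name.
import logging
import time

import hydra

log = logging.getLogger(__name__)


@hydra.main(config_path="config.yaml")
def experiment(cfg):
    log.info(cfg)
    # Keep the job alive long enough to pre-empt it by hand.
    time.sleep(1e6)


if __name__ == "__main__":
    experiment()
```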
We can see this job on the cluster:
And I have a dask dashboard for it:
I then send a `USR1` signal to my job, which according to https://our.internmc.facebook.com/intern/wiki/FAIR/Platforms/FAIRClusters/SLURMGuide/ is what gets sent for pre-emptions:
$ scancel --signal=USR1 4817474
But then my job just gets killed and never comes back online:
And I can see in the logs that my job got the `USR1` signal, but I’m not sure of the best way to trigger a restart when this happens:
6_sweep(master*)$ tail /checkpoint/bda/outputs/2019-08-28_08-26-42/.slurm/slurm-4817474.* -n 100
==> /checkpoint/bda/outputs/2019-08-28_08-26-42/.slurm/slurm-4817474.err <==
distributed.nanny - INFO - Start Nanny at: 'tcp://100.97.16.233:34063'
distributed.diskutils - INFO - Found stale lock file and directory '/private/home/bda/.fairtask/dask-worker-space/worker-_z3oli3u', purging
distributed.worker - INFO - Start worker at: tcp://100.97.16.233:42249
distributed.worker - INFO - Listening to: tcp://100.97.16.233:42249
distributed.worker - INFO - Waiting to connect to: tcp://100.97.17.198:41779
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 10
distributed.worker - INFO - Memory: 64.00 GB
distributed.worker - INFO - Local Directory: /private/home/bda/.fairtask/dask-worker-space/worker-6z34zo8f
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tcp://100.97.17.198:41779
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
/private/home/bda/anaconda3/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 6 leaked semaphores to clean up at shutdown
len(cache))
srun: error: learnfair087: task 0: User defined signal 1
==> /checkpoint/bda/outputs/2019-08-28_08-26-42/.slurm/slurm-4817474.out <==
[2019-08-28 08:27:10,572][__main__][INFO] - optimizer:
  lr: 0.001
  type: nesterov
6_sweep(master*)$ tail /checkpoint/bda/outputs/2019-08-28_08-26-42/0_4817474/UNKNOWN_NAME.log -n 100
[2019-08-28 08:27:10,572][__main__][INFO] - optimizer:
  lr: 0.001
  type: nesterov
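Outside of fairtask/Hydra I’d normally handle this by catching the signal and requeueing the job myself, roughly like the sketch below (the checkpointing call is a placeholder, and the job has to be requeueable, e.g. submitted with `--requeue`), but I’m not sure how that interacts with fairtask’s job management:

```python
# Rough sketch, not fairtask/Hydra API: catch SIGUSR1, checkpoint, and ask SLURM
# to requeue the job so it comes back online after pre-emption.
import os
import signal
import subprocess


def handle_preemption(signum, frame):
    # save_checkpoint() would be a user-defined function; omitted here.
    job_id = os.environ.get("SLURM_JOB_ID")
    if job_id is not None:
        # Only works if the job is requeueable (e.g. submitted with --requeue).
        subprocess.check_call(["scontrol", "requeue", job_id])


signal.signal(signal.SIGUSR1, handle_preemption)
```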
Top GitHub Comments
It may be because your versions of `dask` and `distributed` are incompatible; fairtask was written when both were v1. Let me double check. If it turns out to be incompatible, I’ll open an issue in fairtask to bump the versions to v2.

@bamos, if the conclusion of this investigation is that we get a new job directory on preemption, please file an issue against Hydra. The re-queued job should run in the same directory to allow resuming from a checkpoint.
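To illustrate why the directory matters, here is a rough sketch of the resume pattern this enables (the paths and the torch usage are placeholders, not Hydra or fairtask API): a job restarted in the same working directory simply checks for the checkpoint left by the previous run.

```python
# Rough illustration: if the re-queued job runs in the same working directory,
# resuming is just a matter of looking for the previous run's checkpoint.
import os

import torch  # assuming a PyTorch-style checkpoint; any serialization works

CKPT = "checkpoint.pt"  # relative to the job's working directory


def maybe_resume(model, optimizer):
    """Return the step to resume from, loading saved state if a checkpoint exists."""
    if os.path.exists(CKPT):
        state = torch.load(CKPT)
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        return state["step"]
    return 0
```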