Handling pre-emption on slurm with fairtask?
I’d like to run a large number of jobs on scavenge that can handle pre-emption. Do I need to modify any Hydra/fairtask config for this? Here’s a minimal working example of me trying to get `6_sweep` to restart when pre-empted that I’m having some trouble with:
We can add `import time; time.sleep(1e6)` to `experiment.py` and then run `./experiment.py -m`.
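For reference, the modified script looks roughly like the sketch below (the decorator arguments and config layout depend on the Hydra version and the demo; the only real change is the added sleep):

```python
#!/usr/bin/env python
# Rough sketch of the modified experiment.py; config_path here is illustrative.
# Hydra 1.x splits this into config_path (a directory) and config_name.
import logging
import time

import hydra

log = logging.getLogger(__name__)


@hydra.main(config_path="config.yaml")
def experiment(cfg):
    log.info(cfg)
    # Keep the job alive long enough to pre-empt it by hand.
    time.sleep(1e6)


if __name__ == "__main__":
    experiment()
```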
We can see this job on the cluster:
And I have a dask dashboard for it:
I then send a `USR1` signal to my job, which according to https://our.internmc.facebook.com/intern/wiki/FAIR/Platforms/FAIRClusters/SLURMGuide/ is what gets sent for pre-emptions:
$ scancel --signal=USR1 4817474
But then my job just gets killed and never comes back online:
And I can see in the logs that my job got the `USR1` signal, but I’m not sure of the best way to trigger a restart when this happens:
6_sweep(master*)$ tail /checkpoint/bda/outputs/2019-08-28_08-26-42/.slurm/slurm-4817474.* -n 100
==> /checkpoint/bda/outputs/2019-08-28_08-26-42/.slurm/slurm-4817474.err <==
distributed.nanny - INFO - Start Nanny at: 'tcp://100.97.16.233:34063'
distributed.diskutils - INFO - Found stale lock file and directory '/private/home/bda/.fairtask/dask-worker-space/worker-_z3oli3u', purging
distributed.worker - INFO - Start worker at: tcp://100.97.16.233:42249
distributed.worker - INFO - Listening to: tcp://100.97.16.233:42249
distributed.worker - INFO - Waiting to connect to: tcp://100.97.17.198:41779
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 10
distributed.worker - INFO - Memory: 64.00 GB
distributed.worker - INFO - Local Directory: /private/home/bda/.fairtask/dask-worker-space/worker-6z34zo8f
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tcp://100.97.17.198:41779
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
/private/home/bda/anaconda3/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 6 leaked semaphores to clean up at shutdown
len(cache))
srun: error: learnfair087: task 0: User defined signal 1
==> /checkpoint/bda/outputs/2019-08-28_08-26-42/.slurm/slurm-4817474.out <==
[2019-08-28 08:27:10,572][__main__][INFO] - optimizer:
  lr: 0.001
  type: nesterov
6_sweep(master*)$ tail /checkpoint/bda/outputs/2019-08-28_08-26-42/0_4817474/UNKNOWN_NAME.log -n 100
[2019-08-28 08:27:10,572][__main__][INFO] - optimizer:
  lr: 0.001
  type: nesterov
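Outside of fairtask/Hydra I’d normally handle this by catching the signal and requeueing the job myself, roughly like the sketch below (the checkpointing call is a placeholder, and the job has to be requeueable, e.g. submitted with `--requeue`), but I’m not sure how that interacts with fairtask’s job management:

```python
# Rough sketch, not fairtask/Hydra API: catch SIGUSR1, checkpoint, and ask SLURM
# to requeue the job so it comes back online after pre-emption.
import os
import signal
import subprocess


def handle_preemption(signum, frame):
    # save_checkpoint() would be a user-defined function; omitted here.
    job_id = os.environ.get("SLURM_JOB_ID")
    if job_id is not None:
        # Only works if the job is requeueable (e.g. submitted with --requeue).
        subprocess.check_call(["scontrol", "requeue", job_id])


signal.signal(signal.SIGUSR1, handle_preemption)
```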
Top GitHub Comments
It may be because your versions of `dask` and `distributed` are incompatible; fairtask was written when both were v1. Let me double check. If it turns out to be incompatible, I’ll open an issue in fairtask to bump the versions to v2.

@bamos, if the conclusion of this investigation is that we get a new job directory on preemption, please file an issue against Hydra. The re-queued job should run in the same directory to allow resuming from a checkpoint.
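To illustrate why the directory matters, here is a rough sketch of the resume pattern this enables (the paths and the torch usage are placeholders, not Hydra or fairtask API): a job restarted in the same working directory simply checks for the checkpoint left by the previous run.

```python
# Rough illustration: if the re-queued job runs in the same working directory,
# resuming is just a matter of looking for the previous run's checkpoint.
import os

import torch  # assuming a PyTorch-style checkpoint; any serialization works

CKPT = "checkpoint.pt"  # relative to the job's working directory


def maybe_resume(model, optimizer):
    """Return the step to resume from, loading saved state if a checkpoint exists."""
    if os.path.exists(CKPT):
        state = torch.load(CKPT)
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        return state["step"]
    return 0
```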