
Handling pre-emption on slurm with fairtask?


I’d like to run a large number of jobs on scavenge that can handle pre-emption. Do I need to modify any hydra/fairtask config for this? Here’s a minimal working example where I try to get 6_sweep to restart when pre-empted; I’m having some trouble with it:

We can add import time; time.sleep(1e6) to experiment.py and then run ./experiment.py -m. We can see this job on the cluster:
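
For concreteness, the modified experiment.py looks roughly like the sketch below; the decorator arguments and the cfg.pretty() logging are my approximation of the 6_sweep example, and only the time.sleep call is the actual change:

import logging
import time

import hydra

log = logging.getLogger(__name__)

@hydra.main(config_path='config.yaml')
def experiment(cfg):
    log.info(cfg.pretty())
    # Keep the task alive long enough to send the job a signal by hand.
    time.sleep(1e6)

if __name__ == '__main__':
    experiment()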

And I have a dask dashboard for it:

[screenshot: dask dashboard]

I then send a USR1 signal to my job, which according to https://our.internmc.facebook.com/intern/wiki/FAIR/Platforms/FAIRClusters/SLURMGuide/ is what gets sent for pre-emptions:

[screenshot omitted]

$ scancel --signal=USR1 4817474

But then my job just gets killed and never comes back online:

[screenshot omitted]

And I can see in the logs that my job got the USR1 signal, but I’m not sure of the best way to trigger a restart when this happens (a rough sketch of what I had in mind follows the logs):

6_sweep(master*)$ tail /checkpoint/bda/outputs/2019-08-28_08-26-42/.slurm/slurm-4817474.* -n 100
==> /checkpoint/bda/outputs/2019-08-28_08-26-42/.slurm/slurm-4817474.err <==
distributed.nanny - INFO -         Start Nanny at: 'tcp://100.97.16.233:34063'
distributed.diskutils - INFO - Found stale lock file and directory '/private/home/bda/.fairtask/dask-worker-space/worker-_z3oli3u', purging
distributed.worker - INFO -       Start worker at:  tcp://100.97.16.233:42249
distributed.worker - INFO -          Listening to:  tcp://100.97.16.233:42249
distributed.worker - INFO - Waiting to connect to:  tcp://100.97.17.198:41779
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                         10
distributed.worker - INFO -                Memory:                   64.00 GB
distributed.worker - INFO -       Local Directory: /private/home/bda/.fairtask/dask-worker-space/worker-6z34zo8f
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to:  tcp://100.97.17.198:41779
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
/private/home/bda/anaconda3/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 6 leaked semaphores to clean up at shutdown
  len(cache))
srun: error: learnfair087: task 0: User defined signal 1

==> /checkpoint/bda/outputs/2019-08-28_08-26-42/.slurm/slurm-4817474.out <==
[2019-08-28 08:27:10,572][__main__][INFO] - optimizer:
  lr: 0.001
  type: nesterov

6_sweep(master*)$ tail /checkpoint/bda/outputs/2019-08-28_08-26-42/0_4817474/UNKNOWN_NAME.log -n 100
[2019-08-28 08:27:10,572][__main__][INFO] - optimizer:
  lr: 0.001
  type: nesterov
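
What I was hoping for, roughly, is to trap SIGUSR1 inside the task and requeue the job before SLURM kills it, as in the sketch below. This is just the generic SLURM pattern, not anything fairtask does today; scontrol requeue and the SLURM_JOB_ID environment variable are standard SLURM, but it assumes the signal is actually forwarded to the task process, which is exactly the part I’m unsure about here:

import os
import signal
import subprocess

def handle_usr1(signum, frame):
    # Ask SLURM to put this allocation back in the queue, then exit so the
    # requeued job can start fresh (or resume from a checkpoint).
    job_id = os.environ.get('SLURM_JOB_ID')
    if job_id is not None:
        subprocess.call(['scontrol', 'requeue', job_id])
    raise SystemExit(0)

signal.signal(signal.SIGUSR1, handle_usr1)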

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 15 (15 by maintainers)

Top GitHub Comments

1 reaction
calebho commented, Aug 29, 2019

It may be because your versions of dask* and distributed are incompatible; fairtask was written when both were v1. Let me double check. If it turns out they’re incompatible, I’ll open an issue in fairtask to bump the versions to v2.
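
A quick way to check which versions are in the environment (just a sanity check, not part of fairtask):

import dask
import distributed

# fairtask was written against the 1.x series of both packages; mixing in
# a 2.x release of either is the suspected incompatibility here.
print('dask', dask.__version__)
print('distributed', distributed.__version__)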

1 reaction
omry commented, Aug 29, 2019

@bamos, if the conclusion of this investigation is that we get a new job directory on preemption, please file an issue against Hydra. The re-queued job should run in the same directory to allow resuming from a checkpoint.
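
For context, the reason the directory matters: the usual resume pattern looks for a checkpoint file relative to the job’s working directory, so a requeued run that lands in a fresh Hydra output directory will never find it. A rough sketch of that pattern (the file name and state layout are hypothetical):

import os
import pickle

CKPT = 'checkpoint.pkl'  # hypothetical name, relative to the run directory

def save_checkpoint(state):
    # Called periodically, or from the USR1 handler, before preemption.
    with open(CKPT, 'wb') as f:
        pickle.dump(state, f)

def load_checkpoint():
    # Only finds anything if the requeued job runs in the same directory
    # as the preempted one; otherwise it always starts from scratch.
    if os.path.exists(CKPT):
        with open(CKPT, 'rb') as f:
            return pickle.load(f)
    return None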

