--targetTime for Slurm? Cactus/Toil impacting Slurm
Hello Toil/Cactus community!
I rewrote a bit of toil/batchSystems/slurm.py, and by doing so, I was able to resolve #2323; with my patch, Toil hands jobs off to Slurm essentially immediately, and Slurm is now the bottleneck.
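For context, the gist of the change is simply to call sbatch as soon as Toil issues a job, rather than waiting on a polling cycle. A rough illustration of the idea (not the actual patch; the helper name here is made up):

import subprocess

def submit_immediately(command, cpus, mem_mb):
    """Hand a single job off to Slurm right away and return its Slurm job ID."""
    sbatch_cmd = [
        "sbatch", "--parsable",            # --parsable prints only the job ID
        "--cpus-per-task={}".format(cpus),
        "--mem={}".format(mem_mb),
        "--wrap", command,                 # wrap the shell command in a batch script
    ]
    result = subprocess.run(sbatch_cmd, capture_output=True, text=True, check=True)
    return int(result.stdout.strip().split(";")[0])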
Unfortunately, this appears to be causing problems for our Slurm installation; it may be more than it can handle. I haven’t been given many details, but from what I’ve been told, the problems stem from Cactus/Toil launching thousands and thousands of very short jobs. These jobs each run for less than 30 seconds, and they are apparently straining the scheduler; I’ve been told that my current Cactus run launched 250,000 such jobs on Sunday.
One of our HPC’s system administrators suggested we try --targetTime, but we’ve looked into it, and it appears that code is only applicable if you’re using auto-scaling clusters (e.g. AWS). He’s now looking to see if there’s something he could do (presumably some Toil code he could tweak / write) to alleviate the problem.
Would it be possible to port the --targetTime code over to Toil’s Slurm / BatchSystem code? If so, is this something you could do? If it’s not something you could do soon, do you think it would be difficult for us to do ourselves? I took a quick glance at it, and it doesn’t look trivial, but I haven’t looked deeply enough yet to determine whether it would be doable. If you have any advice on how to move forward, we would love to hear it.
For reference, here’s the help on --targetTime:
--targetTime TARGETTIME
Sets how rapidly you aim to complete jobs in seconds.
Shorter times mean more aggressive parallelization.
The autoscaler attempts to scale up/down so that it
expects all queued jobs will complete within
targetTime seconds. default=1800
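As I understand it, the calculation behind that knob is conceptually simple: size the cluster so the queued work is expected to drain within targetTime. Something like the following (a sketch of the concept only, not Toil’s actual autoscaler code):

import math

def desired_nodes(queued_job_runtimes_s, cores_per_node, target_time_s=1800):
    """Estimate how many nodes are needed so queued jobs finish within target_time_s."""
    total_work_s = sum(queued_job_runtimes_s)               # total CPU-seconds of queued work
    needed_slots = math.ceil(total_work_s / target_time_s)  # jobs that must run concurrently
    return max(1, math.ceil(needed_slots / cores_per_node))

# e.g. 250,000 thirty-second jobs on 32-core nodes:
# desired_nodes([30] * 250000, cores_per_node=32)  ->  131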
Of course, a solution does not necessarily require that --targetTime be ported over to the batchSystems code. Anything that could reduce the number of small jobs would work. We would love to hear any suggestions you might have.
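For example, one generic way to take pressure off the scheduler (outside of Toil itself) would be to pack many short commands into each Slurm submission, so Slurm sees hundreds of jobs instead of hundreds of thousands. A sketch of that idea, with an arbitrary batch size:

import subprocess

def submit_packed(commands, batch_size=500):
    """Group short shell commands into chunks and submit each chunk as one Slurm job."""
    for start in range(0, len(commands), batch_size):
        chunk = commands[start:start + batch_size]
        script = "#!/bin/bash\nset -e\n" + "\n".join(chunk) + "\n"
        subprocess.run(["sbatch"], input=script, text=True, check=True)  # sbatch reads the script from stdin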
Thank you for your time!!! Jason
Top GitHub Comments
I’ve used Mesos without a provisioner with the (old) Azure cluster template setup, where you deploy a fixed-sized Mesos-managed cluster. It worked fine for me. Toil’s provisioning logic is pretty well self-contained, and Mesos doesn’t need a provisioner of its own, or anything cloud-specific. It may need more user permissions than your cluster will give you, however, depending on how you want it set up. For example, I think you can set it to sandbox jobs with cgroups, but it can’t do that if it can’t control cgroups.
I think we don’t list Mesos with the HPC batch systems because it isn’t really an HPC batch system. There’s no submission command that Toil runs to talk to it; we interface directly with its HTTP API.
Not that many people actually have Mesos clusters, either, so maybe that’s why Cactus doesn’t mention it.
Cactus now has a way to run with much larger individual jobs, and --targetTime can’t do what this issue wanted it to do, so I am going to close this out. @jasonsydes We’d still be interested in your #2323 fix if you have it handy.