--targetTime for Slurm? Cactus/Toil impacting Slurm
Hello Toil/Cactus community!
I rewrote a bit of toil/batchSystems/slurm.py, and by doing so, I was able to resolve #2323; with my patch, Toil hands jobs off to Slurm essentially immediately, and Slurm is now the bottleneck.
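For context, the gist of the change is simply to call sbatch as soon as Toil issues a job, rather than waiting on a polling cycle. A rough illustration of the idea (not the actual patch; the helper name here is made up):

import subprocess

def submit_immediately(command, cpus, mem_mb):
    """Hand a single job off to Slurm right away and return its Slurm job ID."""
    sbatch_cmd = [
        "sbatch", "--parsable",            # --parsable prints only the job ID
        "--cpus-per-task={}".format(cpus),
        "--mem={}".format(mem_mb),
        "--wrap", command,                 # wrap the shell command in a batch script
    ]
    result = subprocess.run(sbatch_cmd, capture_output=True, text=True, check=True)
    return int(result.stdout.strip().split(";")[0])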
Unfortunately, this appears to be causing problems for our Slurm installation; it may be more than it can handle. I haven’t been given many details, but from what I’ve been told, the problems stem from Cactus/Toil launching thousands and thousands of very short jobs. These jobs each run for less than 30 seconds, and they are apparently straining the scheduler; I’ve been told that my current Cactus run launched 250,000 such jobs on Sunday.
One of our HPC’s system administrators suggested we try --targetTime, but we’ve looked into it, and it appears that code is only applicable if you’re using auto-scaling clusters (e.g. AWS). He’s now looking to see if there’s something he could do (presumably some Toil code he could tweak / write) to alleviate the problem.
Would it be possible to port the --targetTime code over to Toil’s Slurm / BatchSystem code? If so, is this something you could do? If it’s not something you could do soon, do you think it would be difficult for us to do ourselves? I took a quick glance at it, and it doesn’t look trivial, but I haven’t looked deeply enough yet to determine whether it would be doable. If you have any advice on how to move forward, we would love to hear it.
For reference, here’s the help on --targetTime:
--targetTime TARGETTIME
Sets how rapidly you aim to complete jobs in seconds.
Shorter times mean more aggressive parallelization.
The autoscaler attempts to scale up/down so that it
expects all queued jobs will complete within
targetTime seconds. default=1800
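As I understand it, the calculation behind that knob is conceptually simple: size the cluster so the queued work is expected to drain within targetTime. Something like the following (a sketch of the concept only, not Toil’s actual autoscaler code):

import math

def desired_nodes(queued_job_runtimes_s, cores_per_node, target_time_s=1800):
    """Estimate how many nodes are needed so queued jobs finish within target_time_s."""
    total_work_s = sum(queued_job_runtimes_s)               # total CPU-seconds of queued work
    needed_slots = math.ceil(total_work_s / target_time_s)  # jobs that must run concurrently
    return max(1, math.ceil(needed_slots / cores_per_node))

# e.g. 250,000 thirty-second jobs on 32-core nodes:
# desired_nodes([30] * 250000, cores_per_node=32)  ->  131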
Of course, a solution does not necessarily require that --targetTime be ported over to the batchSystems code. Anything that could reduce the number of small jobs would work. We would love to hear any suggestions you might have.
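For example, one generic way to take pressure off the scheduler (outside of Toil itself) would be to pack many short commands into each Slurm submission, so Slurm sees hundreds of jobs instead of hundreds of thousands. A sketch of that idea, with an arbitrary batch size:

import subprocess

def submit_packed(commands, batch_size=500):
    """Group short shell commands into chunks and submit each chunk as one Slurm job."""
    for start in range(0, len(commands), batch_size):
        chunk = commands[start:start + batch_size]
        script = "#!/bin/bash\nset -e\n" + "\n".join(chunk) + "\n"
        subprocess.run(["sbatch"], input=script, text=True, check=True)  # sbatch reads the script from stdin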
Thank you for your time!!! Jason
Top GitHub Comments
I’ve used Mesos without a provisioner with the (old) Azure cluster template setup, where you deploy a fixed-sized Mesos-managed cluster. It worked fine for me. Toil’s provisioning logic is pretty well self-contained, and Mesos doesn’t need a provisioner of its own, or anything cloud-specific. It may need more user permissions than your cluster will give you, however, depending on how you want it set up. For example, I think you can set it to sandbox jobs with cgroups, but it can’t do that if it can’t control cgroups.
I think we don’t list Mesos with the HPC batch systems because it isn’t really an HPC batch system. There’s no submission command that Toil runs to talk to it; we interface directly with its HTTP API.
Not that many people actually have Mesos clusters, either, so maybe that’s why Cactus doesn’t mention it.
Cactus now has a way to run with much larger individual jobs, and --targetTime can’t do what this issue wanted it to do, so I am going to close this out. @jasonsydes We’d still be interested in your #2323 fix if you have it handy.