Double Mem functionality with batch schedulers toil-cwl-runner

Referencing #2042 here. This is less of an issue and more of an enhancement. On batch systems such as HPC environments, it would be incredibly helpful to have a --doublemem flag for workflows.

In short, the doublemem flag would be passed at runtime on the first instantiation of the workflow. This flag would tell toil to capture failure reasons of jobs, and if the failure reason corresponded to a TERM_MEMLIMIT or TERM_ENOMEM, the specific step that failed would be retried with double the initial requested memory. This would be helpful for genomics workflows with specific steps that did not have well-defined memory profiles (I think most commonly seen with Structural Variants or even STAR).

An example of this would be something like:

steps:
  bwa:
    requirements:
      ResourceRequirement:
        ramMax: 5000
...

This would likely fail on about 50% of bwa jobs, because bwa can use anywhere from 4 to 10G depending on the samples and the version of bwa. Running this with --doublemem would capture failures due to TERM_MEMLIMIT (this might be LSF-specific), so the 50% of jobs that failed would automatically be restarted with 10G, making the section look like:

steps:
  bwa:
    requirements:
      ResourceRequirement:
        ramMax: 10000
...
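
To make the proposed retry rule concrete, here is a minimal sketch in Python. It is not Toil code: the function name, the `double_mem` parameter, and the set of failure reasons are hypothetical and only illustrate the behavior described above, using the same units as ramMax.

```python
# Hypothetical sketch of the proposed --doublemem rule; these names are not part of Toil.

# Failure reasons that should trigger a retry with more memory (LSF naming, per the text above).
MEMLIMIT_REASONS = {"TERM_MEMLIMIT", "TERM_ENOMEM"}


def next_memory_request(requested, exit_reason, double_mem):
    """Return the memory to request when a failed step is retried.

    The request is doubled only when the flag is set and the job died
    because of a memory limit; otherwise it is left unchanged.
    """
    if double_mem and exit_reason in MEMLIMIT_REASONS:
        return requested * 2
    return requested


# The bwa step above asked for ramMax 5000, so a memory-limit failure leads to a 10000 retry.
print(next_memory_request(5000, "TERM_MEMLIMIT", double_mem=True))
```

Failures with any other reason, or runs without the flag, keep the original request, so the flag only changes behavior for the jobs that actually hit the limit.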

Do you guys have thoughts on this?

Thanks, Dennis

┆Issue is synchronized with this Jira Task ┆Issue Number: TOIL-689

Issue Analytics

Created 3 years ago
Comments: 5 (2 by maintainers)

Top GitHub Comments

mr-c commented, Nov 4, 2020

Fixed in https://github.com/DataBiosphere/toil/pull/3313 ; thank you @drkennetz !

drkennetz commented, Oct 15, 2020

@adamnovak I have a working POC, although I think it could be implemented better. I added the --doubleMem flag in common.py in the appropriate section. In lsf.py (line 159) I return a different exit code (1117, chosen arbitrarily so as not to interfere with real exit codes) if the exit_reason is "TERM_MEMLIMIT":

                            if "TERM_MEMLIMIT" in exit_reason:
                                return 1117

Then I have added another entry to BatchJobExitReason “MEMLIMIT”. Then in leader.py in method _gatherUpdatedJobs I say:

        if exitStatus == 1117:  # random number that won't be present on other systems
            exitReason = BatchJobExitReason.MEMLIMIT

Then finally in jobGraph.py I added some additional logic for the case where the exit reason is MEMLIMIT and the doubleMem flag has been set:

        if exitReason == BatchJobExitReason.MEMLIMIT and config.doubleMem:
            self._memory = self.memory * 2
            logger.warning("We have doubled the memory of the failed job %s to %s due to doubleMem flag and job failure",
                           self, self.memory)

I don't think this is the most elegant solution, but it works and would be reproducible on other batch systems with a MEMLIMIT feature. To reproduce it for other batch schedulers:

1. Catch the MEMLIMIT failure.
2. Return 1117 where the exit code is returned, instead of the default exit code.
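
The whole chain can be sketched end to end in standalone Python. This is only an illustration of the idea, not the actual Toil code: `ExitReason`, `lsf_exit_code`, `classify`, and `Job` are hypothetical stand-ins for BatchJobExitReason, the lsf.py change, the leader.py check, and the jobGraph.py logic, and 1117 is the sentinel value used in the leader.py check above.

```python
import enum

# Sentinel exit code for memory-limit failures; assumed not to collide with real exit codes.
MEMLIMIT_EXIT_CODE = 1117


class ExitReason(enum.Enum):
    """Hypothetical stand-in for BatchJobExitReason."""
    FAILED = 1
    MEMLIMIT = 2


def lsf_exit_code(exit_reason_text, default_code):
    """Batch system side: swap in the sentinel when the scheduler reports a memory-limit kill."""
    if "TERM_MEMLIMIT" in exit_reason_text:
        return MEMLIMIT_EXIT_CODE
    return default_code


def classify(exit_status):
    """Leader side: translate the sentinel exit code into an exit reason."""
    if exit_status == MEMLIMIT_EXIT_CODE:
        return ExitReason.MEMLIMIT
    return ExitReason.FAILED


class Job:
    """Hypothetical job record that only tracks its memory request."""

    def __init__(self, memory):
        self.memory = memory

    def on_failure(self, reason, double_mem):
        """Job side: double the request only for memory-limit failures when the flag is set."""
        if reason is ExitReason.MEMLIMIT and double_mem:
            self.memory *= 2


job = Job(memory=5000)
status = lsf_exit_code("TERM_MEMLIMIT", default_code=1)
job.on_failure(classify(status), double_mem=True)
print(job.memory)
```

Supporting another scheduler would then only mean writing its equivalent of `lsf_exit_code`; the classification and doubling steps stay the same.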

If you think this is sufficient or just want to take a look I’m happy to PR.
