Double Mem functionality with batch schedulers toil-cwl-runner

Referencing #2042 here. This is less of an issue and more of an enhancement. On batch systems such as HPC environments, it would be incredibly helpful to have a --doublemem flag for workflows.

In short, the doublemem flag would be passed at runtime on the first instantiation of the workflow. This flag would tell toil to capture failure reasons of jobs, and if the failure reason corresponded to a TERM_MEMLIMIT or TERM_ENOMEM, the specific step that failed would be retried with double the initial requested memory. This would be helpful for genomics workflows with specific steps that did not have well-defined memory profiles (I think most commonly seen with Structural Variants or even STAR).

An example of this would be something like:

steps:
  bwa:
    requirements:
      ResourceRequirement:
        ramMax: 5000
...

This would likely fail on about 50% of bwa jobs, because bwa can use anywhere from 4 to 10G depending on the samples and the version of bwa. Running this with --doublemem would capture failures due to TERM_MEMLIMIT (this might be LSF-specific), so the 50% of jobs that failed would automatically be restarted with 10G, making the section look like:

steps:
  bwa:
    requirements:
      ResourceRequirement:
        ramMax: 10000
...
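
To make the proposed behaviour concrete, here is a minimal sketch of the retry policy in Python. It is not Toil code: `Step`, `ExitReason`, `run_with_double_mem`, `fake_submit` and the 8000 MB threshold are all hypothetical names and values made up for illustration, and the real scheduler call is replaced by a stub.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class ExitReason(Enum):
    """Hypothetical, simplified set of job outcomes."""
    FINISHED = 0
    FAILED = 1
    MEMLIMIT = 2


@dataclass
class Step:
    name: str
    ram_mb: int  # memory request, same unit as ramMax above


def run_with_double_mem(step: Step,
                        submit: Callable[[Step], ExitReason],
                        double_mem: bool = True,
                        max_retries: int = 1) -> ExitReason:
    """Submit a step; after a memory-limit failure, double its request and retry."""
    attempts = 0
    while True:
        reason = submit(step)
        retry = (reason is ExitReason.MEMLIMIT
                 and double_mem
                 and attempts < max_retries)
        if not retry:
            return reason
        step.ram_mb *= 2  # 5000 -> 10000 on the first retry
        attempts += 1


def fake_submit(step: Step) -> ExitReason:
    """Stand-in for the scheduler: this sample happens to need 8000 MB."""
    return ExitReason.FINISHED if step.ram_mb >= 8000 else ExitReason.MEMLIMIT


bwa = Step("bwa", ram_mb=5000)
result = run_with_double_mem(bwa, fake_submit)
print(result.name, bwa.ram_mb)  # FINISHED 10000
```

Jobs that fail for any other reason are returned unchanged, so only memory-limit failures trigger the larger request.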

Do you guys have thoughts on this?

Thanks, Dennis

Issue is synchronized with this Jira Task. Issue Number: TOIL-689

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments:5 (2 by maintainers)

Top GitHub Comments

1 reaction
mr-c commented, Nov 4, 2020
0 reactions
drkennetz commented, Oct 15, 2020

@adamnovak I have a working POC, although I think it could be implemented better. I added the --doubleMem flag in common.py in the appropriate section. In lsf.py I return a different exit code (1117, an arbitrary value chosen so as not to interfere with real exit codes) if the exit_reason is “TERM_MEMLIMIT” (line 159 of lsf.py):

        if "TERM_MEMLIMIT" in exit_reason:
            return 1117

Then I have added another entry to BatchJobExitReason “MEMLIMIT”. Then in leader.py in method _gatherUpdatedJobs I say:

        if exitStatus == 1117:
            exitReason = BatchJobExitReason.MEMLIMIT # random number that won't be present on other systems

Then finally in jobGraph.py I added some additional logic for the case where the exit reason is MEMLIMIT and the doubleMem flag has been set:

        if exitReason == BatchJobExitReason.MEMLIMIT and config.doubleMem:
            self._memory = self.memory * 2
            logger.warning("We have doubled the memory of the failed job %s to %s due to doubleMem flag and job failure",
                           self, self.memory)

I don’t think this is the most elegant solution, but it works and would be reproducible on other batch systems with a MEMLIMIT feature. Basically, to reproduce for other batch schedulers:

  1. Catch the MEMLIMIT failure.
  2. Return 1117 at the point where the exit code is returned, instead of the default exit code.
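
The two steps above could be sketched for several schedulers at once as follows. This is an illustration only, not the code in the POC: `MEMLIMIT_MARKERS`, `exit_code_for`, `exit_reason_for` and the simplified `ExitReason` enum are hypothetical, and the Slurm marker string is an assumption that should be checked against that scheduler's accounting output.

```python
from enum import Enum

# Arbitrary sentinel; it must match the value the leader checks for.
MEMLIMIT_EXIT_CODE = 1117


class ExitReason(Enum):
    """Hypothetical, simplified stand-in for the exit-reason enum."""
    FAILED = 1
    MEMLIMIT = 2


# Strings in a scheduler's status text that indicate a memory-limit kill.
# The LSF entries come from this issue; the Slurm entry is an assumption.
MEMLIMIT_MARKERS = {
    "lsf": ("TERM_MEMLIMIT", "TERM_ENOMEM"),
    "slurm": ("OUT_OF_MEMORY",),
}


def exit_code_for(scheduler: str, status_text: str, default_code: int) -> int:
    """Steps 1 and 2: catch the memory-limit failure, return the sentinel."""
    markers = MEMLIMIT_MARKERS.get(scheduler, ())
    if any(marker in status_text for marker in markers):
        return MEMLIMIT_EXIT_CODE
    return default_code


def exit_reason_for(exit_status: int) -> ExitReason:
    """Leader side: translate a nonzero exit status into an exit reason."""
    if exit_status == MEMLIMIT_EXIT_CODE:
        return ExitReason.MEMLIMIT
    return ExitReason.FAILED


code = exit_code_for("lsf", "Exited. TERM_MEMLIMIT: job killed", default_code=1)
print(code, exit_reason_for(code).name)  # 1117 MEMLIMIT
```

Keeping the marker strings in one table means supporting another scheduler only requires a new entry, while the leader-side check stays the same.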

If you think this is sufficient or just want to take a look I’m happy to PR.
