Double Mem functionality with batch schedulers toil-cwl-runner
Referencing #2042 here. This is less of an issue and more of an enhancement. On batch systems such as HPC environments, it would be incredibly helpful to have a --doublemem flag for workflows.
In short, the doublemem flag would be passed at runtime on the first instantiation of the workflow. This flag would tell toil to capture the failure reasons of jobs, and if a failure reason corresponded to TERM_MEMLIMIT or TERM_ENOMEM, the specific step that failed would be retried with double the initially requested memory. This would be helpful for genomics workflows with specific steps that do not have well-defined memory profiles (I think this is most commonly seen with Structural Variants or even STAR).
An example of this would be something like:
steps:
  bwa:
    requirements:
      ResourceRequirement:
        ramMax: 5000
    ...
This would likely fail on about 50% of bwa jobs, because bwa can use anywhere from 4-10G depending on the samples and the version of bwa. Running this with --doublemem would capture failures due to TERM_MEMLIMIT (this might be LSF specific), so the 50% of jobs that failed would automatically be restarted with 10G, making the section look like:
steps:
  bwa:
    requirements:
      ResourceRequirement:
        ramMax: 10000
    ...
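The retry rule described above can be sketched as a small pure function. This is only a minimal sketch under stated assumptions, not Toil code: `next_ram_request`, `MEMORY_FAILURE_REASONS`, and the `double_mem` parameter are hypothetical names chosen for illustration, and the memory value is assumed to be in the same units as ramMax.

```python
# Hypothetical sketch of the proposed --doublemem retry rule; not Toil's implementation.

# Failure reasons named in this issue (TERM_MEMLIMIT may be LSF specific).
MEMORY_FAILURE_REASONS = {"TERM_MEMLIMIT", "TERM_ENOMEM"}


def next_ram_request(initial_ram, failure_reason, double_mem):
    """Return the memory to request when retrying a failed step.

    The request is doubled only when the flag is set and the scheduler
    reported a memory-related failure; otherwise it is left unchanged.
    """
    if double_mem and failure_reason in MEMORY_FAILURE_REASONS:
        return initial_ram * 2
    return initial_ram


if __name__ == "__main__":
    # The bwa step from the example: requested 5000, killed for exceeding memory.
    print(next_ram_request(5000, "TERM_MEMLIMIT", double_mem=True))
    # Without the flag the request is not changed.
    print(next_ram_request(5000, "TERM_MEMLIMIT", double_mem=False))
```

Keeping the rule separate from any scheduler code means the set of memory-related reasons could be extended for batch systems that report the condition under a different name.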
Do you guys have thoughts on this?
Thanks, Dennis
┆Issue is synchronized with this Jira Task ┆Issue Number: TOIL-689
Issue Analytics
- Created 3 years ago
- Comments: 5 (2 by maintainers)
Top GitHub Comments
Fixed in https://github.com/DataBiosphere/toil/pull/3313; thank you @drkennetz!
@adamnovak I have a working POC, although I think it could be implemented better. I added the --doubleMem flag in common.py in the appropriate section. In lsf.py I am returning a different exit code (117, picked at random so as not to interfere) if the exit_reason is “TERM_MEMLIMIT”: line 159 of lsf.py.
Then I have added another entry to BatchJobExitReason, “MEMLIMIT”. Then in leader.py, in the method _gatherUpdatedJobs, I say:
Then finally in jobGraph.py I added some additional logic for the case where the exit reason is MEMLIMIT and the doubleMem flag has been set:
I don’t think this is the most elegant solution, but it works and would be reproducible on other batchSystems with a MEMLIMIT feature. Basically to reproduce for other batch schedulers:
If you think this is sufficient or just want to take a look I’m happy to PR.
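The wiring described in this comment (a sentinel exit code, an extra exit-reason entry, and a memory bump before the retry) might look roughly like the following. This is a hedged, self-contained sketch and not the code from the POC or from the Toil repository: `ExitReason`, `Job`, `exit_reason_from_code`, and `prepare_retry` are hypothetical stand-ins, and only the value 117 and the MEMLIMIT name come from the comment.

```python
# Hypothetical sketch of the flow described above; not the code from the Toil repository.
from dataclasses import dataclass
from enum import Enum

MEMLIMIT_EXIT_CODE = 117  # arbitrary sentinel value mentioned in the comment


class ExitReason(Enum):
    """Stand-in for an exit-reason enum that has gained a MEMLIMIT entry."""
    FINISHED = 1
    FAILED = 2
    MEMLIMIT = 3


@dataclass
class Job:
    """Minimal stand-in for a job record carrying a memory request."""
    name: str
    memory: int


def exit_reason_from_code(exit_code):
    """Map a batch-system exit code to an exit reason."""
    if exit_code == 0:
        return ExitReason.FINISHED
    if exit_code == MEMLIMIT_EXIT_CODE:
        return ExitReason.MEMLIMIT
    return ExitReason.FAILED


def prepare_retry(job, exit_reason, double_mem):
    """Double the job's memory request before a retry when the flag applies."""
    if double_mem and exit_reason is ExitReason.MEMLIMIT:
        job.memory *= 2
    return job


if __name__ == "__main__":
    job = Job(name="bwa", memory=5000)
    reason = exit_reason_from_code(MEMLIMIT_EXIT_CODE)
    prepare_retry(job, reason, double_mem=True)
    print(job.name, reason.name, job.memory)
```

Under this layout, supporting another scheduler would only require that its batch-system module emit the same sentinel code when it detects a memory-limit kill; the mapping and retry steps would stay unchanged.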