digest code for inputs & tasks to inform call caching
See original GitHub issue

Precursor to #308 (call caching).
Call caching will work by recording (probably in a SQLite db), for each successful completed task call, a digest code of the task source code + inputs, and the output JSON. Then when we’re newly asked to run a task on given inputs, compute the digest code and query the database to see if we have a previous run with the same one (and all the output files still exist).
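The recording/lookup flow described above could be sketched roughly as follows. This is a minimal illustration, not the actual miniwdl implementation; the table name, column names, and helper functions are all assumptions.

```python
import json
import sqlite3

def open_cache(path=":memory:"):
    # Hypothetical call-cache table: one row per successful task call,
    # keyed by the digest of (task source + inputs), storing output JSON.
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS call_cache ("
        " digest TEXT PRIMARY KEY,"
        " outputs_json TEXT NOT NULL)"
    )
    return db

def record_call(db, digest, outputs):
    # Record the outputs of a successfully completed call.
    db.execute(
        "INSERT OR REPLACE INTO call_cache (digest, outputs_json) VALUES (?, ?)",
        (digest, json.dumps(outputs)),
    )

def lookup_call(db, digest):
    # Query for a previous run with the same digest; returns the output
    # JSON (as a dict) on a hit, or None on a miss. A real implementation
    # would also verify that all referenced output files still exist.
    row = db.execute(
        "SELECT outputs_json FROM call_cache WHERE digest = ?", (digest,)
    ).fetchone()
    return json.loads(row[0]) if row else None
```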
Digesting the inputs should be pretty easy: convert the inputs WDL.Env.Values to a dict with WDL.values_to_json, write them out as a JSON string with lexicographically ordered keys, and run a generic digest algorithm on that string.
Digesting tasks will be interesting. Ideally, we'd like the digest code to ignore trivial changes to the source code like whitespace, comments, and the order of declarations, while of course detecting any other meaningful change to the task. That said, we can begin with something simpler, like digesting the substring of the .wdl file constituting the task source (the range of line & column numbers can be found from the pos attribute of the Task object).
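Extracting that substring from the source text could look like the sketch below; the 1-based line/column bounds are assumed to come from a position object like Task.pos, though the exact attribute layout is not spelled out here:

```python
def task_source_substring(wdl_text, begin_line, begin_col, end_line, end_col):
    # Extract the span of the .wdl source between two 1-based
    # (line, column) positions, inclusive of the end column.
    lines = wdl_text.split("\n")
    if begin_line == end_line:
        return lines[begin_line - 1][begin_col - 1 : end_col]
    parts = [lines[begin_line - 1][begin_col - 1 :]]  # tail of first line
    parts += lines[begin_line : end_line - 1]         # full middle lines
    parts.append(lines[end_line - 1][:end_col])       # head of last line
    return "\n".join(parts)
```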
Tasks are self-contained except for the definitions of any WDL struct types used therein. So ultimately the task digest would need to cover those struct type definitions as well as the task source code.
Later we’ll also want to be able to similarly digest entire workflows, which would need to cover the workflow source code as well as all called tasks (or subworkflows) and any struct types used.
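One way to cover both cases (task + structs, or workflow + called tasks/subworkflows + structs) is to fold the digests of the dependencies into the digest of the main source. A hedged sketch, with the sorting chosen so that the order in which dependencies happen to be defined doesn't matter:

```python
import hashlib

def combined_digest(main_source, dependency_digests):
    # Digest of a task/workflow source together with the digests of the
    # struct definitions (and, for workflows, called tasks/subworkflows)
    # it depends on. Sorting makes the result order-independent.
    h = hashlib.sha256()
    h.update(main_source.encode())
    for d in sorted(dependency_digests):
        h.update(d.encode())
    return h.hexdigest()
```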
Is there a way to achieve all this without needing to write a specialized digest method for every single AST node class? TBD.
Issue Analytics
- Created: 4 years ago
- Comments: 5
Top GitHub Comments
Here’s the WDL spec for structs btw to help orient: https://github.com/openwdl/wdl/blob/master/versions/1.0/SPEC.md#struct-definition
@MDunitz fyi