digest code for inputs & tasks to inform call caching
See original GitHub issue

Precursor to #308 (call caching).
Call caching will work by recording (probably in a SQLite db), for each successful completed task call, a digest code of the task source code + inputs, and the output JSON. Then when we’re newly asked to run a task on given inputs, compute the digest code and query the database to see if we have a previous run with the same one (and all the output files still exist).
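The recording/lookup flow described above could be sketched roughly as follows. This is a minimal illustration, not the actual miniwdl implementation; the table name, column names, and helper functions are all assumptions.

```python
import json
import sqlite3

def open_cache(path=":memory:"):
    # Hypothetical call-cache table: one row per successful task call,
    # keyed by the digest of (task source + inputs), storing output JSON.
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS call_cache ("
        " digest TEXT PRIMARY KEY,"
        " outputs_json TEXT NOT NULL)"
    )
    return db

def record_call(db, digest, outputs):
    # Record the outputs of a successfully completed call.
    db.execute(
        "INSERT OR REPLACE INTO call_cache (digest, outputs_json) VALUES (?, ?)",
        (digest, json.dumps(outputs)),
    )

def lookup_call(db, digest):
    # Query for a previous run with the same digest; returns the output
    # JSON (as a dict) on a hit, or None on a miss. A real implementation
    # would also verify that all referenced output files still exist.
    row = db.execute(
        "SELECT outputs_json FROM call_cache WHERE digest = ?", (digest,)
    ).fetchone()
    return json.loads(row[0]) if row else None
```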
Digesting the inputs should be pretty easy: convert the inputs WDL.Env.Values to a dict with WDL.values_to_json, write them out as a JSON string with lexicographically ordered keys, and run a generic digest algorithm on that string.
Digesting tasks will be interesting. Ideally, we'd like the digest code to ignore trivial changes to the source code like whitespace, comments, and the order of declarations, while of course detecting any other meaningful change to the task. That said, we can begin with something simpler, like digesting the substring of the .wdl file constituting the task source (the range of line & column numbers can be found from the pos attribute of the Task object).
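Extracting that substring from the source text could look like the sketch below; the 1-based line/column bounds are assumed to come from a position object like Task.pos, though the exact attribute layout is not spelled out here:

```python
def task_source_substring(wdl_text, begin_line, begin_col, end_line, end_col):
    # Extract the span of the .wdl source between two 1-based
    # (line, column) positions, inclusive of the end column.
    lines = wdl_text.split("\n")
    if begin_line == end_line:
        return lines[begin_line - 1][begin_col - 1 : end_col]
    parts = [lines[begin_line - 1][begin_col - 1 :]]  # tail of first line
    parts += lines[begin_line : end_line - 1]         # full middle lines
    parts.append(lines[end_line - 1][:end_col])       # head of last line
    return "\n".join(parts)
```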
Tasks are self-contained except for the definitions of any WDL struct types used therein. So ultimately the task digest would need to cover those struct type definitions as well as the task source code.
Later we’ll also want to be able to similarly digest entire workflows, which would need to cover the workflow source code as well as all called tasks (or subworkflows) and any struct types used.
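One way to cover both cases (task + structs, or workflow + called tasks/subworkflows + structs) is to fold the digests of the dependencies into the digest of the main source. A hedged sketch, with the sorting chosen so that the order in which dependencies happen to be defined doesn't matter:

```python
import hashlib

def combined_digest(main_source, dependency_digests):
    # Digest of a task/workflow source together with the digests of the
    # struct definitions (and, for workflows, called tasks/subworkflows)
    # it depends on. Sorting makes the result order-independent.
    h = hashlib.sha256()
    h.update(main_source.encode())
    for d in sorted(dependency_digests):
        h.update(d.encode())
    return h.hexdigest()
```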
Is there a way to achieve all this without needing to write a specialized digest method for every single AST node class? TBD.
Issue Analytics
- Created: 4 years ago
- Comments: 5
Top GitHub Comments
Here’s the WDL spec for structs btw to help orient: https://github.com/openwdl/wdl/blob/master/versions/1.0/SPEC.md#struct-definition
@MDunitz fyi