In-place updates of files in CWL don't seem to work when run manually without caching
On Courtyard, with commit 5b70540708973d0116d73605ee7daa3b171481cf (current master branch), when I run this particular test from the CWL 1.1 conformance tests, I get the wrong answer when caching is off:
```
git clone https://github.com/common-workflow-language/cwl-v1.1.git ~/build/toil/src/toil/test/cwl/spec_v11 && cd ~/build/toil/src/toil/test/cwl/spec_v11 && git checkout 664835e83eb5e57eee18a04ce7b05fb9d70d77b7
toil-cwl-runner --disableCaching=True --clean=always --outdir=/tmp/tmp7fb6na8h --workDir=/tmp/adamnovak-toil/test --quiet tests/inp_update_wf.cwl tests/empty.json
```
(I need to put the `--workDir` on local storage to work around https://github.com/common-workflow-language/cwltool/issues/1405.)
It thinks and logs for a minute or so and finally outputs:
```
{
  "a": 3,
  "b": 4
}
```
If I change to `--disableCaching False`, I get the answer that the conformance tests actually check for:
```
{
  "a": 4,
  "b": 4
}
```
The `a` answer is generated by a job that looks at the output file of a prior job, and is supposed to be waiting on another successor of that prior job to have made an in-place update. With caching off, either that update is not happening or it is not visible (in time?) to the job that wants to see it.
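For what it's worth, here is a toy Python model (my own sketch, nothing to do with Toil's actual code) of the difference I think I'm seeing, assuming that with caching off each job ends up working on its own exported copy of the file rather than on a shared cached copy:

```python
# Toy model of why an in-place update can go missing when each job works on
# its own copy of a file instead of a shared one. Not Toil code.
import os
import shutil
import tempfile

def run_with_shared_copy():
    store = tempfile.mkdtemp()
    path = os.path.join(store, "counter.txt")
    with open(path, "w") as f:
        f.write("3")          # job A creates the file
    with open(path, "w") as f:
        f.write("4")          # job B updates the same local copy in place
    with open(path) as f:
        return f.read()       # job C sees "4"

def run_with_private_copies():
    store = tempfile.mkdtemp()
    original = os.path.join(store, "counter.txt")
    with open(original, "w") as f:
        f.write("3")          # job A creates the file and "uploads" it
    b_copy = os.path.join(tempfile.mkdtemp(), "counter.txt")
    shutil.copy(original, b_copy)
    with open(b_copy, "w") as f:
        f.write("4")          # job B updates only its private copy; nothing re-uploads it
    c_copy = os.path.join(tempfile.mkdtemp(), "counter.txt")
    shutil.copy(original, c_copy)
    with open(c_copy) as f:
        return f.read()       # job C still sees the stale "3"

print(run_with_shared_copy(), run_with_private_copies())  # prints: 4 3
```

If that assumption is right, the in-place update is happening, but it never makes it back to anything the downstream job actually reads.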
I’m not sure why this hasn’t come up in the CI tests before; it has started failing on CI in the branch for #3323. My current theory is that when I changed the number of tests to run in parallel (which we had been overestimating, because we saw all of the Kubernetes host’s cores while only having access to some of them), it changed the usual outcome of a race condition.
@DailyDreaming Any idea what’s going on here? Can you replicate this issue when you run this test?
(This issue is synchronized with Jira task TOIL-807.)
Top GitHub Comments
I feel like we’re missing some code to detect and reupload files that CWL jobs tried to modify in place, and to get the new IDs for those files to other jobs that expect to see those modifications just by virtue of their execution order constraints.
If we don’t have that, how do we expect modified files to make it from one Mesos or Kubernetes node to another? We can’t commit the modification back to the original file ID.
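To make that concrete, here is a rough sketch of the kind of detect-and-reupload step I mean. The `run_and_reupload` helper, the `run_tool` callable, and the hashing approach are all hypothetical, invented for illustration; `writeGlobalFile` is the real Toil file-store method for adding a local file to the job store under a new ID:

```python
# Hypothetical sketch, not existing Toil code: after a CWL tool runs, notice
# which staged files it rewrote in place and push them back to the job store
# under fresh IDs so jobs on other nodes could fetch the updated contents.
import hashlib
import os

def _digest(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def run_and_reupload(file_store, staged, run_tool):
    """staged: dict mapping Toil FileID -> local path the tool may rewrite in place.
    run_tool: callable that actually executes the CWL tool against those paths."""
    before = {fid: _digest(path) for fid, path in staged.items()}
    run_tool(staged)
    new_ids = {}
    for fid, path in staged.items():
        if os.path.exists(path) and _digest(path) != before[fid]:
            # The tool rewrote this file; write it back as a new global file
            # so the modified contents are reachable from other nodes.
            new_ids[fid] = file_store.writeGlobalFile(path)
        else:
            new_ids[fid] = fid
    return new_ids
```

Even with something like this, we would still need a way to hand the `new_ids` mapping to jobs that depend on the modifying job only through execution order constraints, which is the part I think is missing.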
We’ve completely rewritten CWL filesystem support since this happened, and we now have `--bypass-file-store` to enable the in-place update requirement that CWL offers.