[CWL] Needs method to avoid copying files when a shared filesystem is available
Paths of File and Directory objects residing on an identified shared file system should be exempted from copying and just passed directly to the node.
To identify these shared filesystems, a cwltoil command-line option specifying the root path could be added.
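As a rough illustration of the proposal, the check could look something like the sketch below. It is not Toil code: the function names `is_under_shared_root` and `stage_input`, the `shared_roots` list (standing in for whatever the proposed command-line option would supply), and the `copy_func` callback are all hypothetical and exist only for this example.

```python
import os


def is_under_shared_root(path, shared_roots):
    """Return True if `path` resolves to a location inside any shared root.

    `shared_roots` is a hypothetical list of directories that the user has
    declared to be visible from every node.
    """
    # Resolve symlinks so that a link pointing into the shared area counts.
    real = os.path.realpath(path)
    for root in shared_roots:
        real_root = os.path.realpath(root)
        try:
            # commonpath compares whole path components, so /data/shared
            # does not accidentally match /data/shared2.
            if os.path.commonpath([real, real_root]) == real_root:
                return True
        except ValueError:
            # Raised when the paths cannot be compared (e.g. different
            # drives on Windows); treat that as "not shared".
            continue
    return False


def stage_input(path, shared_roots, copy_func):
    """Pass shared paths through untouched; copy everything else.

    `copy_func` is a placeholder for whatever mechanism normally copies
    a file to the worker and returns its new location.
    """
    if is_under_shared_root(path, shared_roots):
        return path
    return copy_func(path)
```

Resolving symlinks first matters for the use case quoted below, where the pipeline inputs are symbolic links to the actual files.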
Inspiration:
hcraT @hcraT 12:59 Hi I’m writing a pipeline in cwl to run some analyses on biological data. I was able to use toil to run some tests. However I realized that toil copies my input files that are quite big. It is possible to avoid this behavior? I run my jobs on a batch system. I actually fed in as input to the pipeline symbolic links to the actual files. Thanks
https://gitter.im/bd2k-genomics-toil/Lobby?at=58f09d818fcce56b20ff77e4
Pavlo Lutsik @lutsik 14:08 I echo @hcraT. I probably have a very similar gridEngine-based setup, and copying of large files was a pain. I also implemented a similar workaround: simply eliminated the File type, working with string paths only. I had to adapt all the cwl wrappers and write my own cleanup steps, but the payoff in terms of performance was tremendous.
https://gitter.im/bd2k-genomics-toil/Lobby?at=58f0ada6f22385553d3e440f
Issue Analytics
- State:
- Created 6 years ago
- Reactions:2
- Comments:12 (12 by maintainers)
Top GitHub Comments
@cket and all – I’ve been doing some larger scale tests with cwltoil and am also running into what I believe is the same issue. Steps like variant calling are very slow to spin up because it appears to be copying the input BAM files over to each step before running.
I’m happy to help try to dig into this but am not sure about where to start:
Thanks for any suggestions and pointers.
@evan-wehi and @chapmanb I think this deserves a new issue. I created #1846.