[CWL] Needs method to avoid copying files when a shared filesystem is available

Paths of File and Directory objects residing on an identified shared file system should be exempted from copying and just passed directly to the node.

To identify these shared filesystems, a cwltoil command-line option specifying the root path could be added.
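As a rough illustration of the proposal (not Toil's actual code or API), the check could look something like the sketch below. The function names `is_on_shared_filesystem` and `plan_staging`, and the idea of passing a list of shared roots, are hypothetical and only meant to show the decision being asked for: inputs under a declared shared root are passed through as-is, everything else is copied as before.

```python
from pathlib import Path
from typing import Iterable, List, Sequence, Tuple


def is_on_shared_filesystem(path: str, shared_roots: Sequence[str]) -> bool:
    """Return True if `path` lies at or below one of `shared_roots`.

    Hypothetical helper. Paths are resolved first, so a symlink that
    points into a shared root is treated as shared too.
    """
    resolved = Path(path).resolve()
    for root in shared_roots:
        root_resolved = Path(root).resolve()
        # Compare whole path components, so /shared does not match /sharedness.
        if resolved == root_resolved or root_resolved in resolved.parents:
            return True
    return False


def plan_staging(
    paths: Iterable[str], shared_roots: Sequence[str]
) -> Tuple[List[str], List[str]]:
    """Split inputs into (pass directly to the node, copy as usual)."""
    direct: List[str] = []
    to_copy: List[str] = []
    for p in paths:
        if is_on_shared_filesystem(p, shared_roots):
            direct.append(p)
        else:
            to_copy.append(p)
    return direct, to_copy
```

Here `shared_roots` stands in for whatever value the suggested command-line option would supply; how such an option would be named and wired into cwltoil is left open.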

Inspiration:

hcraT @hcraT 12:59 Hi I’m writing a pipeline in cwl to run some analyses on biological data. I was able to use toil to run some tests. However I realized that toil copies my input files that are quite big. It is possible to avoid this behavior? I run my jobs on a batch system. I actually fed in as input to the pipeline symbolic links to the actual files. Thanks

https://gitter.im/bd2k-genomics-toil/Lobby?at=58f09d818fcce56b20ff77e4

Pavlo Lutsik @lutsik 14:08 I echo @hcraT. I probably have a very similar gridEngine-based setup, and copying of large files was a pain. I also implemented a similar workaround: simply eliminated the File type, working with string paths only. I had to adapt all the cwl wrappers and write my own cleanup steps, but the payoff in terms of performance was tremendous.

https://gitter.im/bd2k-genomics-toil/Lobby?at=58f0ada6f22385553d3e440f

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Reactions: 2
  • Comments: 12 (12 by maintainers)

Top GitHub Comments

2 reactions
chapmanb commented, May 2, 2017

@cket and all – I’ve been doing some larger scale tests with cwltoil and am also running into what I believe is the same issue. Steps like variant calling are very slow to spin up because Toil appears to be copying the input BAM files over to each step before running.

I’m happy to help try to dig into this but am not sure about where to start:

  • Is copying everything to the isolated run directory the expected behavior of Toil? Or is this specific to the CWL implementation?
  • If the latter, how does Toil decide when to copy and when to re-use an existing local file?
  • Is it possible to also do something similar for S3 filestores on AWS, to avoid downloading and staging multiple times on the same machine?

Thanks for any suggestions and pointers.

0 reactions
ejacox commented, Aug 25, 2017

@evan-wehi and @chapmanb I think this deserves a new issue. I created #1846.
