Make CWL interpreter use symlinks for reads from the file store where possible (no symlinks created for workdir).
This is related to https://github.com/DataBiosphere/toil/issues/1627 and https://github.com/DataBiosphere/toil/issues/1846, which were fixed by https://github.com/DataBiosphere/toil/pull/1687 and https://github.com/DataBiosphere/toil/issues/1786. Symlinks were enabled for the jobstore in cwltoil but not for the workdir. If the input data resides on a different device (NFS, a second hard drive, a Docker bind mount, etc.), Toil creates a copy rather than a symlink.
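To make the failure mode concrete: hard links can only be created within a single filesystem, while symlinks can point anywhere. A minimal sketch (the paths are placeholders for two different filesystems, as in the runs below):

```python
import os

src = "/nfs/large.file"  # placeholder: file on an NFS mount
dst = "/tmp/large.file"  # placeholder: destination on local disk

os.symlink(src, dst)  # fine: symlinks may cross devices
os.remove(dst)
os.link(src, dst)     # raises OSError: [Errno 18] Invalid cross-device link
```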
I’ll be using the following CWL files to illustrate the issue.
ls.cwl:

```yaml
#!/usr/bin/env cwl-runner
cwlVersion: v1.0
class: CommandLineTool
baseCommand: ls
inputs:
  file: File
outputs:
  output: stdout
stdout: stdout.txt
```
wf.cwl (used to create an intermediate file in a workflow):

```yaml
#!/usr/bin/env cwl-runner
cwlVersion: v1.0
class: Workflow
inputs:
  file: File
steps:
  copy:
    run:
      class: CommandLineTool
      baseCommand: cp
      inputs:
        source:
          type: File
          inputBinding:
            position: 1
        dest:
          type: string
          inputBinding:
            position: 2
      outputs:
        output:
          type: File
          outputBinding:
            glob: $(inputs.dest)
    in:
      source: file
      dest:
        default: "copy.file"
    out: [output]
  ls:
    run: ls.cwl
    in:
      file: copy/output
    out: [output]
outputs:
  output:
    type: File
    outputSource: ls/output
```
Running the following creates the expected symlink in the jobstore. I also found that it creates hard links in the workdir when /data and /tmp are on the same disk. Is there a reason why there are two hard links to the same file, or is this a bug?
```
$ cwltoil --clean never --cleanWorkDir never --jobStore /tmp/jobstore --workDir /tmp/workdir ls.cwl --file /data/large.file
...
$ readlink /tmp/jobstore/tmp/n/Z/tmp9ibpNt-x-large.file /tmp/workdir/toil-733c92fc-536a-4fec-9d28-1e5abf804c87-f88cf264a5bf4a72bcc41f04517ae518/tmp_S71Qf/182402ec-3108-42a3-b2dc-77f107fa0d23/tmp*.tmp
/data/large.file
/data/large.file
/data/large.file
$ ls -i /tmp/jobstore/tmp/n/Z/tmp9ibpNt-x-large.file /tmp/workdir/toil-733c92fc-536a-4fec-9d28-1e5abf804c87-f88cf264a5bf4a72bcc41f04517ae518/tmp_S71Qf/182402ec-3108-42a3-b2dc-77f107fa0d23/tmp*.tmp
5189 /tmp/jobstore/tmp/n/Z/tmp9ibpNt-x-large.file
5189 /tmp/workdir/toil-733c92fc-536a-4fec-9d28-1e5abf804c87-f88cf264a5bf4a72bcc41f04517ae518/tmp_S71Qf/182402ec-3108-42a3-b2dc-77f107fa0d23/tmpRilmBe.tmp
5189 /tmp/workdir/toil-733c92fc-536a-4fec-9d28-1e5abf804c87-f88cf264a5bf4a72bcc41f04517ae518/tmp_S71Qf/182402ec-3108-42a3-b2dc-77f107fa0d23/tmpqizQgZ.tmp
```
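For reference, two paths are hard links to the same data exactly when they share a device and an inode number, which is what the matching inodes in the ls -i output above show. A quick check in Python:

```python
import os

def is_same_file(a, b):
    # Hard links share both the device and the inode number.
    # os.stat() follows symlinks, so a symlink to the file also matches.
    sa, sb = os.stat(a), os.stat(b)
    return (sa.st_dev, sa.st_ino) == (sb.st_dev, sb.st_ino)
```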
Creating the hard links fails when the input file is on an NFS mount (/nfs), and Toil seems to fall back to creating copies in the workdir instead. As before, I get two complete copies of the same file in the workdir.
```
$ cwltoil --clean never --cleanWorkDir never --jobStore /tmp/jobstore --workDir /tmp/workdir ls.cwl --file /nfs/large.file
$ readlink /tmp/jobstore/tmp/e/2/tmp6fgHor-x-large.file
/nfs/large.file
$ ls -i /nfs/large.file /tmp/workdir/toil-c1ad08d6-e752-4dd7-9522-1471ab32993e-f88cf264a5bf4a72bcc41f04517ae518/tmp1WmIRM/bbea6323-e4f8-4f60-b9fe-582621c5ca80/tmp*.tmp
13676546 /nfs/large.file
3879019 /tmp/workdir/toil-c1ad08d6-e752-4dd7-9522-1471ab32993e-f88cf264a5bf4a72bcc41f04517ae518/tmp1WmIRM/bbea6323-e4f8-4f60-b9fe-582621c5ca80/tmpR9drKc.tmp
3879018 /tmp/workdir/toil-c1ad08d6-e752-4dd7-9522-1471ab32993e-f88cf264a5bf4a72bcc41f04517ae518/tmp1WmIRM/bbea6323-e4f8-4f60-b9fe-582621c5ca80/tmpTrCLvg.tmp
$ md5sum /nfs/large.file /tmp/workdir/toil-c1ad08d6-e752-4dd7-9522-1471ab32993e-f88cf264a5bf4a72bcc41f04517ae518/tmp1WmIRM/bbea6323-e4f8-4f60-b9fe-582621c5ca80/tmp*.tmp
ade18cde4adc34fa8c6804fc955c69ef /nfs/large.file
ade18cde4adc34fa8c6804fc955c69ef /tmp/workdir/toil-c1ad08d6-e752-4dd7-9522-1471ab32993e-f88cf264a5bf4a72bcc41f04517ae518/tmp1WmIRM/bbea6323-e4f8-4f60-b9fe-582621c5ca80/tmpR9drKc.tmp
ade18cde4adc34fa8c6804fc955c69ef /tmp/workdir/toil-c1ad08d6-e752-4dd7-9522-1471ab32993e-f88cf264a5bf4a72bcc41f04517ae518/tmp1WmIRM/bbea6323-e4f8-4f60-b9fe-582621c5ca80/tmpTrCLvg.tmp
```
The most common situation in a cluster environment is to host the jobstore on a shared filesystem (/nfs in my example) and keep the workdir on local scratch space (/tmp). Based on my previous observation I assumed that intermediate files in a CWL workflow would be copied to the workdir, but the run fails completely when the input data is on a separate filesystem from the jobstore.
```
$ cwltoil --retryCount 0 --jobStore /nfs/jobstore/test --workDir /tmp/workdir wf.cwl --file /data/large.file
INFO:cwltool:Resolved 'wf.cwl' to 'file:///home/forsmark/wf.cwl'
WARNING:toil.batchSystems.singleMachine:Limiting maxCores to CPU count of system (4).
WARNING:toil.batchSystems.singleMachine:Limiting maxMemory to physically available memory (33731055616).
WARNING:toil.batchSystems.singleMachine:Limiting maxDisk to physically available disk (35152494592).
INFO:toil:Running Toil version 3.18.0-84239d802248a5f4a220e762b3b8ce5cc92af0be.
INFO:toil.leader:Issued job 'file:///home/forsmark/wf.cwl#copy/ddb166bf-0825-46da-912d-75af8116b927' cp b/h/jobtOprVm with job batch system ID: 1 and cores: 1, disk: 3.0 G, and memory: 2.0 G
INFO:toil.leader:Job ended successfully: 'file:///home/forsmark/wf.cwl#copy/ddb166bf-0825-46da-912d-75af8116b927' cp b/h/jobtOprVm
WARNING:toil.leader:The job seems to have left a log file, indicating failure: 'file:///home/forsmark/wf.cwl#copy/ddb166bf-0825-46da-912d-75af8116b927' cp b/h/jobtOprVm
WARNING:toil.leader:b/h/jobtOprVm INFO:toil.worker:---TOIL WORKER OUTPUT LOG---
WARNING:toil.leader:b/h/jobtOprVm INFO:toil:Running Toil version 3.18.0-84239d802248a5f4a220e762b3b8ce5cc92af0be.
WARNING:toil.leader:b/h/jobtOprVm Got workflow error
WARNING:toil.leader:b/h/jobtOprVm Traceback (most recent call last):
WARNING:toil.leader:b/h/jobtOprVm   File "/home/forsmark/venv2/local/lib/python2.7/site-packages/cwltool/executors.py", line 144, in run_jobs
WARNING:toil.leader:b/h/jobtOprVm     for job in jobiter:
WARNING:toil.leader:b/h/jobtOprVm   File "/home/forsmark/venv2/local/lib/python2.7/site-packages/cwltool/command_line_tool.py", line 405, in job
WARNING:toil.leader:b/h/jobtOprVm     reffiles, builder.stagedir, runtimeContext, True)
WARNING:toil.leader:b/h/jobtOprVm   File "/home/forsmark/venv2/local/lib/python2.7/site-packages/toil/cwl/cwltoil.py", line 273, in make_path_mapper
WARNING:toil.leader:b/h/jobtOprVm     runtimeContext.toil_get_file)
WARNING:toil.leader:b/h/jobtOprVm   File "/home/forsmark/venv2/local/lib/python2.7/site-packages/toil/cwl/cwltoil.py", line 227, in __init__
WARNING:toil.leader:b/h/jobtOprVm     referenced_files, basedir, stagedir, separateDirs=separateDirs)
WARNING:toil.leader:b/h/jobtOprVm   File "/home/forsmark/venv2/local/lib/python2.7/site-packages/cwltool/pathmapper.py", line 220, in __init__
WARNING:toil.leader:b/h/jobtOprVm     self.setup(dedup(referenced_files), basedir)
WARNING:toil.leader:b/h/jobtOprVm   File "/home/forsmark/venv2/local/lib/python2.7/site-packages/cwltool/pathmapper.py", line 273, in setup
WARNING:toil.leader:b/h/jobtOprVm     self.visit(fob, stagedir, basedir, copy=fob.get("writable"), staged=True)
WARNING:toil.leader:b/h/jobtOprVm   File "/home/forsmark/venv2/local/lib/python2.7/site-packages/toil/cwl/cwltoil.py", line 254, in visit
WARNING:toil.leader:b/h/jobtOprVm     resolved = self.get_file(loc) if self.get_file else loc
WARNING:toil.leader:b/h/jobtOprVm   File "/home/forsmark/venv2/local/lib/python2.7/site-packages/toil/cwl/cwltoil.py", line 308, in toil_get_file
WARNING:toil.leader:b/h/jobtOprVm     src_path = file_store.readGlobalFile(file_store_id[7:])
WARNING:toil.leader:b/h/jobtOprVm   File "/home/forsmark/venv2/local/lib/python2.7/site-packages/toil/fileStore.py", line 1661, in readGlobalFile
WARNING:toil.leader:b/h/jobtOprVm     self.jobStore.readFile(fileStoreID, localFilePath, symlink=symlink)
WARNING:toil.leader:b/h/jobtOprVm   File "/home/forsmark/venv2/local/lib/python2.7/site-packages/toil/jobStores/fileJobStore.py", line 380, in readFile
WARNING:toil.leader:b/h/jobtOprVm     os.link(jobStoreFilePath, localFilePath)
WARNING:toil.leader:b/h/jobtOprVm OSError: [Errno 18] Invalid cross-device link
WARNING:toil.leader:b/h/jobtOprVm ERROR:cwltool:Got workflow error
WARNING:toil.leader:b/h/jobtOprVm Traceback (most recent call last):
WARNING:toil.leader:b/h/jobtOprVm   File "/home/forsmark/venv2/local/lib/python2.7/site-packages/cwltool/executors.py", line 144, in run_jobs
WARNING:toil.leader:b/h/jobtOprVm     for job in jobiter:
WARNING:toil.leader:b/h/jobtOprVm   File "/home/forsmark/venv2/local/lib/python2.7/site-packages/cwltool/command_line_tool.py", line 405, in job
WARNING:toil.leader:b/h/jobtOprVm     reffiles, builder.stagedir, runtimeContext, True)
WARNING:toil.leader:b/h/jobtOprVm   File "/home/forsmark/venv2/local/lib/python2.7/site-packages/toil/cwl/cwltoil.py", line 273, in make_path_mapper
WARNING:toil.leader:b/h/jobtOprVm     runtimeContext.toil_get_file)
WARNING:toil.leader:b/h/jobtOprVm   File "/home/forsmark/venv2/local/lib/python2.7/site-packages/toil/cwl/cwltoil.py", line 227, in __init__
WARNING:toil.leader:b/h/jobtOprVm     referenced_files, basedir, stagedir, separateDirs=separateDirs)
WARNING:toil.leader:b/h/jobtOprVm   File "/home/forsmark/venv2/local/lib/python2.7/site-packages/cwltool/pathmapper.py", line 220, in __init__
WARNING:toil.leader:b/h/jobtOprVm     self.setup(dedup(referenced_files), basedir)
WARNING:toil.leader:b/h/jobtOprVm   File "/home/forsmark/venv2/local/lib/python2.7/site-packages/cwltool/pathmapper.py", line 273, in setup
WARNING:toil.leader:b/h/jobtOprVm     self.visit(fob, stagedir, basedir, copy=fob.get("writable"), staged=True)
WARNING:toil.leader:b/h/jobtOprVm   File "/home/forsmark/venv2/local/lib/python2.7/site-packages/toil/cwl/cwltoil.py", line 254, in visit
WARNING:toil.leader:b/h/jobtOprVm     resolved = self.get_file(loc) if self.get_file else loc
WARNING:toil.leader:b/h/jobtOprVm   File "/home/forsmark/venv2/local/lib/python2.7/site-packages/toil/cwl/cwltoil.py", line 308, in toil_get_file
WARNING:toil.leader:b/h/jobtOprVm     src_path = file_store.readGlobalFile(file_store_id[7:])
WARNING:toil.leader:b/h/jobtOprVm   File "/home/forsmark/venv2/local/lib/python2.7/site-packages/toil/fileStore.py", line 1661, in readGlobalFile
WARNING:toil.leader:b/h/jobtOprVm     self.jobStore.readFile(fileStoreID, localFilePath, symlink=symlink)
WARNING:toil.leader:b/h/jobtOprVm   File "/home/forsmark/venv2/local/lib/python2.7/site-packages/toil/jobStores/fileJobStore.py", line 380, in readFile
WARNING:toil.leader:b/h/jobtOprVm     os.link(jobStoreFilePath, localFilePath)
WARNING:toil.leader:b/h/jobtOprVm OSError: [Errno 18] Invalid cross-device link
WARNING:toil.leader:b/h/jobtOprVm Traceback (most recent call last):
WARNING:toil.leader:b/h/jobtOprVm   File "/home/forsmark/venv2/local/lib/python2.7/site-packages/toil/worker.py", line 314, in workerScript
WARNING:toil.leader:b/h/jobtOprVm     job._runner(jobGraph=jobGraph, jobStore=jobStore, fileStore=fileStore)
WARNING:toil.leader:b/h/jobtOprVm   File "/home/forsmark/venv2/local/lib/python2.7/site-packages/toil/job.py", line 1351, in _runner
WARNING:toil.leader:b/h/jobtOprVm     returnValues = self._run(jobGraph, fileStore)
WARNING:toil.leader:b/h/jobtOprVm   File "/home/forsmark/venv2/local/lib/python2.7/site-packages/toil/job.py", line 1296, in _run
WARNING:toil.leader:b/h/jobtOprVm     return self.run(fileStore)
WARNING:toil.leader:b/h/jobtOprVm   File "/home/forsmark/venv2/local/lib/python2.7/site-packages/toil/cwl/cwltoil.py", line 565, in run
WARNING:toil.leader:b/h/jobtOprVm     self.cwltool, cwljob, runtime_context, cwllogger)
WARNING:toil.leader:b/h/jobtOprVm   File "/home/forsmark/venv2/local/lib/python2.7/site-packages/cwltool/executors.py", line 90, in execute
WARNING:toil.leader:b/h/jobtOprVm     self.run_jobs(process, job_order_object, logger, runtime_context)
WARNING:toil.leader:b/h/jobtOprVm   File "/home/forsmark/venv2/local/lib/python2.7/site-packages/cwltool/executors.py", line 173, in run_jobs
WARNING:toil.leader:b/h/jobtOprVm     raise WorkflowException(Text(err))
WARNING:toil.leader:b/h/jobtOprVm WorkflowException: [Errno 18] Invalid cross-device link
WARNING:toil.leader:b/h/jobtOprVm ERROR:toil.worker:Exiting the worker because of a failed job on host dockervm
WARNING:toil.leader:b/h/jobtOprVm WARNING:toil.jobGraph:Due to failure we are reducing the remaining retry count of job 'file:///home/forsmark/wf.cwl#copy/ddb166bf-0825-46da-912d-75af8116b927' cp b/h/jobtOprVm with ID b/h/jobtOprVm to 0
WARNING:toil.leader:Job 'file:///home/forsmark/wf.cwl#copy/ddb166bf-0825-46da-912d-75af8116b927' cp b/h/jobtOprVm with ID b/h/jobtOprVm is completely failed
INFO:toil.leader:Finished toil run with 2 failed jobs.
INFO:toil.leader:Failed jobs at end of the run: 'file:///home/forsmark/wf.cwl#copy/ddb166bf-0825-46da-912d-75af8116b927' cp b/h/jobtOprVm 'CWLWorkflow' h/7/jobk6icGt
Traceback (most recent call last):
  File "/home/forsmark/venv2/bin/cwltoil", line 10, in <module>
    sys.exit(main())
  File "/home/forsmark/venv2/local/lib/python2.7/site-packages/toil/cwl/cwltoil.py", line 1220, in main
    outobj = toil.start(wf1)
  File "/home/forsmark/venv2/local/lib/python2.7/site-packages/toil/common.py", line 784, in start
    return self._runMainLoop(rootJobGraph)
  File "/home/forsmark/venv2/local/lib/python2.7/site-packages/toil/common.py", line 1059, in _runMainLoop
    jobCache=self._jobCache).run()
  File "/home/forsmark/venv2/local/lib/python2.7/site-packages/toil/leader.py", line 237, in run
    raise FailedJobsException(self.config.jobStore, self.toilState.totalFailedJobs, self.jobStore)
toil.leader.FailedJobsException
```
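The traceback ends in os.link inside fileJobStore.readFile; on this code path there is apparently no cross-device fallback, so the worker dies with EXDEV instead of degrading to a copy as in the single-tool runs above. A sketch of the kind of guard I would expect there (illustrative only, not the actual Toil code):

```python
import errno
import os
import shutil

def read_file_safely(job_store_path, local_path):
    # Illustrative only: try the cheap hard link first, then fall back
    # to a full copy when the two paths live on different filesystems.
    try:
        os.link(job_store_path, local_path)
    except OSError as e:
        if e.errno != errno.EXDEV:
            raise
        shutil.copyfile(job_store_path, local_path)
```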
It works fine if the input data is on the same filesystem as the jobstore (/nfs in this case). As expected, it creates copies of the intermediate files in the workdir because the jobstore is on a separate filesystem. In fact, it creates five(!) separate copies of every intermediate file.
```
$ cwltoil --clean never --cleanWorkDir never --jobStore /nfs/jobstore/test --workDir /tmp/workdir wf.cwl --file /nfs/large.file
...
$ readlink /nfs/jobstore/test/tmp/n/F/tmpoUzGUJ-writeGlobalFileWrapper-copy.file /tmp/workdir/toil-f63e92c0-8ff9-482a-95af-522e22dcaeca-f88cf264a5bf4a72bcc41f04517ae518/tmpKQ6JCI/b00a339e-73c5-4347-b24f-8cc70b2d39f3/tmp*.tmp /tmp/workdir/toil-f63e92c0-8ff9-482a-95af-522e22dcaeca-f88cf264a5bf4a72bcc41f04517ae518/tmpKQ6JCI/b4fe723f-df73-46af-bb22-08a6e8b99f3f/tmp*.tmp
$ ls -i /nfs/jobstore/test/tmp/n/F/tmpoUzGUJ-writeGlobalFileWrapper-copy.file /tmp/workdir/toil-f63e92c0-8ff9-482a-95af-522e22dcaeca-f88cf264a5bf4a72bcc41f04517ae518/*/*/tmp*.tmp
12817898 /nfs/jobstore/test/tmp/n/F/tmpoUzGUJ-writeGlobalFileWrapper-copy.file
270669 /tmp/workdir/toil-f63e92c0-8ff9-482a-95af-522e22dcaeca-f88cf264a5bf4a72bcc41f04517ae518/tmpKQ6JCI/b00a339e-73c5-4347-b24f-8cc70b2d39f3/tmpIj9WVo.tmp
270670 /tmp/workdir/toil-f63e92c0-8ff9-482a-95af-522e22dcaeca-f88cf264a5bf4a72bcc41f04517ae518/tmpKQ6JCI/b00a339e-73c5-4347-b24f-8cc70b2d39f3/tmpT7lo1y.tmp
270659 /tmp/workdir/toil-f63e92c0-8ff9-482a-95af-522e22dcaeca-f88cf264a5bf4a72bcc41f04517ae518/tmpKQ6JCI/b4fe723f-df73-46af-bb22-08a6e8b99f3f/tmpSTpP55.tmp
270655 /tmp/workdir/toil-f63e92c0-8ff9-482a-95af-522e22dcaeca-f88cf264a5bf4a72bcc41f04517ae518/tmpKQ6JCI/b4fe723f-df73-46af-bb22-08a6e8b99f3f/tmpqQWwa1.tmp
$ md5sum /nfs/jobstore/test/tmp/n/F/tmpoUzGUJ-writeGlobalFileWrapper-copy.file /tmp/workdir/toil-f63e92c0-8ff9-482a-95af-522e22dcaeca-f88cf264a5bf4a72bcc41f04517ae518/tmpKQ6JCI/b00a339e-73c5-4347-b24f-8cc70b2d39f3/tmp*.tmp /tmp/workdir/toil-f63e92c0-8ff9-482a-95af-522e22dcaeca-f88cf264a5bf4a72bcc41f04517ae518/tmpKQ6JCI/b4fe723f-df73-46af-bb22-08a6e8b99f3f/tmp*.tmp
ade18cde4adc34fa8c6804fc955c69ef /nfs/jobstore/test/tmp/n/F/tmpoUzGUJ-writeGlobalFileWrapper-copy.file
ade18cde4adc34fa8c6804fc955c69ef /tmp/workdir/toil-f63e92c0-8ff9-482a-95af-522e22dcaeca-f88cf264a5bf4a72bcc41f04517ae518/tmpKQ6JCI/b00a339e-73c5-4347-b24f-8cc70b2d39f3/tmpIj9WVo.tmp
ade18cde4adc34fa8c6804fc955c69ef /tmp/workdir/toil-f63e92c0-8ff9-482a-95af-522e22dcaeca-f88cf264a5bf4a72bcc41f04517ae518/tmpKQ6JCI/b00a339e-73c5-4347-b24f-8cc70b2d39f3/tmpT7lo1y.tmp
ade18cde4adc34fa8c6804fc955c69ef /tmp/workdir/toil-f63e92c0-8ff9-482a-95af-522e22dcaeca-f88cf264a5bf4a72bcc41f04517ae518/tmpKQ6JCI/b4fe723f-df73-46af-bb22-08a6e8b99f3f/tmpSTpP55.tmp
ade18cde4adc34fa8c6804fc955c69ef /tmp/workdir/toil-f63e92c0-8ff9-482a-95af-522e22dcaeca-f88cf264a5bf4a72bcc41f04517ae518/tmpKQ6JCI/b4fe723f-df73-46af-bb22-08a6e8b99f3f/tmpqQWwa1.tmp
```
Even though workdirs are removed by default, this still creates redundant overhead. If the intermediate files are large, the time to finish the workflow can be several orders of magnitude longer. For a file-based jobstore you should never need to copy data from the jobstore to the workdir (the opposite direction is often a different story, of course). Switching to symlinks for workdir inputs should improve performance significantly.
All of this is a particular problem when running cwltoil inside a Docker container, where hard links cannot be created between bind mounts and the container filesystem.
My suggestion would be to give the workdir the same behavior as the jobstore (also respecting --noLinkImports).
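Concretely, the staging order I am suggesting, sketched under the assumption that a flag equivalent to --noLinkImports controls symlink use (the names here are mine, not Toil's):

```python
import os
import shutil

def stage_into_workdir(job_store_path, workdir_path, symlinks_allowed=True):
    # Proposed staging order for file-based jobstores:
    #   1. symlink (near-free, works across devices)
    #   2. hard link (free, same device only)
    #   3. copy (last resort)
    if symlinks_allowed:
        os.symlink(job_store_path, workdir_path)
        return
    try:
        os.link(job_store_path, workdir_path)
    except OSError:
        shutil.copyfile(job_store_path, workdir_path)
```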
As a side note, the reference implementation cwltool always creates symlinks in "tmpdir" and "tmp-outdir" (with the exception of tools with DockerRequirement, where bind mounts are used instead of links).
I’m running toil[cwl]==3.18.0 in a Python 2.7.15rc1 virtualenv.
Top GitHub Comments
I’m working on this; if anyone wants to provide some early feedback: https://github.com/mberacochea/toil/commit/e19479417c905ea43611180f630ada3f8b26fe83
Closing; we believe this was fixed by #3445.