Toil does user work in local jobs in CWL workflows. Is chaining to blame?
Running Toil version 4.2.0a1-3f4e790c1fa32a2c2257bd5cd203a3b51d5d5661
Batch scheduler: LSF
Toil thinks there are running jobs, but we don’t see them in bqueue. The log looks like this:
6 jobs are running, 1 jobs are issued and waiting to run
(the same line repeats 15 times in total)
┆Issue is synchronized with this Jira Task ┆Issue Number: TOIL-566
Issue Analytics
- State:
- Created 3 years ago
- Comments:12 (6 by maintainers)
Top GitHub Comments
Local jobs are jobs that are run on the leader machine, and aren’t sent through LSF or whatever other cluster you’re using. There’s a list of their names here, and they’re identified/determined by name; I think the point might be to make sure that Toil jobs that are just part of the CWL interpreter don’t appear to users alongside the jobs that are actual work specified by the CWL workflow.
I think something has gone wrong with this system, because it isn’t supposed to result in actual user work (like `hmmscan` itself) running on the leader. I think the chaining system (which allows one job to immediately go on to run another later job, if it fits in the same resource allotment, rather than submitting it for scheduling) might be causing trouble here. The internal, local CWL job is creating a job meant to do actual work, which is then getting chained to and run on the leader instead of sent off for scheduling through LSF like it should be. Running the workflow with `--disableChaining` might be a workaround, and we should be able to fix the chaining system to not operate on these internal CWL jobs.

In terms of detecting the legitimate jobs running on the leader, you can use `ps` or `tree` to look for child processes of the Toil leader. All the local jobs run in their own processes as the `_toil_worker` executable. So if you see any of those running, you know Toil is trying to do work on the leader and isn’t just stuck.

OK, @KeteSakharova, that log definitely shows that the problem is stuck local jobs, and nothing to do with `bjobs` or LSF.

Look at this excerpt:
It wants to run a command locally (`_toil_worker CWLWorkflow file:/nfs/public/release/metagenomics_scratch/pipeline-5/job-store-test_fa_wf kind-CWLWorkflow/instance-hco0q1nq`). It starts it up as PID 26921, and never reports that that process ends. (Compare to what it says about child PID 65331, which does end.) At the end of the workflow, when you presumably kill the whole workflow, the child process running this job seems to still be alive (as far as Toil knows), and the workflow is still waiting for it. Depending on how long it ought to take and how long you waited for it, it might even be doing useful work; the log doesn’t have good timestamps.

I think the way to debug this is to just run that command (`_toil_worker CWLWorkflow file:/nfs/public/release/metagenomics_scratch/pipeline-5/job-store-test_fa_wf kind-CWLWorkflow/instance-hco0q1nq`) yourself, without the leader running. That will run just the offending, stuck job by itself on the local machine. It should print a log, and that log will probably mention another file that it logs to that you would want to look in.

If the command finishes in a timely fashion (or if it complains that the job no longer exists; they go away when they complete), then the problem is in the Toil leader’s ability to detect that its child processes are finished.
If the command runs for an unacceptable amount of time, then its own logs may illuminate why that is. You also might be able to run it under a Python debugger to see where it is getting stuck, if you are familiar with one.
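As a rough sketch of that check, the Python snippet below runs a command with a time limit and reports which of the two cases applies. The helper name `classify_run` and any timeout value are illustrative assumptions, not part of Toil.

```python
import subprocess


def classify_run(command, timeout_seconds):
    """Run `command` and report whether it finished within the time limit.

    Hypothetical helper for illustration; not part of Toil.
    Returns ("finished", exit_code) or ("timed_out", None).
    """
    try:
        # subprocess.run kills the child process and raises TimeoutExpired
        # if the time limit passes before the command exits.
        result = subprocess.run(command, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        return ("timed_out", None)
    return ("finished", result.returncode)
```

One could pass it the worker command from the log as a list of arguments (`_toil_worker`, `CWLWorkflow`, the job store locator, and the job ID), with a limit that matches how long the job ought to take. A "finished" result points at the leader's child-process tracking; a "timed_out" result points at the job itself. Note that on timeout only the direct child is killed, so any processes the worker started may need cleaning up by hand.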
You could also try posting the file `kind-CWLWorkflow/instance-hco0q1nq`, which can be found in the job store (`find /nfs/public/release/metagenomics_scratch/pipeline-5/job-store-test_fa_wf | grep instance-hco0q1nq` will probably uncover its full path). That file has the serialized job, and if it is obnoxiously large or looks like it wants to download 10 million input files or something, that could be a plausible explanation for why it is getting stuck. It also can probably be inspected to uncover exactly what part of the CWL workflow it represents. These job files are Python pickle files, and they can be inspected in an interactive Python shell by just reading the file contents and fluffing them up into a Python object with the `pickle` module. So if you send it along we can look at it and see if there’s anything obviously wrong with it.
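A minimal sketch of that inspection, assuming the job file is a single plain pickle as described, and that it is run in a Python environment where Toil (and anything else the job references) is importable, since unpickling needs those classes. The function name `load_job_file` is hypothetical. Only unpickle files from a job store you trust, because loading a pickle can execute arbitrary code.

```python
import os
import pickle


def load_job_file(path):
    """Return (size_in_bytes, unpickled_object) for a job file.

    Hypothetical helper for illustration; assumes the file holds a
    single plain pickle.
    """
    # The size alone can already hint at an unexpectedly large job.
    size = os.path.getsize(path)
    with open(path, "rb") as handle:
        job = pickle.load(handle)
    return size, job
```

In an interactive shell, one could call `size, job = load_job_file(full_path)` with the path found by the `find` command, then look at `type(job)` and, if the object has instance attributes, `vars(job)` to see what the job carries.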