
Toil does user work in local jobs in CWL workflows. Is chaining to blame?

See original GitHub issue

Running Toil version 4.2.0a1-3f4e790c1fa32a2c2257bd5cd203a3b51d5d5661
Batch scheduler: LSF

Toil thinks there are running jobs, but we don't see them in bqueue.

The leader log looks like this:

6 jobs are running, 1 jobs are issued and waiting to run
6 jobs are running, 1 jobs are issued and waiting to run
6 jobs are running, 1 jobs are issued and waiting to run
6 jobs are running, 1 jobs are issued and waiting to run
6 jobs are running, 1 jobs are issued and waiting to run
6 jobs are running, 1 jobs are issued and waiting to run
6 jobs are running, 1 jobs are issued and waiting to run
6 jobs are running, 1 jobs are issued and waiting to run
6 jobs are running, 1 jobs are issued and waiting to run
6 jobs are running, 1 jobs are issued and waiting to run
6 jobs are running, 1 jobs are issued and waiting to run
6 jobs are running, 1 jobs are issued and waiting to run
6 jobs are running, 1 jobs are issued and waiting to run
6 jobs are running, 1 jobs are issued and waiting to run
6 jobs are running, 1 jobs are issued and waiting to run

https://github.com/EBI-Metagenomics/pipeline-v5/blob/master/workflows/subworkflows/raw_reads/functional_annotation_raw.cwl

https://github.com/EBI-Metagenomics/pipeline-v5/blob/master/workflows/conditionals/raw-reads/raw-reads-2.cwl

Issue is synchronized with Jira task TOIL-566.

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 12 (6 by maintainers)

Top GitHub Comments

1 reaction
adamnovak commented, Jun 18, 2020

Local jobs are jobs that run on the leader machine and aren't sent through LSF or whatever other cluster you're using. There's a list of their names here, and they're identified by name; I think the point might be to make sure that Toil jobs that are just part of the CWL interpreter don't appear to users alongside the jobs that do the actual work specified by the CWL workflow.

I think something has gone wrong with this system, because it isn’t supposed to result in actual user work (like hmmscan itself) running on the leader. I think the chaining system (which allows one job to immediately go on to run another later job, if it fits in the same resource allotment, rather than submitting it for scheduling) might be causing trouble here. The internal, local CWL job is creating a job meant to do actual work, which is then getting chained to and run on the leader instead of sent off for scheduling through LSF like it should be. Running the workflow with --disableChaining might be a workaround, and we should be able to fix the chaining system to not operate on these internal CWL jobs.
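For concreteness, a rough sketch of that workaround (the workflow and input file names here are placeholders I've made up, not taken from this issue; --jobStore, --batchSystem, and --disableChaining are the real Toil options):

toil-cwl-runner --jobStore /path/to/job-store --batchSystem lsf --disableChaining workflow.cwl inputs.yml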

In terms of detecting the legitimate jobs running on the leader, you can use ps or pstree to look for child processes of the Toil leader. All the local jobs run in their own processes as the _toil_worker executable. So if you see any of those running, you know Toil is trying to do work on the leader and isn't just stuck.
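For example (a generic sketch rather than commands from this issue; the leader PID is a placeholder you would look up first):

pstree -p <leader PID>
ps -ef | grep _toil_worker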

1 reaction
adamnovak commented, Jun 16, 2020

OK, @KeteSakharova, that log definitely shows that the problem is stuck local jobs, and nothing to do with bjobs or LSF.

Look at this excerpt:

Issuing local command: _toil_worker CWLWorkflow file:/nfs/public/release/metagenomics_scratch/pipeline-5/job-store-test_fa_wf kind-CWLWorkflow/instance-hco0q1nq with memory: 42949672960, cores: 8, disk: 2147483648
Issued job 'CWLWorkflow' kind-CWLWorkflow/instance-hco0q1nq with job batch system ID: 13 and cores: 8, disk: 2.0 G, and memory: 40.0 G
...
Launched job 13 as child 26921
...
Found 0 jobs running on the batch scheduler and 3 jobs running locally
...

It wants to run a command locally (_toil_worker CWLWorkflow file:/nfs/public/release/metagenomics_scratch/pipeline-5/job-store-test_fa_wf kind-CWLWorkflow/instance-hco0q1nq). It starts it up as PID 26921, and never reports that that process ends. (Compare to what it says about child PID 65331, which does end.) At the end of the workflow, when you presumably kill the whole workflow, the child process running this job seems to still be alive (as far as Toil knows), and the workflow is still waiting for it. Depending on how long it ought to take and how long you waited for it, it might even be doing useful work; the log doesn’t have good timestamps.

I think the way to debug this is to just run that command (_toil_worker CWLWorkflow file:/nfs/public/release/metagenomics_scratch/pipeline-5/job-store-test_fa_wf kind-CWLWorkflow/instance-hco0q1nq) yourself, without the leader running. That will run just the offending, stuck job by itself on the local machine. It should print a log, and that log will probably mention another file that it logs to that you would want to look in.

If the command finishes in a timely fashion (or if it complains that the job no longer exists; they go away when they complete), then the problem is in the Toil leader’s ability to detect that its child processes are finished.

If the command runs for an unacceptable amount of time, then its own logs may illuminate why that is. You also might be able to run it under a Python debugger to see where it is getting stuck, if you are familiar with one.
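One possible way to do that, assuming _toil_worker resolves to an ordinary Python entry-point script on your system so pdb can execute it directly (this invocation is my sketch; only the job arguments themselves come from the log above):

python -m pdb "$(command -v _toil_worker)" CWLWorkflow file:/nfs/public/release/metagenomics_scratch/pipeline-5/job-store-test_fa_wf kind-CWLWorkflow/instance-hco0q1nq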

You could also try posting the file kind-CWLWorkflow/instance-hco0q1nq which can be found in the job store (find /nfs/public/release/metagenomics_scratch/pipeline-5/job-store-test_fa_wf | grep instance-hco0q1nq will probably uncover its full path). That file has the serialized job, and if it is obnoxiously large or looks like it wants to download 10 million input files or something, that could be a plausible explanation for why it is getting stuck. It also can probably be inspected to uncover exactly what part of the CWL workflow it represents. These job files are Python pickle files, and they can be inspected in an interactive Python shell by just reading the file contents and fluffing them up into a Python object with the pickle module. So if you send it along we can look at it and see if there’s anything obviously wrong with it.
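A minimal sketch of that inspection (the path below is a placeholder for whatever the find | grep command turns up, and it assumes Toil is importable in the same Python environment, since unpickling needs the job's class definitions):

import pickle

# Placeholder path: substitute the full path that find | grep reports.
job_file = '/nfs/public/release/metagenomics_scratch/pipeline-5/job-store-test_fa_wf/<...>/instance-hco0q1nq'

with open(job_file, 'rb') as f:
    job = pickle.load(f)   # may raise ImportError if Toil/workflow modules aren't importable

print(job)        # repr of the serialized job
print(vars(job))  # its attributes, to see what work it actually describes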

