Workers idle even though there's queued work

What happened: We have a large-ish cluster (about 100 nodes), and recently, when we submit a lot of jobs (in the thousands), we notice that about 60% of the cluster is idle. Generally, a job will spawn about 20 downstream sub-jobs; these are submitted from inside the worker, which calls secede / rejoin while it waits on those jobs. I’m fairly certain this use of secede / rejoin is related, as you can see in the reproduction below.

What you expected to happen: The cluster uses all available resources

Minimal Complete Verifiable Example:

This requires running a scheduler, a worker with two procs, and then submitting jobs. Bear with me while I show all the pieces:

This is how I create the environment:

#!/bin/bash
python3 -m venv env
source env/bin/activate
pip install "dask[distributed,diagnostics,delayed]==2020.12.0"

…and this is the Python file (saved as stuff.py, which the submission script below runs) with the jobs. The two operations are:

  1. Submit a child job that waits for X seconds, and wait on that job, calling secede / rejoin while we wait.
  2. Instantly print out a message.

import time
import dask.distributed
import sys


def long_running_job(seconds):
    print(f"Doing the actual work ({seconds}s)")
    time.sleep(seconds)
    print(f"Finished working ({seconds}s)")


def root_job(seconds):
    # Runs on a worker: submit a child job, secede from the worker's thread
    # pool while blocking on its result, then rejoin afterwards.
    client = dask.distributed.Client(address="tcp://127.0.0.1:8786")
    futures = client.map(long_running_job, [seconds])
    print(f"Submitted long running job ({seconds}s); seceding while we wait")
    dask.distributed.secede()
    client.gather(futures)
    print(f"Job done ({seconds}s); rejoining")
    dask.distributed.rejoin()


def other_job(message):
    print(message)


if __name__ == "__main__":
    client = dask.distributed.Client(address="tcp://127.0.0.1:8786")
    if sys.argv[1] == "wait":
        # Submit a root job that spawns a child job and waits on it.
        future = client.submit(root_job, int(sys.argv[2]))
        dask.distributed.fire_and_forget(future)
    elif sys.argv[1] == "message":
        # Submit a trivial job that just prints a message.
        future = client.submit(other_job, sys.argv[2])
        dask.distributed.fire_and_forget(future)
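
As an aside (not from the original report), the manual secede / gather / rejoin in root_job can also be written with the worker_client context manager, which secedes on entry and rejoins on exit. A rough sketch, assuming it lives in the same stuff.py next to long_running_job; it reuses the worker's own client rather than opening a new connection, so it is assumed rather than verified to show the same scheduling behaviour:

def root_job_with_worker_client(seconds):
    # Hypothetical alternative to root_job: worker_client() secedes from the
    # worker's thread pool on entry and rejoins on exit.
    with dask.distributed.worker_client() as client:
        futures = client.map(long_running_job, [seconds])
        print(f"Submitted long running job ({seconds}s); waiting via worker_client")
        client.gather(futures)
        print(f"Job done ({seconds}s)")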

Finally, this is the script that submits the jobs and demonstrates the issue we’re running into:

#!/bin/bash

# start a scheduler in one terminal:
# $ dask-scheduler
# ...and a worker in another:
# $ dask-worker tcp://10.0.2.15:8786 --nthreads 1 --nprocs 2
# then run the below:

python stuff.py wait 120
sleep 1
python stuff.py wait 60
sleep 1
python stuff.py message instantaneous-job

This script submits a long job, a shorter job, and then an instantaneous job, to show that there’s a scheduling problem. When the jobs are submitted, the worker prints:

Submitted long running job (120s); seceding while we wait
Doing the actual work (120s)
Submitted long running job (60s); seceding while we wait
Doing the actual work (60s)
Finished working (60s)
Job done (60s); rejoining
Finished working (120s)
instantaneous-job
Job done (120s); rejoining

The problem is on the line that says Job done (60s); rejoining. At this point there’s one idle worker that could be running the instantaneous job but it doesn’t – instead it waits on the 120s job. After the 120s job is done (about a minute later) that instantaneous job finally runs. Hence the worker is idle for about a minute.
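
To see where the stuck task is sitting during that idle minute, the scheduler’s view can be inspected from a separate Python shell. A sketch, not part of the original report, assuming Client.processing() and Client.scheduler_info() are available in this version of distributed:

import dask.distributed

# Throwaway client used only to inspect the scheduler's view of the cluster.
client = dask.distributed.Client(address="tcp://127.0.0.1:8786")

# Task keys the scheduler currently has assigned to each worker.
for worker, tasks in client.processing().items():
    print(worker, tasks)

# Per-worker metrics (executing, ready, in_memory, ...) as the scheduler sees them.
for worker, info in client.scheduler_info()["workers"].items():
    print(worker, info.get("metrics", {}))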

Anything else we need to know?: Sorry for the length; I don’t think I can cut it down any more. If the problem isn’t clear let me know and I’ll see if I can explain better.

Environment:

  • Dask version: 2020.12.0
  • Python version: 3.6.9
  • Operating System: Ubuntu 20.04
  • Install method (conda, pip, source): pip

Top GitHub Comments

1 reaction
fjetter commented, Feb 17, 2021

Haven’t looked deeply into this yet, and there are differences due to seceding / long-running jobs, but a similar issue was reported in #4471.

0 reactions
chrisroat commented, Mar 30, 2021

I am the author of the related issue, and am also forced to over-provision. Is there any direction on where to look for issues? I’m spending some time this week learning the scheduler so as to look into this and other issues I’m having.
