
Cluster hangs with a few tasks in "processing" state but no cpu load on any workers

See original GitHub issue

This problem is stochastic. It seems to occur more frequently when there is more sharing of data between workers. map_overlap calls seem particularly problematic.
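For context, a map_overlap call of this general shape is the kind of operation in question; the array size, chunking, and filter below are hypothetical (the actual pipeline is not shown in the report), but any such call forces neighbouring chunks to exchange overlap regions between workers:

import dask.array as da
from scipy.ndimage import gaussian_filter

# Hypothetical array and chunking; map_overlap exchanges a "depth"-wide halo
# between neighbouring chunks, which means worker-to-worker transfers.
x = da.random.random((4096, 4096, 512), chunks=(512, 512, 128))
smoothed = x.map_overlap(
    gaussian_filter,
    depth=8,              # halo width shared between adjacent chunks
    boundary="reflect",   # how to pad at the array edges
    sigma=2.0,            # forwarded to gaussian_filter
)
result = smoothed.compute()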

The cluster is set up using dask_jobqueue.LSFCluster and dask.distributed.Client:

from dask_jobqueue import LSFCluster
from dask.distributed import Client

cluster = LSFCluster(
    cores=cores,            # threads Dask uses per job
    ncpus=ncpus,            # CPUs requested from LSF per job
    memory=memory,          # memory limit Dask enforces per job
    mem=mem,                # memory requested from LSF per job (bytes)
    walltime=walltime,
    env_extra=env_extra,    # extra lines added to the generated job script
    **kwargs,
)
client = Client(cluster)
cluster.scale(jobs=njobs)   # number of LSF jobs (workers) to request

Workers are all allocated properly, and the bash scripts invoking LSF all look fine. The task graph starts to execute, but then it gets hung up and sits indefinitely in this type of state:

[Two screenshots (2021-04-09) showing the hung task graph state]

No workers show any CPU activity (2-4% for all workers). env_extra above makes sure all MKL, BLAS, and OpenMP environment variables are set to 2 threads per core (which should be fine with hyperthreading?).
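For illustration, env_extra along these lines would do that; the exact values used are not shown in the report, so the entries below are an assumption. Each entry becomes a line in the generated LSF job script:

# Assumed values; the report only states the variables are set to 2 threads.
env_extra = [
    "export MKL_NUM_THREADS=2",
    "export OPENBLAS_NUM_THREADS=2",
    "export OMP_NUM_THREADS=2",
]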

When I click on the red task on the left of the graph I see: hung_cluster_last_task_left.pdf

When I click on the red task on the right of the graph (second to last column) I see: hung_cluster_last_task.pdf

For the red task on the right, the two “workers with data” show:

[Two screenshots (2021-04-09) of the two "workers with data"]

I’ve let these hang for upwards of 30 minutes with no meaningful CPU activity on any workers before killing the cluster manually. I can’t let it run any longer than that because I’m paying for cluster time, so I don’t know whether it’s just (intractably) slow or totally hung. By comparison, the entire rest of the task graph executed in less than 180 seconds.
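One way to tell "slow" from "hung" without burning much more cluster time (a suggestion, not something from the original report) is to ask the scheduler and workers what they think is running while the graph appears stuck:

# Hypothetical diagnostic, run from the client while tasks sit in "processing".
print(client.processing())   # tasks the scheduler believes each worker is executing
print(client.call_stack())   # actual Python call stacks of tasks running on workers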

Any pointers as to what could be causing this or how to permanently avoid it would be really appreciated.

  • Dask version: 2020.12.0
  • Python version: 3.8.5
  • Operating System: CentOS
  • Install method (conda, pip, source): pip

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 11 (5 by maintainers)

Top GitHub Comments

1 reaction
abergou commented, Apr 21, 2021

@fjetter I think this may be another instance of the issues you’re working on.

1 reaction
GFleishman commented, Apr 20, 2021

I agree that it’s difficult to reproduce. On several occasions I’ve convinced myself that the problem was solved, only to find out on the next big run that it wasn’t. I think it must have something to do with worker-to-worker communication of dependencies and/or task states, which can be disrupted for a number of reasons and then potentially isn’t reset properly by the scheduler. That’s all speculation, but the error logs and behavior so far point that way. On some lucky runs I think it’s possible that the disrupting events just don’t occur (e.g. they might depend on network traffic).

Read more comments on GitHub.

Top Results From Across the Web

  • Troubleshoot Amazon ECS tasks stuck in the PENDING state: Some common scenarios that can cause your ECS task to be stuck in the PENDING state include the following: The Docker daemon is...
  • Two years with Celery in Production: Bug Fix Edition - Medium: Workers stayed idle, not consuming any tasks. The queue had available tasks, and healthy consumers, but the tasks weren't being consumed. This...
  • Performance, hang, or high CPU issues on Windows - IBM: If you are experiencing performance degradation, hang, no response, hung threads, CPU starvation, high CPU utilization, network delays, ...
  • Fix common cluster issues | Elasticsearch Guide [8.5] | Elastic: If Elasticsearch estimates an operation would exceed a circuit breaker, it stops the operation and returns an error. High CPU usage: The most...
  • High Load Average with modest CPU Utilization and almost ...: You may want to experiment with number of cores per VM versus total number of VMs. Some applications will benefit from many cores...
