Cluster hangs with a few tasks in "processing" state but no cpu load on any workers
This problem is stochastic. It seems to occur more frequently when there is more sharing of data between workers; map_overlap calls seem particularly problematic.
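For reference, here is a minimal sketch of the kind of map_overlap workload being described; the array shape, chunking, depth, and mapped function are illustrative assumptions, not the actual pipeline:

import dask.array as da

# Illustrative only: sizes, depth, and the mapped function are assumptions.
x = da.random.random((4096, 4096), chunks=(512, 512))
y = x.map_overlap(lambda block: block * 2, depth=16, boundary="reflect")
result = y.compute()  # on bad runs, a few tasks stall in "processing" here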
The cluster is set up using dask_jobqueue.LSFCluster and dask.distributed.Client:
from dask_jobqueue import LSFCluster
from dask.distributed import Client

cluster = LSFCluster(
    cores=cores,          # total cores per job used by Dask
    ncpus=ncpus,          # CPUs to request from LSF
    memory=memory,        # total memory per job used by Dask
    mem=mem,              # memory to request from LSF
    walltime=walltime,
    env_extra=env_extra,  # extra shell lines prepended to each job script (see below)
    **kwargs,
)
client = Client(cluster)
cluster.scale(jobs=njobs)  # number of jobs (workers)
Workers are all allocated properly, and the bash scripts invoking LSF all look fine. The task graph starts to execute, but then gets hung up and sits indefinitely in this type of state (dashboard screenshot attached to the issue):
No workers show any CPU activity (2-4% for all workers). The env_extra list above makes sure the MKL, BLAS, and OpenMP environment variables are all set to 2 threads per core (which should be fine with hyperthreading?).
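For context, a hedged sketch of what that env_extra list might look like; the thread count of 2 comes from the description above, but the specific export lines are assumptions:

# Assumed export lines; env_extra entries are prepended to each LSF job script.
env_extra = [
    "export OMP_NUM_THREADS=2",
    "export MKL_NUM_THREADS=2",
    "export OPENBLAS_NUM_THREADS=2",
]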
When I click on the red task on the left of the graph I see: hung_cluster_last_task_left.pdf
When I click on the red task on the right of the graph (second to last column) I see: hung_cluster_last_task.pdf
For the red task on the right, the two “workers with data” are shown in another screenshot attached to the issue.
I’ve let these hang for upwards of 30 minutes with no meaningful CPU activity on any workers before killing the cluster manually. I can’t let it run any longer because I’m paying for cluster time, so I don’t know whether it is just intractably slow or completely hung. By comparison, the entire rest of the task graph executed in less than 180 seconds.
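As an aside, one way to probe whether the stuck tasks are slow or truly hung (not something tried in the original report) is to ask the scheduler what the workers claim to be doing; this assumes the client created above is still connected:

# Hedged diagnostic sketch; `client` is the Client created above.
# Which keys each worker believes it is currently processing:
print(client.processing())

# Live call stacks of currently executing tasks; an empty result for a
# worker that supposedly holds a "processing" task suggests a deadlock
# rather than a slow computation:
print(client.call_stack())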
Any pointers as to what could be causing this or how to permanently avoid it would be really appreciated.
Environment:
- Dask version: 2020.12.0
- Python version: 3.8.5
- Operating System: CentOS
- Install method (conda, pip, source): pip
Top GitHub Comments
@fjetter I think this may be another instance of the issues you’re working on.
I agree that it’s difficult to reproduce. On several occasions I’ve convinced myself that the problem was solved, only to find out on the next big run that it wasn’t. I think it must have something to do with worker-to-worker communication of dependencies and/or task states, which can be disrupted for a number of reasons and then potentially is not reset properly by the scheduler. That’s all speculation, but the error logs and behavior so far point that way. On some lucky runs the disrupting events may simply not occur (e.g. they might depend on network traffic).
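Since those error logs are scattered across workers, here is a hedged sketch of one way to collect them centrally (not part of the original thread; assumes `client` is still connected to the hung cluster):

# Pull recent log entries from the scheduler and every worker via the client.
scheduler_logs = client.get_scheduler_logs(n=100)  # last ~100 scheduler log entries
worker_logs = client.get_worker_logs(n=100)        # dict: worker address -> log entries
for address, entries in worker_logs.items():
    print(address, len(entries), "entries")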