Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Dask Operator becomes unresponsive after ~1hr

See original GitHub issue

This one’s a bit tricky. I haven’t been able to reproduce it in a kind or minikube cluster; it only happens in our hosted environments (e.g., AKS, EKS). The Dask operator deployment works great for a while and then becomes unresponsive. I’ve seen it happen after being up for 1 hour and for 2 hours (pretty much on the nose).

The status of the pod is Running, but the logs are frozen and the operator is unresponsive to KubeCluster instantiation. Has anyone seen this? If not, I’d really appreciate guidance on how to efficiently narrow down the root cause (e.g., increasing log levels, inspecting heartbeats, querying operator health/status).
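One client-side way to tell a hung operator apart from a slow cluster start is to wrap a throwaway KubeCluster creation in a hard timeout. This is only a minimal sketch, assuming the operator-based KubeCluster from dask_kubernetes.operator; the cluster name, namespace, and timeout are placeholder values, not anything from the issue:

# Hypothetical probe: a hung operator surfaces as a timeout instead of an
# indefinite hang. Name/namespace below are placeholders.
import concurrent.futures

from dask_kubernetes.operator import KubeCluster


def operator_responsive(timeout=120):
    """Return True if the operator reconciles a new DaskCluster within `timeout` seconds."""

    def _create_and_delete():
        cluster = KubeCluster(name="operator-probe", namespace="default", n_workers=0)
        cluster.close()
        return True

    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(_create_and_delete)
        try:
            return future.result(timeout=timeout)
        except concurrent.futures.TimeoutError:
            # The worker thread may stay blocked if the operator never responds;
            # acceptable for a one-shot diagnostic script.
            return False


if __name__ == "__main__":
    print("operator responsive:", operator_responsive())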

Anything else we need to know?:

Note that restarting the dask operator deployment via kubectl rollout restart deployment dask-kubernetes-operator -n dask-operator recovers the operator after roughly 60 seconds (how long it takes to terminate the pod once it has frozen).
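Until the root cause is found, that restart can be scripted. The sketch below mirrors what kubectl rollout restart does (bumping the kubectl.kubernetes.io/restartedAt annotation on the pod template) using the official kubernetes Python client; the deployment name and namespace are taken from the command above, and wiring it to a schedule or a failed health probe is left out:

# Sketch: replicate `kubectl rollout restart` from Python by patching the
# pod-template restart annotation, which forces the Deployment to roll.
import datetime

from kubernetes import client, config


def rollout_restart(deployment="dask-kubernetes-operator", namespace="dask-operator"):
    config.load_kube_config()  # use config.load_incluster_config() when running in-cluster
    apps = client.AppsV1Api()
    restarted_at = datetime.datetime.utcnow().isoformat() + "Z"
    patch = {
        "spec": {
            "template": {
                "metadata": {
                    "annotations": {
                        "kubectl.kubernetes.io/restartedAt": restarted_at
                    }
                }
            }
        }
    }
    apps.patch_namespaced_deployment(name=deployment, namespace=namespace, body=patch)


if __name__ == "__main__":
    rollout_restart()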

Here are the final lines of the dask-kubernetes-operator pod logs:

[2022-11-28 00:45:46,745] kopf.objects         [DEBUG   ] [6556cd3b-4385-4d5d-98dd-af70b0ad8f80/sedaro-dask] Updating diff: (('change', ('status', 'phase'), 'Created', 'Running'),)
[2022-11-28 00:45:46,745] kopf.objects         [DEBUG   ] [6556cd3b-4385-4d5d-98dd-af70b0ad8f80/sedaro-dask] Patching with: <redacted>
[2022-11-28 00:45:46,754] kubernetes_asyncio.c [DEBUG   ] response body: <redacted>
[2022-11-28 00:45:46,764] kubernetes_asyncio.c [DEBUG   ] response body: <redacted>
[2022-11-28 00:45:46,764] kopf.objects         [INFO    ] [6556cd3b-4385-4d5d-98dd-af70b0ad8f80/sedaro-dask-default] Successfully adopted by sedaro-dask
[2022-11-28 00:45:46,784] kubernetes_asyncio.c [DEBUG   ] response body: <redacted>
[2022-11-28 00:45:46,801] kubernetes_asyncio.c [DEBUG   ] response body: <redacted>
[2022-11-28 00:45:46,805] kopf.objects         [INFO    ] [6556cd3b-4385-4d5d-98dd-af70b0ad8f80/sedaro-dask-default] Scaled worker group sedaro-dask-default up to 1 workers.
[2022-11-28 00:45:46,805] kopf.objects         [INFO    ] [6556cd3b-4385-4d5d-98dd-af70b0ad8f80/sedaro-dask-default] Handler 'daskworkergroup_create' succeeded.
[2022-11-28 00:45:46,806] kopf.objects         [INFO    ] [6556cd3b-4385-4d5d-98dd-af70b0ad8f80/sedaro-dask-default] Creation is processed: 1 succeeded; 0 failed.
[2022-11-28 00:45:46,806] kopf.objects         [DEBUG   ] [6556cd3b-4385-4d5d-98dd-af70b0ad8f80/sedaro-dask-default] Patching with: <redacted>
[2022-11-28 00:45:46,858] kopf.objects         [DEBUG   ] [6556cd3b-4385-4d5d-98dd-af70b0ad8f80/sedaro-dask] Something has changed, but we are not interested (the essence is the same).
[2022-11-28 00:45:46,858] kopf.objects         [DEBUG   ] [6556cd3b-4385-4d5d-98dd-af70b0ad8f80/sedaro-dask] Handling cycle is finished, waiting for new changes.
[2022-11-28 00:45:46,927] kopf.objects         [DEBUG   ] [6556cd3b-4385-4d5d-98dd-af70b0ad8f80/sedaro-dask-default] Something has changed, but we are not interested (the essence is the same).
[2022-11-28 00:45:46,927] kopf.objects         [DEBUG   ] [6556cd3b-4385-4d5d-98dd-af70b0ad8f80/sedaro-dask-default] Handling cycle is finished, waiting for new changes.
[2022-11-28 00:46:32,107] kopf.objects         [DEBUG   ] [6556cd3b-4385-4d5d-98dd-af70b0ad8f80/sedaro-dask] Deleted, really deleted, and we are notified.
[2022-11-28 00:46:32,152] kopf.objects         [DEBUG   ] [6556cd3b-4385-4d5d-98dd-af70b0ad8f80/sedaro-dask-default] Deleted, really deleted, and we are notified.

Environment:

  • Dask version: 2022.10.1
  • Python version: 3.9.15
  • Operating System: Linux
  • Install method (conda, pip, source): pip

Issue Analytics

  • State: open
  • Created: 10 months ago
  • Comments: 22 (21 by maintainers)

Top GitHub Comments

1 reaction
baswelsh commented, Nov 30, 2022

Awesome, thanks @jacobtomlinson. I will get this tested and closed ASAP!

1 reaction
jacobtomlinson commented, Nov 29, 2022

@baswelsh once #626 passes CI I’ll merge it and release 2022.11.2 so you can try it out.

Read more comments on GitHub >

Top Results From Across the Web

Issues · dask/dask-kubernetes - GitHub
Contribute to dask/dask-kubernetes development by creating an account on GitHub. ... Dask Operator becomes unresponsive after ~1hr bug operator.
Read more >
Dask: handling unresponsive workers - python - Stack Overflow
The problem is that tasks submitted on these workers (they become unresponsive after receiving a task, maybe when loading the environment) ...
Read more >
Dask Under the Hood: Scheduler Refactor - Coiled
You'll notice that the environment hangs for a noticeable period. That's because Dask is decomposing your initial, tiny high-level graph (“make ...
Read more >
Dask Kubernetes Operator
We are excited to announce that the Dask Kubernetes Operator is now generally available 🎉! Notable new features include: Dask Clusters are now ......
Read more >
