Dask Operator becomes unresponsive after ~1hr
This one’s a bit tricky. I haven’t been able to reproduce it in a kind or minikube cluster; it only happens in our hosted environments (e.g., AKS, EKS). The Dask operator deployment works great for a while and then becomes unresponsive. I’ve seen it happen after being up for 1 hr and for 2 hrs (pretty much on the nose).
The status of the pod is running, but the logs are frozen and the operator is unresponsive to KubeCluster instantiation. Has anyone seen this? If not, I’d really appreciate any guidance on how to efficiently narrow in on the root cause (e.g., increasing log levels, inspecting heartbeats, operator health/status queries).
Anything else we need to know?:
Note that restarting the Dask operator deployment via kubectl rollout restart deployment dask-kubernetes-operator -n dask-operator recovers the operator after 60+ seconds (which is how long it takes to terminate the pod once frozen).
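For anyone who wants to script this workaround while the root cause is unknown, here is a minimal sketch that wraps the exact kubectl command above in Python. The function names are made up for illustration, it assumes kubectl is on the PATH and already pointed at the right cluster, and it only automates the restart; it does not fix the freeze.

```python
import subprocess


def restart_command(deployment="dask-kubernetes-operator", namespace="dask-operator"):
    """Build the kubectl invocation quoted in the issue (defaults taken from it)."""
    return ["kubectl", "rollout", "restart", "deployment", deployment, "-n", namespace]


def restart_operator(run=subprocess.run):
    """Run the restart; `run` is injectable so the call can be dry-run or tested.

    Assumption: kubectl is installed and configured for the target cluster.
    """
    return run(restart_command(), check=True)


# Usage (not executed here): restart_operator()
```

Since a frozen pod takes 60+ seconds to terminate, anything that calls this should wait at least that long before creating a new cluster.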
Here are the final lines of the dask-kubernetes-operator pod logs:
[2022-11-28 00:45:46,745] kopf.objects [DEBUG ] [6556cd3b-4385-4d5d-98dd-af70b0ad8f80/sedaro-dask] Updating diff: (('change', ('status', 'phase'), 'Created', 'Running'),)
[2022-11-28 00:45:46,745] kopf.objects [DEBUG ] [6556cd3b-4385-4d5d-98dd-af70b0ad8f80/sedaro-dask] Patching with: <redacted>
[2022-11-28 00:45:46,754] kubernetes_asyncio.c [DEBUG ] response body: <redacted>
[2022-11-28 00:45:46,764] kubernetes_asyncio.c [DEBUG ] response body: <redacted>
[2022-11-28 00:45:46,764] kopf.objects [INFO ] [6556cd3b-4385-4d5d-98dd-af70b0ad8f80/sedaro-dask-default] Successfully adopted by sedaro-dask
[2022-11-28 00:45:46,784] kubernetes_asyncio.c [DEBUG ] response body: <redacted>
[2022-11-28 00:45:46,801] kubernetes_asyncio.c [DEBUG ] response body: <redacted>
[2022-11-28 00:45:46,805] kopf.objects [INFO ] [6556cd3b-4385-4d5d-98dd-af70b0ad8f80/sedaro-dask-default] Scaled worker group sedaro-dask-default up to 1 workers.
[2022-11-28 00:45:46,805] kopf.objects [INFO ] [6556cd3b-4385-4d5d-98dd-af70b0ad8f80/sedaro-dask-default] Handler 'daskworkergroup_create' succeeded.
[2022-11-28 00:45:46,806] kopf.objects [INFO ] [6556cd3b-4385-4d5d-98dd-af70b0ad8f80/sedaro-dask-default] Creation is processed: 1 succeeded; 0 failed.
[2022-11-28 00:45:46,806] kopf.objects [DEBUG ] [6556cd3b-4385-4d5d-98dd-af70b0ad8f80/sedaro-dask-default] Patching with: <redacted>
[2022-11-28 00:45:46,858] kopf.objects [DEBUG ] [6556cd3b-4385-4d5d-98dd-af70b0ad8f80/sedaro-dask] Something has changed, but we are not interested (the essence is the same).
[2022-11-28 00:45:46,858] kopf.objects [DEBUG ] [6556cd3b-4385-4d5d-98dd-af70b0ad8f80/sedaro-dask] Handling cycle is finished, waiting for new changes.
[2022-11-28 00:45:46,927] kopf.objects [DEBUG ] [6556cd3b-4385-4d5d-98dd-af70b0ad8f80/sedaro-dask-default] Something has changed, but we are not interested (the essence is the same).
[2022-11-28 00:45:46,927] kopf.objects [DEBUG ] [6556cd3b-4385-4d5d-98dd-af70b0ad8f80/sedaro-dask-default] Handling cycle is finished, waiting for new changes.
[2022-11-28 00:46:32,107] kopf.objects [DEBUG ] [6556cd3b-4385-4d5d-98dd-af70b0ad8f80/sedaro-dask] Deleted, really deleted, and we are notified.
[2022-11-28 00:46:32,152] kopf.objects [DEBUG ] [6556cd3b-4385-4d5d-98dd-af70b0ad8f80/sedaro-dask-default] Deleted, really deleted, and we are notified.
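One cheap way to tell when the operator stopped logging is to pull the timestamp off the last log line and compare it to the current time. Below is a rough sketch using only the standard library. The helper names and the 10-minute threshold are arbitrary choices of mine, and it assumes every line of interest starts with a timestamp in the format shown above. An idle operator with nothing to do would also look "stale", so this only narrows down the time of the freeze rather than proving one.

```python
import re
from datetime import datetime, timedelta

# Matches the leading timestamp of the log lines quoted above,
# e.g. "[2022-11-28 00:45:46,745] kopf.objects [DEBUG ] ..."
TIMESTAMP_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3})\]")


def parse_timestamp(line):
    """Return the datetime at the start of a log line, or None if absent."""
    match = TIMESTAMP_RE.match(line)
    if match is None:
        return None
    base = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
    return base + timedelta(milliseconds=int(match.group(2)))


def last_activity(lines):
    """Return the latest timestamp found in the given log lines, or None."""
    stamps = [ts for ts in map(parse_timestamp, lines) if ts is not None]
    return max(stamps) if stamps else None


def is_stale(lines, now, threshold=timedelta(minutes=10)):
    """True if the newest log line is older than `threshold` relative to `now`."""
    latest = last_activity(lines)
    return latest is None or now - latest > threshold


# Two of the lines from the excerpt above, shortened for the example.
sample = [
    "[2022-11-28 00:45:46,927] kopf.objects [DEBUG ] Handling cycle is finished, waiting for new changes.",
    "[2022-11-28 00:46:32,152] kopf.objects [DEBUG ] Deleted, really deleted, and we are notified.",
]
latest = last_activity(sample)
```

The lines could come from kubectl logs output saved to a file; comparing the last timestamp with the pod start time would confirm whether the freeze really lands on the 1 hr or 2 hr mark.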
Environment:
- Dask version: 2022.10.1
- Python version: 3.9.15
- Operating System: Linux
- Install method (conda, pip, source): pip
Top GitHub Comments
Awesome, thanks @jacobtomlinson. I will get this tested and closed ASAP!
@baswelsh once #626 passes CI I’ll merge it and release 2022.11.2 so you can try it out.