[Bug] Deletion of Ray clusters hangs while Ray operator is still up
Search before asking
- I searched the issues and found no similar issues.
Ray Component
Ray Clusters
What happened + What you expected to happen
Problem
After executing kubectl delete raycluster <raycluster-name>, I see that the command hangs. It looks like this is because a Kubernetes finalizer is preventing the deletion of the resource. Since the Ray operator is up, I would expect it to lift the finalizer eventually, but I do not see that happening.
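To confirm that it is the finalizer holding back the deletion, a minimal check along these lines can be used (assuming the RayCluster is named ray in the ray namespace, as in the output further below):

# Show the finalizers and deletion timestamp on the stuck RayCluster.
kubectl -n ray get raycluster ray \
  -o jsonpath='{.metadata.finalizers}{"\n"}{.metadata.deletionTimestamp}{"\n"}'
# A set deletionTimestamp together with the kopf.zalando.org/KopfFinalizerMarker
# finalizer means the API server is waiting for the operator's deletion handler
# to finish and remove the finalizer before the object can actually be removed.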
Reproduce
I deployed the Ray helm chart onto an AKS cluster, did not run any operations, waited for some time, and then tried to delete the resource. The deletion, however, hangs.
Tried
- I am already aware that patching the finalizers field to null deletes the resource immediately (see the sketch after this list). However, I have been having issues bringing up new Ray clusters with the same name (which is necessary for our use case), so I cannot rely on this option every time.
- We killed and restarted the Ray operator pod, which terminates all Ray clusters marked for deletion. However, I am not sure whether this method is sustainable.
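For reference, a sketch of the two workarounds above; the resource name, namespace, and operator pod name are taken from the outputs shown below and may differ in other setups:

# Workaround 1: clear the finalizers so the API server deletes the object immediately
# (this skips whatever cleanup the operator's deletion handler would have performed).
kubectl -n ray patch raycluster ray --type=merge -p '{"metadata":{"finalizers":null}}'

# Workaround 2: restart the operator pod; its controller recreates it, and on startup
# it processes the RayClusters that are still marked for deletion.
kubectl -n ray delete pod ray-operator-b4cdbf848-qfn7r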
Additional Questions
- What happens if I restart the Ray operator while other RayClusters are active?
- What is the finalizer condition here, and is it safe to disable it?
Logs
I first see that the finalizer has been added to the RayCluster resource.
apiVersion: cluster.ray.io/v1
kind: RayCluster
metadata:
annotations:
kopf.zalando.org/last-handled-configuration: |
{"spec":{"headPodType":"rayHeadType","headStartRayCommands":["ray stop","ulimit -n 65536; ray start --head --port=6379 --no-monitor --dashboard-host 0.0.0.0"],"idleTimeoutMinutes":5,"maxWorkers":3,"podTypes":[{"maxWorkers":0,"minWorkers":0,"name":"rayHeadType","podConfig":{"apiVersion":"v1","kind":"Pod","metadata":{"generateName":"ray-head-type-"},"spec":{"containers":[{"args":["trap : TERM INT; sleep infinity & wait;"],"command":["/bin/bash","-c","--"],"env":[{"name":"RAY_gcs_server_rpc_server_thread_num","value":"1"}],"image":"rayproject/ray:latest","imagePullPolicy":"Always","name":"ray-node","ports":[{"containerPort":6379,"protocol":"TCP"},{"containerPort":10001,"protocol":"TCP"},{"containerPort":8265,"protocol":"TCP"},{"containerPort":8000,"protocol":"TCP"}],"resources":{"limits":{"cpu":1,"memory":"512Mi"},"requests":{"cpu":1,"memory":"512Mi"}},"volumeMounts":[{"mountPath":"/dev/shm","name":"dshm"}]}],"restartPolicy":"Never","volumes":[{"emptyDir":{"medium":"Memory"},"name":"dshm"}]}}},{"maxWorkers":3,"minWorkers":2,"name":"rayWorkerType","podConfig":{"apiVersion":"v1","kind":"Pod","metadata":{"generateName":"ray-worker-type-"},"spec":{"containers":[{"args":["trap : TERM INT; sleep infinity & wait;"],"command":["/bin/bash","-c","--"],"env":[{"name":"RAY_gcs_server_rpc_server_thread_num","value":"1"}],"image":"rayproject/ray:latest","imagePullPolicy":"Always","name":"ray-node","ports":[{"containerPort":6379,"protocol":"TCP"},{"containerPort":10001,"protocol":"TCP"},{"containerPort":8265,"protocol":"TCP"},{"containerPort":8000,"protocol":"TCP"}],"resources":{"limits":{"cpu":1,"memory":"512Mi"},"requests":{"cpu":1,"memory":"512Mi"}},"volumeMounts":[{"mountPath":"/dev/shm","name":"dshm"}]}],"restartPolicy":"Never","volumes":[{"emptyDir":{"medium":"Memory"},"name":"dshm"}]}}}],"upscalingSpeed":1,"workerStartRayCommands":["ray stop","ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379"]},"metadata":{"labels":{"app.kubernetes.io/managed-by":"Helm"},"annotations":{"meta.helm.sh/release-name":"ray","meta.helm.sh/release-namespace":"ray"}},"status":{"autoscalerRetries":0}}
meta.helm.sh/release-name: ray
meta.helm.sh/release-namespace: ray
creationTimestamp: "2022-02-18T07:43:26Z"
finalizers:
- kopf.zalando.org/KopfFinalizerMarker
generation: 1
I see that the operator is up and running as well.
NAME READY STATUS RESTARTS AGE
ray-operator-b4cdbf848-qfn7r 1/1 Running 0 15h
ray-ray-head-type-8fbvd 1/1 Running 0 15h
ray-ray-worker-type-jjv77 1/1 Running 0 15h
ray-ray-worker-type-k6r4d 1/1 Running 0 15h
I also see that the operator is properly monitoring the RayCluster resource.
======== Autoscaler status: 2022-02-18 14:59:47.002896 ========
Node status
---------------------------------------------------------------
Healthy:
1 rayHeadType
2 rayWorkerType
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/3.0 CPU
0.00/1.050 GiB memory
0.00/0.374 GiB object_store_memory
Demands:
(no resource demands)
ray,ray:2022-02-18 14:59:47,037 DEBUG gcs_utils.py:253 -- internal_kv_put b'__autoscaling_status_legacy' b"Cluster status: 2 nodes\n - MostDelayedHeartbeats: {'10.244.0.59': 0.15790367126464844, '10.244.0.60': 0.15787005424499512, '10.244.1.86': 0.15784144401550293}\n - NodeIdleSeconds: Min=54958 Mean=54961 Max=54969\n - ResourceUsage: 0.0/3.0 CPU, 0.0 GiB/1.05 GiB memory, 0.0 GiB/0.37 GiB object_store_memory\n - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0\nWorker node types:\n - rayWorkerType: 2" True None
ray,ray:2022-02-18 14:59:47,038 DEBUG legacy_info_string.py:24 -- Cluster status: 2 nodes
- MostDelayedHeartbeats: {'10.244.0.59': 0.15790367126464844, '10.244.0.60': 0.15787005424499512, '10.244.1.86': 0.15784144401550293}
- NodeIdleSeconds: Min=54958 Mean=54961 Max=54969
- ResourceUsage: 0.0/3.0 CPU, 0.0 GiB/1.05 GiB memory, 0.0 GiB/0.37 GiB object_store_memory
- TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0
Worker node types:
- rayWorkerType: 2
ray,ray:2022-02-18 14:59:47,112 DEBUG autoscaler.py:1148 -- ray-ray-worker-type-jjv77 is not being updated and passes config check (can_update=True).
ray,ray:2022-02-18 14:59:47,137 DEBUG autoscaler.py:1148 -- ray-ray-worker-type-k6r4d is not being updated and passes config check (can_update=True).
ray,ray:2022-02-18 14:59:47,163 DEBUG autoscaler.py:1148 -- ray-ray-worker-type-jjv77 is not being updated and passes config check (can_update=True).
ray,ray:2022-02-18 14:59:47,180 DEBUG autoscaler.py:1148 -- ray-ray-worker-type-k6r4d is not being updated and passes config check (can_update=True).
ray,ray:2022-02-18 14:59:47,315 DEBUG resource_demand_scheduler.py:189 -- Cluster resources: [{'object_store_memory': 135085670.0, 'memory': 375809638.0, 'node:10.244.0.59': 1.0, 'CPU': 1.0}, {'node:10.244.0.60': 1.0, 'object_store_memory': 132774297.0, 'memory': 375809638.0, 'CPU': 1.0}, {'CPU': 1.0, 'memory': 375809638.0, 'object_store_memory': 133639372.0, 'node:10.244.1.86': 1.0}]
ray,ray:2022-02-18 14:59:47,315 DEBUG resource_demand_scheduler.py:190 -- Node counts: defaultdict(<class 'int'>, {'rayHeadType': 1, 'rayWorkerType': 2})
ray,ray:2022-02-18 14:59:47,315 DEBUG resource_demand_scheduler.py:201 -- Placement group demands: []
ray,ray:2022-02-18 14:59:47,315 DEBUG resource_demand_scheduler.py:247 -- Resource demands: []
ray,ray:2022-02-18 14:59:47,315 DEBUG resource_demand_scheduler.py:248 -- Unfulfilled demands: []
ray,ray:2022-02-18 14:59:47,315 DEBUG resource_demand_scheduler.py:252 -- Final unfulfilled: []
ray,ray:2022-02-18 14:59:47,366 DEBUG resource_demand_scheduler.py:271 -- Node requests: {}
ray,ray:2022-02-18 14:59:47,427 DEBUG gcs_utils.py:253 -- internal_kv_put b'__autoscaling_status' b'{"load_metrics_report": {"usage": {"node:10.244.0.59": [0.0, 1.0], "CPU": [0.0, 3.0], "object_store_memory": [0.0, 401499339.0], "memory": [0.0, 1127428914.0], "node:10.244.0.60": [0.0, 1.0], "node:10.244.1.86": [0.0, 1.0]}, "resource_demand": [], "pg_demand": [], "request_demand": [], "node_types": [[{"node:10.244.0.59": 1.0, "CPU": 1.0, "memory": 375809638.0, "object_store_memory": 135085670.0}, 1], [{"memory": 375809638.0, "CPU": 1.0, "object_store_memory": 132774297.0, "node:10.244.0.60": 1.0}, 1], [{"memory": 375809638.0, "CPU": 1.0, "object_store_memory": 133639372.0, "node:10.244.1.86": 1.0}, 1]], "head_ip": null}, "time": 1645225186.8461728, "monitor_pid": 58, "autoscaler_report": {"active_nodes": {"rayHeadType": 1, "rayWorkerType": 2}, "pending_nodes": [], "pending_launches": {}, "failed_nodes": []}}' True None
ray,ray:2022-02-18 14:59:52,430 DEBUG gcs_utils.py:238 -- internal_kv_get b'autoscaler_resource_request' None
ray,ray:2022-02-18 14:59:52,675 INFO autoscaler.py:304 --
======== Autoscaler status: 2022-02-18 14:59:52.675515 ========
Node status
---------------------------------------------------------------
Healthy:
1 rayHeadType
2 rayWorkerType
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/3.0 CPU
0.00/1.050 GiB memory
0.00/0.374 GiB object_store_memory
Demands:
(no resource demands)
ray,ray:2022-02-18 14:59:52,712 DEBUG gcs_utils.py:253 -- internal_kv_put b'__autoscaling_status_legacy' b"Cluster status: 2 nodes\n - MostDelayedHeartbeats: {'10.244.0.59': 0.2449197769165039, '10.244.0.60': 0.24487924575805664, '10.244.1.86': 0.24484586715698242}\n - NodeIdleSeconds: Min=54963 Mean=54967 Max=54974\n - ResourceUsage: 0.0/3.0 CPU, 0.0 GiB/1.05 GiB memory, 0.0 GiB/0.37 GiB object_store_memory\n - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0\nWorker node types:\n - rayWorkerType: 2" True None
ray,ray:2022-02-18 14:59:52,713 DEBUG legacy_info_string.py:24 -- Cluster status: 2 nodes
- MostDelayedHeartbeats: {'10.244.0.59': 0.2449197769165039, '10.244.0.60': 0.24487924575805664, '10.244.1.86': 0.24484586715698242}
- NodeIdleSeconds: Min=54963 Mean=54967 Max=54974
- ResourceUsage: 0.0/3.0 CPU, 0.0 GiB/1.05 GiB memory, 0.0 GiB/0.37 GiB object_store_memory
- TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0
Worker node types:
- rayWorkerType: 2
ray,ray:2022-02-18 14:59:52,791 DEBUG autoscaler.py:1148 -- ray-ray-worker-type-jjv77 is not being updated and passes config check (can_update=True).
ray,ray:2022-02-18 14:59:52,817 DEBUG autoscaler.py:1148 -- ray-ray-worker-type-k6r4d is not being updated and passes config check (can_update=True).
ray,ray:2022-02-18 14:59:52,843 DEBUG autoscaler.py:1148 -- ray-ray-worker-type-jjv77 is not being updated and passes config check (can_update=True).
ray,ray:2022-02-18 14:59:52,864 DEBUG autoscaler.py:1148 -- ray-ray-worker-type-k6r4d is not being updated and passes config check (can_update=True).
ray,ray:2022-02-18 14:59:52,966 DEBUG resource_demand_scheduler.py:189 -- Cluster resources: [{'CPU': 1.0, 'node:10.244.0.59': 1.0, 'memory': 375809638.0, 'object_store_memory': 135085670.0}, {'memory': 375809638.0, 'object_store_memory': 132774297.0, 'node:10.244.0.60': 1.0, 'CPU': 1.0}, {'memory': 375809638.0, 'node:10.244.1.86': 1.0, 'CPU': 1.0, 'object_store_memory': 133639372.0}]
ray,ray:2022-02-18 14:59:52,966 DEBUG resource_demand_scheduler.py:190 -- Node counts: defaultdict(<class 'int'>, {'rayHeadType': 1, 'rayWorkerType': 2})
ray,ray:2022-02-18 14:59:52,966 DEBUG resource_demand_scheduler.py:201 -- Placement group demands: []
ray,ray:2022-02-18 14:59:52,966 DEBUG resource_demand_scheduler.py:247 -- Resource demands: []
ray,ray:2022-02-18 14:59:52,966 DEBUG resource_demand_scheduler.py:248 -- Unfulfilled demands: []
ray,ray:2022-02-18 14:59:52,966 DEBUG resource_demand_scheduler.py:252 -- Final unfulfilled: []
ray,ray:2022-02-18 14:59:53,030 DEBUG resource_demand_scheduler.py:271 -- Node requests: {}
ray,ray:2022-02-18 14:59:53,092 DEBUG gcs_utils.py:253 -- internal_kv_put b'__autoscaling_status' b'{"load_metrics_report": {"usage": {"CPU": [0.0, 3.0], "object_store_memory": [0.0, 401499339.0], "node:10.244.0.59": [0.0, 1.0], "memory": [0.0, 1127428914.0], "node:10.244.0.60": [0.0, 1.0], "node:10.244.1.86": [0.0, 1.0]}, "resource_demand": [], "pg_demand": [], "request_demand": [], "node_types": [[{"node:10.244.0.59": 1.0, "CPU": 1.0, "memory": 375809638.0, "object_store_memory": 135085670.0}, 1], [{"memory": 375809638.0, "CPU": 1.0, "object_store_memory": 132774297.0, "node:10.244.0.60": 1.0}, 1], [{"memory": 375809638.0, "CPU": 1.0, "object_store_memory": 133639372.0, "node:10.244.1.86": 1.0}, 1]], "head_ip": null}, "time": 1645225192.4319715, "monitor_pid": 58, "autoscaler_report": {"active_nodes": {"rayHeadType": 1, "rayWorkerType": 2}, "pending_nodes": [], "pending_launches": {}, "failed_nodes": []}}' True None
After attempting to delete the resource with kubectl delete raycluster <raycluster-name>, I see that the resource has been marked for deletion, but the command hangs.
apiVersion: cluster.ray.io/v1
kind: RayCluster
metadata:
annotations:
kopf.zalando.org/last-handled-configuration: |
{"spec":{"headPodType":"rayHeadType","headStartRayCommands":["ray stop","ulimit -n 65536; ray start --head --port=6379 --no-monitor --dashboard-host 0.0.0.0"],"idleTimeoutMinutes":5,"maxWorkers":3,"podTypes":[{"maxWorkers":0,"minWorkers":0,"name":"rayHeadType","podConfig":{"apiVersion":"v1","kind":"Pod","metadata":{"generateName":"ray-head-type-"},"spec":{"containers":[{"args":["trap : TERM INT; sleep infinity & wait;"],"command":["/bin/bash","-c","--"],"env":[{"name":"RAY_gcs_server_rpc_server_thread_num","value":"1"}],"image":"rayproject/ray:latest","imagePullPolicy":"Always","name":"ray-node","ports":[{"containerPort":6379,"protocol":"TCP"},{"containerPort":10001,"protocol":"TCP"},{"containerPort":8265,"protocol":"TCP"},{"containerPort":8000,"protocol":"TCP"}],"resources":{"limits":{"cpu":1,"memory":"512Mi"},"requests":{"cpu":1,"memory":"512Mi"}},"volumeMounts":[{"mountPath":"/dev/shm","name":"dshm"}]}],"restartPolicy":"Never","volumes":[{"emptyDir":{"medium":"Memory"},"name":"dshm"}]}}},{"maxWorkers":3,"minWorkers":2,"name":"rayWorkerType","podConfig":{"apiVersion":"v1","kind":"Pod","metadata":{"generateName":"ray-worker-type-"},"spec":{"containers":[{"args":["trap : TERM INT; sleep infinity & wait;"],"command":["/bin/bash","-c","--"],"env":[{"name":"RAY_gcs_server_rpc_server_thread_num","value":"1"}],"image":"rayproject/ray:latest","imagePullPolicy":"Always","name":"ray-node","ports":[{"containerPort":6379,"protocol":"TCP"},{"containerPort":10001,"protocol":"TCP"},{"containerPort":8265,"protocol":"TCP"},{"containerPort":8000,"protocol":"TCP"}],"resources":{"limits":{"cpu":1,"memory":"512Mi"},"requests":{"cpu":1,"memory":"512Mi"}},"volumeMounts":[{"mountPath":"/dev/shm","name":"dshm"}]}],"restartPolicy":"Never","volumes":[{"emptyDir":{"medium":"Memory"},"name":"dshm"}]}}}],"upscalingSpeed":1,"workerStartRayCommands":["ray stop","ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379"]},"metadata":{"labels":{"app.kubernetes.io/managed-by":"Helm"},"annotations":{"meta.helm.sh/release-name":"ray","meta.helm.sh/release-namespace":"ray"}},"status":{"autoscalerRetries":0}}
meta.helm.sh/release-name: ray
meta.helm.sh/release-namespace: ray
creationTimestamp: "2022-02-18T07:43:26Z"
deletionGracePeriodSeconds: 0
deletionTimestamp: "2022-02-18T23:01:10Z"
finalizers:
- kopf.zalando.org/KopfFinalizerMarker
generation: 2
labels:
app.kubernetes.io/managed-by: Helm
name: ray
namespace: ray
resourceVersion: "80922414"
uid: e9782156-db7f-4795-97eb-680b8b149bd5
I also see that the Ray operator is still monitoring the resource.
======== Autoscaler status: 2022-02-18 15:01:33.084575 ========
Node status
---------------------------------------------------------------
Healthy:
1 rayHeadType
2 rayWorkerType
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/3.0 CPU
0.00/1.050 GiB memory
0.00/0.374 GiB object_store_memory
Demands:
(no resource demands)
ray,ray:2022-02-18 15:01:33,173 DEBUG gcs_utils.py:253 -- internal_kv_put b'__autoscaling_status_legacy' b"Cluster status: 2 nodes\n - MostDelayedHeartbeats: {'10.244.0.59': 0.18732261657714844, '10.244.0.60': 0.1872880458831787, '10.244.1.86': 0.18725895881652832}\n - NodeIdleSeconds: Min=55064 Mean=55067 Max=55075\n - ResourceUsage: 0.0/3.0 CPU, 0.0 GiB/1.05 GiB memory, 0.0 GiB/0.37 GiB object_store_memory\n - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0\nWorker node types:\n - rayWorkerType: 2" True None
ray,ray:2022-02-18 15:01:33,174 DEBUG legacy_info_string.py:24 -- Cluster status: 2 nodes
- MostDelayedHeartbeats: {'10.244.0.59': 0.18732261657714844, '10.244.0.60': 0.1872880458831787, '10.244.1.86': 0.18725895881652832}
- NodeIdleSeconds: Min=55064 Mean=55067 Max=55075
- ResourceUsage: 0.0/3.0 CPU, 0.0 GiB/1.05 GiB memory, 0.0 GiB/0.37 GiB object_store_memory
- TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0
Worker node types:
- rayWorkerType: 2
ray,ray:2022-02-18 15:01:33,262 DEBUG autoscaler.py:1148 -- ray-ray-worker-type-jjv77 is not being updated and passes config check (can_update=True).
ray,ray:2022-02-18 15:01:33,293 DEBUG autoscaler.py:1148 -- ray-ray-worker-type-k6r4d is not being updated and passes config check (can_update=True).
ray,ray:2022-02-18 15:01:33,320 DEBUG autoscaler.py:1148 -- ray-ray-worker-type-jjv77 is not being updated and passes config check (can_update=True).
ray,ray:2022-02-18 15:01:33,336 DEBUG autoscaler.py:1148 -- ray-ray-worker-type-k6r4d is not being updated and passes config check (can_update=True).
ray,ray:2022-02-18 15:01:33,444 DEBUG resource_demand_scheduler.py:189 -- Cluster resources: [{'node:10.244.0.59': 1.0, 'memory': 375809638.0, 'CPU': 1.0, 'object_store_memory': 135085670.0}, {'object_store_memory': 132774297.0, 'node:10.244.0.60': 1.0, 'CPU': 1.0, 'memory': 375809638.0}, {'object_store_memory': 133639372.0, 'memory': 375809638.0, 'node:10.244.1.86': 1.0, 'CPU': 1.0}]
ray,ray:2022-02-18 15:01:33,444 DEBUG resource_demand_scheduler.py:190 -- Node counts: defaultdict(<class 'int'>, {'rayHeadType': 1, 'rayWorkerType': 2})
ray,ray:2022-02-18 15:01:33,445 DEBUG resource_demand_scheduler.py:201 -- Placement group demands: []
ray,ray:2022-02-18 15:01:33,445 DEBUG resource_demand_scheduler.py:247 -- Resource demands: []
ray,ray:2022-02-18 15:01:33,445 DEBUG resource_demand_scheduler.py:248 -- Unfulfilled demands: []
ray,ray:2022-02-18 15:01:33,445 DEBUG resource_demand_scheduler.py:252 -- Final unfulfilled: []
ray,ray:2022-02-18 15:01:33,499 DEBUG resource_demand_scheduler.py:271 -- Node requests: {}
ray,ray:2022-02-18 15:01:33,552 DEBUG gcs_utils.py:253 -- internal_kv_put b'__autoscaling_status' b'{"load_metrics_report": {"usage": {"memory": [0.0, 1127428914.0], "object_store_memory": [0.0, 401499339.0], "node:10.244.0.59": [0.0, 1.0], "CPU": [0.0, 3.0], "node:10.244.0.60": [0.0, 1.0], "node:10.244.1.86": [0.0, 1.0]}, "resource_demand": [], "pg_demand": [], "request_demand": [], "node_types": [[{"memory": 375809638.0, "CPU": 1.0, "node:10.244.0.59": 1.0, "object_store_memory": 135085670.0}, 1], [{"memory": 375809638.0, "CPU": 1.0, "object_store_memory": 132774297.0, "node:10.244.0.60": 1.0}, 1], [{"memory": 375809638.0, "CPU": 1.0, "object_store_memory": 133639372.0, "node:10.244.1.86": 1.0}, 1]], "head_ip": null}, "time": 1645225292.8986323, "monitor_pid": 58, "autoscaler_report": {"active_nodes": {"rayHeadType": 1, "rayWorkerType": 2}, "pending_nodes": [], "pending_launches": {}, "failed_nodes": []}}' True None
ray,ray:2022-02-18 15:01:38,558 DEBUG gcs_utils.py:238 -- internal_kv_get b'autoscaler_resource_request' None
ray,ray:2022-02-18 15:01:38,705 INFO autoscaler.py:304 --
======== Autoscaler status: 2022-02-18 15:01:38.705050 ========
Node status
---------------------------------------------------------------
Healthy:
1 rayHeadType
2 rayWorkerType
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/3.0 CPU
0.00/1.050 GiB memory
0.00/0.374 GiB object_store_memory
Demands:
(no resource demands)
ray,ray:2022-02-18 15:01:38,738 DEBUG gcs_utils.py:253 -- internal_kv_put b'__autoscaling_status_legacy' b"Cluster status: 2 nodes\n - MostDelayedHeartbeats: {'10.244.0.59': 0.14638352394104004, '10.244.0.60': 0.1463487148284912, '10.244.1.86': 0.14631962776184082}\n - NodeIdleSeconds: Min=55069 Mean=55073 Max=55080\n - ResourceUsage: 0.0/3.0 CPU, 0.0 GiB/1.05 GiB memory, 0.0 GiB/0.37 GiB object_store_memory\n - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0\nWorker node types:\n - rayWorkerType: 2" True None
ray,ray:2022-02-18 15:01:38,739 DEBUG legacy_info_string.py:24 -- Cluster status: 2 nodes
- MostDelayedHeartbeats: {'10.244.0.59': 0.14638352394104004, '10.244.0.60': 0.1463487148284912, '10.244.1.86': 0.14631962776184082}
- NodeIdleSeconds: Min=55069 Mean=55073 Max=55080
- ResourceUsage: 0.0/3.0 CPU, 0.0 GiB/1.05 GiB memory, 0.0 GiB/0.37 GiB object_store_memory
- TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0
Worker node types:
- rayWorkerType: 2
ray,ray:2022-02-18 15:01:38,821 DEBUG autoscaler.py:1148 -- ray-ray-worker-type-jjv77 is not being updated and passes config check (can_update=True).
ray,ray:2022-02-18 15:01:38,845 DEBUG autoscaler.py:1148 -- ray-ray-worker-type-k6r4d is not being updated and passes config check (can_update=True).
ray,ray:2022-02-18 15:01:38,872 DEBUG autoscaler.py:1148 -- ray-ray-worker-type-jjv77 is not being updated and passes config check (can_update=True).
ray,ray:2022-02-18 15:01:38,889 DEBUG autoscaler.py:1148 -- ray-ray-worker-type-k6r4d is not being updated and passes config check (can_update=True).
ray,ray:2022-02-18 15:01:38,996 DEBUG resource_demand_scheduler.py:189 -- Cluster resources: [{'node:10.244.0.59': 1.0, 'CPU': 1.0, 'object_store_memory': 135085670.0, 'memory': 375809638.0}, {'memory': 375809638.0, 'node:10.244.0.60': 1.0, 'CPU': 1.0, 'object_store_memory': 132774297.0}, {'object_store_memory': 133639372.0, 'memory': 375809638.0, 'CPU': 1.0, 'node:10.244.1.86': 1.0}]
ray,ray:2022-02-18 15:01:38,996 DEBUG resource_demand_scheduler.py:190 -- Node counts: defaultdict(<class 'int'>, {'rayHeadType': 1, 'rayWorkerType': 2})
ray,ray:2022-02-18 15:01:38,996 DEBUG resource_demand_scheduler.py:201 -- Placement group demands: []
ray,ray:2022-02-18 15:01:38,996 DEBUG resource_demand_scheduler.py:247 -- Resource demands: []
ray,ray:2022-02-18 15:01:38,996 DEBUG resource_demand_scheduler.py:248 -- Unfulfilled demands: []
ray,ray:2022-02-18 15:01:38,996 DEBUG resource_demand_scheduler.py:252 -- Final unfulfilled: []
ray,ray:2022-02-18 15:01:39,045 DEBUG resource_demand_scheduler.py:271 -- Node requests: {}
ray,ray:2022-02-18 15:01:39,095 DEBUG gcs_utils.py:253 -- internal_kv_put b'__autoscaling_status' b'{"load_metrics_report": {"usage": {"memory": [0.0, 1127428914.0], "CPU": [0.0, 3.0], "object_store_memory": [0.0, 401499339.0], "node:10.244.0.59": [0.0, 1.0], "node:10.244.0.60": [0.0, 1.0], "node:10.244.1.86": [0.0, 1.0]}, "resource_demand": [], "pg_demand": [], "request_demand": [], "node_types": [[{"memory": 375809638.0, "CPU": 1.0, "node:10.244.0.59": 1.0, "object_store_memory": 135085670.0}, 1], [{"memory": 375809638.0, "CPU": 1.0, "object_store_memory": 132774297.0, "node:10.244.0.60": 1.0}, 1], [{"memory": 375809638.0, "CPU": 1.0, "object_store_memory": 133639372.0, "node:10.244.1.86": 1.0}, 1]], "head_ip": null}, "time": 1645225298.5599062, "monitor_pid": 58, "autoscaler_report": {"active_nodes": {"rayHeadType": 1, "rayWorkerType": 2}, "pending_nodes": [], "pending_launches": {}, "failed_nodes": []}}' True None
ray,ray:2022-02-18 15:01:44,102 DEBUG gcs_utils.py:238 -- internal_kv_get b'autoscaler_resource_request' None
ray,ray:2022-02-18 15:01:44,262 INFO autoscaler.py:304 --
======== Autoscaler status: 2022-02-18 15:01:44.261951 ========
Node status
---------------------------------------------------------------
Healthy:
1 rayHeadType
2 rayWorkerType
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/3.0 CPU
0.00/1.050 GiB memory
0.00/0.374 GiB object_store_memory
Demands:
(no resource demands)
ray,ray:2022-02-18 15:01:44,296 DEBUG gcs_utils.py:253 -- internal_kv_put b'__autoscaling_status_legacy' b"Cluster status: 2 nodes\n - MostDelayedHeartbeats: {'10.244.0.59': 0.15955543518066406, '10.244.0.60': 0.15950655937194824, '10.244.1.86': 0.15946292877197266}\n - NodeIdleSeconds: Min=55075 Mean=55078 Max=55086\n - ResourceUsage: 0.0/3.0 CPU, 0.0 GiB/1.05 GiB memory, 0.0 GiB/0.37 GiB object_store_memory\n - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0\nWorker node types:\n - rayWorkerType: 2" True None
ray,ray:2022-02-18 15:01:44,297 DEBUG legacy_info_string.py:24 -- Cluster status: 2 nodes
- MostDelayedHeartbeats: {'10.244.0.59': 0.15955543518066406, '10.244.0.60': 0.15950655937194824, '10.244.1.86': 0.15946292877197266}
- NodeIdleSeconds: Min=55075 Mean=55078 Max=55086
- ResourceUsage: 0.0/3.0 CPU, 0.0 GiB/1.05 GiB memory, 0.0 GiB/0.37 GiB object_store_memory
- TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0
Worker node types:
- rayWorkerType: 2
ray,ray:2022-02-18 15:01:44,376 DEBUG autoscaler.py:1148 -- ray-ray-worker-type-jjv77 is not being updated and passes config check (can_update=True).
ray,ray:2022-02-18 15:01:44,401 DEBUG autoscaler.py:1148 -- ray-ray-worker-type-k6r4d is not being updated and passes config check (can_update=True).
ray,ray:2022-02-18 15:01:44,425 DEBUG autoscaler.py:1148 -- ray-ray-worker-type-jjv77 is not being updated and passes config check (can_update=True).
ray,ray:2022-02-18 15:01:44,443 DEBUG autoscaler.py:1148 -- ray-ray-worker-type-k6r4d is not being updated and passes config check (can_update=True).
ray,ray:2022-02-18 15:01:44,543 DEBUG resource_demand_scheduler.py:189 -- Cluster resources: [{'node:10.244.0.59': 1.0, 'CPU': 1.0, 'object_store_memory': 135085670.0, 'memory': 375809638.0}, {'node:10.244.0.60': 1.0, 'object_store_memory': 132774297.0, 'memory': 375809638.0, 'CPU': 1.0}, {'memory': 375809638.0, 'CPU': 1.0, 'object_store_memory': 133639372.0, 'node:10.244.1.86': 1.0}]
ray,ray:2022-02-18 15:01:44,543 DEBUG resource_demand_scheduler.py:190 -- Node counts: defaultdict(<class 'int'>, {'rayHeadType': 1, 'rayWorkerType': 2})
ray,ray:2022-02-18 15:01:44,543 DEBUG resource_demand_scheduler.py:201 -- Placement group demands: []
ray,ray:2022-02-18 15:01:44,543 DEBUG resource_demand_scheduler.py:247 -- Resource demands: []
ray,ray:2022-02-18 15:01:44,543 DEBUG resource_demand_scheduler.py:248 -- Unfulfilled demands: []
ray,ray:2022-02-18 15:01:44,543 DEBUG resource_demand_scheduler.py:252 -- Final unfulfilled: []
ray,ray:2022-02-18 15:01:44,591 DEBUG resource_demand_scheduler.py:271 -- Node requests: {}
ray,ray:2022-02-18 15:01:44,643 DEBUG gcs_utils.py:253 -- internal_kv_put b'__autoscaling_status' b'{"load_metrics_report": {"usage": {"node:10.244.0.59": [0.0, 1.0], "CPU": [0.0, 3.0], "memory": [0.0, 1127428914.0], "object_store_memory": [0.0, 401499339.0], "node:10.244.0.60": [0.0, 1.0], "node:10.244.1.86": [0.0, 1.0]}, "resource_demand": [], "pg_demand": [], "request_demand": [], "node_types": [[{"node:10.244.0.59": 1.0, "CPU": 1.0, "memory": 375809638.0, "object_store_memory": 135085670.0}, 1], [{"memory": 375809638.0, "CPU": 1.0, "object_store_memory": 132774297.0, "node:10.244.0.60": 1.0}, 1], [{"memory": 375809638.0, "CPU": 1.0, "object_store_memory": 133639372.0, "node:10.244.1.86": 1.0}, 1]], "head_ip": null}, "time": 1645225304.1039507, "monitor_pid": 58, "autoscaler_report": {"active_nodes": {"rayHeadType": 1, "rayWorkerType": 2}, "pending_nodes": [], "pending_launches": {}, "failed_nodes": []}}' True None
ray,ray:2022-02-18 15:01:49,650 DEBUG gcs_utils.py:238 -- internal_kv_get b'autoscaler_resource_request' None
ray,ray:2022-02-18 15:01:49,793 INFO autoscaler.py:304 --
======== Autoscaler status: 2022-02-18 15:01:49.793830 ========
Node status
---------------------------------------------------------------
Healthy:
1 rayHeadType
2 rayWorkerType
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/3.0 CPU
0.00/1.050 GiB memory
0.00/0.374 GiB object_store_memory
Demands:
(no resource demands)
ray,ray:2022-02-18 15:01:49,826 DEBUG gcs_utils.py:253 -- internal_kv_put b'__autoscaling_status_legacy' b"Cluster status: 2 nodes\n - MostDelayedHeartbeats: {'10.244.0.59': 0.14335179328918457, '10.244.0.60': 0.14330291748046875, '10.244.1.86': 0.14326024055480957}\n - NodeIdleSeconds: Min=55080 Mean=55084 Max=55091\n - ResourceUsage: 0.0/3.0 CPU, 0.0 GiB/1.05 GiB memory, 0.0 GiB/0.37 GiB object_store_memory\n - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0\nWorker node types:\n - rayWorkerType: 2" True None
ray,ray:2022-02-18 15:01:49,826 DEBUG legacy_info_string.py:24 -- Cluster status: 2 nodes
- MostDelayedHeartbeats: {'10.244.0.59': 0.14335179328918457, '10.244.0.60': 0.14330291748046875, '10.244.1.86': 0.14326024055480957}
- NodeIdleSeconds: Min=55080 Mean=55084 Max=55091
- ResourceUsage: 0.0/3.0 CPU, 0.0 GiB/1.05 GiB memory, 0.0 GiB/0.37 GiB object_store_memory
- TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0
Worker node types:
- rayWorkerType: 2
ray,ray:2022-02-18 15:01:49,900 DEBUG autoscaler.py:1148 -- ray-ray-worker-type-jjv77 is not being updated and passes config check (can_update=True).
ray,ray:2022-02-18 15:01:49,924 DEBUG autoscaler.py:1148 -- ray-ray-worker-type-k6r4d is not being updated and passes config check (can_update=True).
ray,ray:2022-02-18 15:01:49,964 DEBUG autoscaler.py:1148 -- ray-ray-worker-type-jjv77 is not being updated and passes config check (can_update=True).
ray,ray:2022-02-18 15:01:49,981 DEBUG autoscaler.py:1148 -- ray-ray-worker-type-k6r4d is not being updated and passes config check (can_update=True).
ray,ray:2022-02-18 15:01:50,092 DEBUG resource_demand_scheduler.py:189 -- Cluster resources: [{'memory': 375809638.0, 'node:10.244.0.59': 1.0, 'object_store_memory': 135085670.0, 'CPU': 1.0}, {'CPU': 1.0, 'node:10.244.0.60': 1.0, 'memory': 375809638.0, 'object_store_memory': 132774297.0}, {'CPU': 1.0, 'object_store_memory': 133639372.0, 'node:10.244.1.86': 1.0, 'memory': 375809638.0}]
ray,ray:2022-02-18 15:01:50,092 DEBUG resource_demand_scheduler.py:190 -- Node counts: defaultdict(<class 'int'>, {'rayHeadType': 1, 'rayWorkerType': 2})
ray,ray:2022-02-18 15:01:50,092 DEBUG resource_demand_scheduler.py:201 -- Placement group demands: []
ray,ray:2022-02-18 15:01:50,092 DEBUG resource_demand_scheduler.py:247 -- Resource demands: []
ray,ray:2022-02-18 15:01:50,092 DEBUG resource_demand_scheduler.py:248 -- Unfulfilled demands: []
ray,ray:2022-02-18 15:01:50,092 DEBUG resource_demand_scheduler.py:252 -- Final unfulfilled: []
ray,ray:2022-02-18 15:01:50,141 DEBUG resource_demand_scheduler.py:271 -- Node requests: {}
ray,ray:2022-02-18 15:01:50,189 DEBUG gcs_utils.py:253 -- internal_kv_put b'__autoscaling_status' b'{"load_metrics_report": {"usage": {"CPU": [0.0, 3.0], "node:10.244.0.59": [0.0, 1.0], "object_store_memory": [0.0, 401499339.0], "memory": [0.0, 1127428914.0], "node:10.244.0.60": [0.0, 1.0], "node:10.244.1.86": [0.0, 1.0]}, "resource_demand": [], "pg_demand": [], "request_demand": [], "node_types": [[{"node:10.244.0.59": 1.0, "CPU": 1.0, "memory": 375809638.0, "object_store_memory": 135085670.0}, 1], [{"memory": 375809638.0, "CPU": 1.0, "object_store_memory": 132774297.0, "node:10.244.0.60": 1.0}, 1], [{"memory": 375809638.0, "CPU": 1.0, "object_store_memory": 133639372.0, "node:10.244.1.86": 1.0}, 1]], "head_ip": null}, "time": 1645225309.65186, "monitor_pid": 58, "autoscaler_report": {"active_nodes": {"rayHeadType": 1, "rayWorkerType": 2}, "pending_nodes": [], "pending_launches": {}, "failed_nodes": []}}' True None
I would, however, expect to see something like this:
2022-02-16 16:57:26,326 VINFO scripts.py:853 -- Send termination request to `"/home/ray/anaconda3/lib/python3.9/site-packages/ray/core/src/ray/thirdparty/redis/src/redis-server *:50343" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""` (via SIGTERM)
2022-02-16 16:57:26,328 VINFO scripts.py:853 -- Send termination request to `/home/ray/anaconda3/lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name=/tmp/ray/session_2022-02-16_16-41-20_595437_116/sockets/raylet --store_socket_name=/tmp/ray/session_2022-02-16_16-41-20_595437_116/sockets/plasma_store --object_manager_port=0 --min_worker_port=10002 --max_worker_port=19999 --node_manager_port=0 --node_ip_address=10.1.0.34 --redis_address=10.1.0.34 --redis_port=6379 --maximum_startup_concurrency=1 --static_resource_list=node:10.1.0.34,1.0,memory,367001600,object_store_memory,137668608 "--python_worker_command=/home/ray/anaconda3/bin/python /home/ray/anaconda3/lib/python3.9/site-packages/ray/workers/setup_worker.py /home/ray/anaconda3/lib/python3.9/site-packages/ray/workers/default_worker.py --node-ip-address=10.1.0.34 --node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --object-store-name=/tmp/ray/session_2022-02-16_16-41-20_595437_116/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-02-16_16-41-20_595437_116/sockets/raylet --redis-address=10.1.0.34:6379 --temp-dir=/tmp/ray --metrics-agent-port=45522 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 RAY_WORKER_DYNAMIC_OPTION_PLACEHOLDER --redis-password=5241590000000000" --java_worker_command= "--cpp_worker_command=/home/ray/anaconda3/lib/python3.9/site-packages/ray/cpp/default_worker --ray_plasma_store_socket_name=/tmp/ray/session_2022-02-16_16-41-20_595437_116/sockets/plasma_store --ray_raylet_socket_name=/tmp/ray/session_2022-02-16_16-41-20_595437_116/sockets/raylet --ray_node_manager_port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --ray_address=10.1.0.34:6379 --ray_redis_password=5241590000000000 --ray_session_dir=/tmp/ray/session_2022-02-16_16-41-20_595437_116 --ray_logs_dir=/tmp/ray/session_2022-02-16_16-41-20_595437_116/logs --ray_node_ip_address=10.1.0.34 RAY_WORKER_DYNAMIC_OPTION_PLACEHOLDER" --native_library_path=/home/ray/anaconda3/lib/python3.9/site-packages/ray/cpp/lib --redis_password=5241590000000000 --temp_dir=/tmp/ray --session_dir=/tmp/ray/session_2022-02-16_16-41-20_595437_116 --log_dir=/tmp/ray/session_2022-02-16_16-41-20_595437_116/logs --resource_dir=/tmp/ray/session_2022-02-16_16-41-20_595437_116/runtime_resources --metrics-agent-port=45522 --metrics_export_port=43650 --object_store_memory=137668608 --plasma_directory=/dev/shm --ray-debugger-external=0 "--agent_command=/home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.9/site-packages/ray/dashboard/agent.py --node-ip-address=10.1.0.34 --redis-address=10.1.0.34:6379 --metrics-export-port=43650 --dashboard-agent-port=45522 --listen-port=0 --node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --object-store-name=/tmp/ray/session_2022-02-16_16-41-20_595437_116/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-02-16_16-41-20_595437_116/sockets/raylet --temp-dir=/tmp/ray --session-dir=/tmp/ray/session_2022-02-16_16-41-20_595437_116 --runtime-env-dir=/tmp/ray/session_2022-02-16_16-41-20_595437_116/runtime_resources --log-dir=/tmp/ray/session_2022-02-16_16-41-20_595437_116/logs --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --redis-password=5241590000000000"` (via SIGTERM)
Versions / Dependencies
Kubernetes: v1.21
Ray/Python version: coming from the “latest” image
Reproduction script
Running helm install ray <path-to-helm-chart> and leaving the Ray cluster to run for a while reproduces the issue in my case. See above.
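A rough end-to-end sketch of the reproduction (the chart path is a placeholder, and the namespace and resource name are assumptions based on the metadata shown above):

# Install the Ray helm chart (deploys the operator, head, and worker pods).
helm install ray <path-to-helm-chart> -n ray --create-namespace
# Leave the cluster idle for a while (several hours in my case), then delete it.
kubectl -n ray delete raycluster ray
# The delete command hangs; the RayCluster keeps its finalizer and is never removed.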
Anything else
No response
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
Top GitHub Comments
@AnanthCaspex that’s reasonable.
I’ll drop the experimental label on KubeRay from the Ray docs. Indeed, KubeRay has been performing stably for large-scale internal use-cases at Microsoft and ByteDance for ~1.5 years. KubeRay was labeled “experimental” when we (the core Ray team) were just starting to get familiar with it.
Some features (namely autoscaling) are indeed experimental with KubeRay, but that’s already addressed explicitly in the KubeRay docs.
I’d actually recommend taking a look at the KubeRay project, as it will form the basis for Ray’s preferred K8s support in the future. https://github.com/ray-project/kuberay