
[Bug] Deletion of Ray clusters hangs while Ray operator is still up

See original GitHub issue

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Ray Clusters

What happened + What you expected to happen

Problem

After executing kubectl delete raycluster <raycluster-name>, the command hangs. It appears that a Kubernetes finalizer is preventing deletion of the resource. Since the Ray operator is up, I would expect it to remove the finalizer eventually, but I do not see that happening.
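
The stuck state can be confirmed by inspecting the finalizers and deletion timestamp on the custom resource; a minimal sketch, using the resource name and namespace from the manifests below:

kubectl -n ray get raycluster ray -o jsonpath='{.metadata.finalizers}'
kubectl -n ray get raycluster ray -o jsonpath='{.metadata.deletionTimestamp}'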

Reproduce

I deployed the Ray Helm chart onto an AKS cluster, ran no workloads, waited for some time, and then tried to delete the resource. The deletion hangs.

Tried

  1. I am already aware that patching the finalizer to null deletes the resource immediately (a sketch of this is included right after this list). However, I’ve been having trouble bringing up new Ray clusters with the same name (which is necessary in our case), so I can’t rely on this option every time.
  2. We killed and restarted the Ray operator pod, which terminates all Ray clusters marked for deletion. However, I am not sure this approach is sustainable.
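
For completeness, here is a minimal sketch of both workarounds. The resource name and namespace are taken from the manifests below; the operator deployment name is an assumption based on the pod name in the listing further down. Clearing the finalizer bypasses the operator’s cleanup, so treat it as a last resort.

# Workaround 1: clear the finalizer so the API server can finish the deletion
kubectl -n ray patch raycluster ray --type=merge -p '{"metadata":{"finalizers":null}}'

# Workaround 2: restart the operator so it picks up the pending deletion
kubectl -n ray rollout restart deployment/ray-operator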

Additional Questions

  1. What happens if I restart the Ray operator while other RayClusters are active?
  2. What is the finalizer condition here, and is it safe to disable it?

Logs

I first see that the finalizer has been added to the RayCluster resource.

apiVersion: cluster.ray.io/v1
kind: RayCluster
metadata:
  annotations:
    kopf.zalando.org/last-handled-configuration: |
      {"spec":{"headPodType":"rayHeadType","headStartRayCommands":["ray stop","ulimit -n 65536; ray start --head --port=6379 --no-monitor --dashboard-host 0.0.0.0"],"idleTimeoutMinutes":5,"maxWorkers":3,"podTypes":[{"maxWorkers":0,"minWorkers":0,"name":"rayHeadType","podConfig":{"apiVersion":"v1","kind":"Pod","metadata":{"generateName":"ray-head-type-"},"spec":{"containers":[{"args":["trap : TERM INT; sleep infinity & wait;"],"command":["/bin/bash","-c","--"],"env":[{"name":"RAY_gcs_server_rpc_server_thread_num","value":"1"}],"image":"rayproject/ray:latest","imagePullPolicy":"Always","name":"ray-node","ports":[{"containerPort":6379,"protocol":"TCP"},{"containerPort":10001,"protocol":"TCP"},{"containerPort":8265,"protocol":"TCP"},{"containerPort":8000,"protocol":"TCP"}],"resources":{"limits":{"cpu":1,"memory":"512Mi"},"requests":{"cpu":1,"memory":"512Mi"}},"volumeMounts":[{"mountPath":"/dev/shm","name":"dshm"}]}],"restartPolicy":"Never","volumes":[{"emptyDir":{"medium":"Memory"},"name":"dshm"}]}}},{"maxWorkers":3,"minWorkers":2,"name":"rayWorkerType","podConfig":{"apiVersion":"v1","kind":"Pod","metadata":{"generateName":"ray-worker-type-"},"spec":{"containers":[{"args":["trap : TERM INT; sleep infinity & wait;"],"command":["/bin/bash","-c","--"],"env":[{"name":"RAY_gcs_server_rpc_server_thread_num","value":"1"}],"image":"rayproject/ray:latest","imagePullPolicy":"Always","name":"ray-node","ports":[{"containerPort":6379,"protocol":"TCP"},{"containerPort":10001,"protocol":"TCP"},{"containerPort":8265,"protocol":"TCP"},{"containerPort":8000,"protocol":"TCP"}],"resources":{"limits":{"cpu":1,"memory":"512Mi"},"requests":{"cpu":1,"memory":"512Mi"}},"volumeMounts":[{"mountPath":"/dev/shm","name":"dshm"}]}],"restartPolicy":"Never","volumes":[{"emptyDir":{"medium":"Memory"},"name":"dshm"}]}}}],"upscalingSpeed":1,"workerStartRayCommands":["ray stop","ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379"]},"metadata":{"labels":{"app.kubernetes.io/managed-by":"Helm"},"annotations":{"meta.helm.sh/release-name":"ray","meta.helm.sh/release-namespace":"ray"}},"status":{"autoscalerRetries":0}}
    meta.helm.sh/release-name: ray
    meta.helm.sh/release-namespace: ray
  creationTimestamp: "2022-02-18T07:43:26Z"
  finalizers:
  - kopf.zalando.org/KopfFinalizerMarker
  generation: 1

I see that the operator is up and running as well.

NAME                           READY   STATUS    RESTARTS   AGE
ray-operator-b4cdbf848-qfn7r   1/1     Running   0          15h
ray-ray-head-type-8fbvd        1/1     Running   0          15h
ray-ray-worker-type-jjv77      1/1     Running   0          15h
ray-ray-worker-type-k6r4d      1/1     Running   0          15h

I also see the operator is properly monitoring the Ray cluster resource.

======== Autoscaler status: 2022-02-18 14:59:47.002896 ========
Node status
---------------------------------------------------------------
Healthy:
 1 rayHeadType
 2 rayWorkerType
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/3.0 CPU
 0.00/1.050 GiB memory
 0.00/0.374 GiB object_store_memory

Demands:
 (no resource demands)
ray,ray:2022-02-18 14:59:47,037	DEBUG gcs_utils.py:253 -- internal_kv_put b'__autoscaling_status_legacy' b"Cluster status: 2 nodes\n - MostDelayedHeartbeats: {'10.244.0.59': 0.15790367126464844, '10.244.0.60': 0.15787005424499512, '10.244.1.86': 0.15784144401550293}\n - NodeIdleSeconds: Min=54958 Mean=54961 Max=54969\n - ResourceUsage: 0.0/3.0 CPU, 0.0 GiB/1.05 GiB memory, 0.0 GiB/0.37 GiB object_store_memory\n - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0\nWorker node types:\n - rayWorkerType: 2" True None
ray,ray:2022-02-18 14:59:47,038	DEBUG legacy_info_string.py:24 -- Cluster status: 2 nodes
 - MostDelayedHeartbeats: {'10.244.0.59': 0.15790367126464844, '10.244.0.60': 0.15787005424499512, '10.244.1.86': 0.15784144401550293}
 - NodeIdleSeconds: Min=54958 Mean=54961 Max=54969
 - ResourceUsage: 0.0/3.0 CPU, 0.0 GiB/1.05 GiB memory, 0.0 GiB/0.37 GiB object_store_memory
 - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0
Worker node types:
 - rayWorkerType: 2
ray,ray:2022-02-18 14:59:47,112	DEBUG autoscaler.py:1148 -- ray-ray-worker-type-jjv77 is not being updated and passes config check (can_update=True).
ray,ray:2022-02-18 14:59:47,137	DEBUG autoscaler.py:1148 -- ray-ray-worker-type-k6r4d is not being updated and passes config check (can_update=True).
ray,ray:2022-02-18 14:59:47,163	DEBUG autoscaler.py:1148 -- ray-ray-worker-type-jjv77 is not being updated and passes config check (can_update=True).
ray,ray:2022-02-18 14:59:47,180	DEBUG autoscaler.py:1148 -- ray-ray-worker-type-k6r4d is not being updated and passes config check (can_update=True).
ray,ray:2022-02-18 14:59:47,315	DEBUG resource_demand_scheduler.py:189 -- Cluster resources: [{'object_store_memory': 135085670.0, 'memory': 375809638.0, 'node:10.244.0.59': 1.0, 'CPU': 1.0}, {'node:10.244.0.60': 1.0, 'object_store_memory': 132774297.0, 'memory': 375809638.0, 'CPU': 1.0}, {'CPU': 1.0, 'memory': 375809638.0, 'object_store_memory': 133639372.0, 'node:10.244.1.86': 1.0}]
ray,ray:2022-02-18 14:59:47,315	DEBUG resource_demand_scheduler.py:190 -- Node counts: defaultdict(<class 'int'>, {'rayHeadType': 1, 'rayWorkerType': 2})
ray,ray:2022-02-18 14:59:47,315	DEBUG resource_demand_scheduler.py:201 -- Placement group demands: []
ray,ray:2022-02-18 14:59:47,315	DEBUG resource_demand_scheduler.py:247 -- Resource demands: []
ray,ray:2022-02-18 14:59:47,315	DEBUG resource_demand_scheduler.py:248 -- Unfulfilled demands: []
ray,ray:2022-02-18 14:59:47,315	DEBUG resource_demand_scheduler.py:252 -- Final unfulfilled: []
ray,ray:2022-02-18 14:59:47,366	DEBUG resource_demand_scheduler.py:271 -- Node requests: {}
ray,ray:2022-02-18 14:59:47,427	DEBUG gcs_utils.py:253 -- internal_kv_put b'__autoscaling_status' b'{"load_metrics_report": {"usage": {"node:10.244.0.59": [0.0, 1.0], "CPU": [0.0, 3.0], "object_store_memory": [0.0, 401499339.0], "memory": [0.0, 1127428914.0], "node:10.244.0.60": [0.0, 1.0], "node:10.244.1.86": [0.0, 1.0]}, "resource_demand": [], "pg_demand": [], "request_demand": [], "node_types": [[{"node:10.244.0.59": 1.0, "CPU": 1.0, "memory": 375809638.0, "object_store_memory": 135085670.0}, 1], [{"memory": 375809638.0, "CPU": 1.0, "object_store_memory": 132774297.0, "node:10.244.0.60": 1.0}, 1], [{"memory": 375809638.0, "CPU": 1.0, "object_store_memory": 133639372.0, "node:10.244.1.86": 1.0}, 1]], "head_ip": null}, "time": 1645225186.8461728, "monitor_pid": 58, "autoscaler_report": {"active_nodes": {"rayHeadType": 1, "rayWorkerType": 2}, "pending_nodes": [], "pending_launches": {}, "failed_nodes": []}}' True None
ray,ray:2022-02-18 14:59:52,430	DEBUG gcs_utils.py:238 -- internal_kv_get b'autoscaler_resource_request' None
ray,ray:2022-02-18 14:59:52,675	INFO autoscaler.py:304 --
======== Autoscaler status: 2022-02-18 14:59:52.675515 ========
Node status
---------------------------------------------------------------
Healthy:
 1 rayHeadType
 2 rayWorkerType
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/3.0 CPU
 0.00/1.050 GiB memory
 0.00/0.374 GiB object_store_memory

Demands:
 (no resource demands)
ray,ray:2022-02-18 14:59:52,712	DEBUG gcs_utils.py:253 -- internal_kv_put b'__autoscaling_status_legacy' b"Cluster status: 2 nodes\n - MostDelayedHeartbeats: {'10.244.0.59': 0.2449197769165039, '10.244.0.60': 0.24487924575805664, '10.244.1.86': 0.24484586715698242}\n - NodeIdleSeconds: Min=54963 Mean=54967 Max=54974\n - ResourceUsage: 0.0/3.0 CPU, 0.0 GiB/1.05 GiB memory, 0.0 GiB/0.37 GiB object_store_memory\n - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0\nWorker node types:\n - rayWorkerType: 2" True None
ray,ray:2022-02-18 14:59:52,713	DEBUG legacy_info_string.py:24 -- Cluster status: 2 nodes
 - MostDelayedHeartbeats: {'10.244.0.59': 0.2449197769165039, '10.244.0.60': 0.24487924575805664, '10.244.1.86': 0.24484586715698242}
 - NodeIdleSeconds: Min=54963 Mean=54967 Max=54974
 - ResourceUsage: 0.0/3.0 CPU, 0.0 GiB/1.05 GiB memory, 0.0 GiB/0.37 GiB object_store_memory
 - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0
Worker node types:
 - rayWorkerType: 2
ray,ray:2022-02-18 14:59:52,791	DEBUG autoscaler.py:1148 -- ray-ray-worker-type-jjv77 is not being updated and passes config check (can_update=True).
ray,ray:2022-02-18 14:59:52,817	DEBUG autoscaler.py:1148 -- ray-ray-worker-type-k6r4d is not being updated and passes config check (can_update=True).
ray,ray:2022-02-18 14:59:52,843	DEBUG autoscaler.py:1148 -- ray-ray-worker-type-jjv77 is not being updated and passes config check (can_update=True).
ray,ray:2022-02-18 14:59:52,864	DEBUG autoscaler.py:1148 -- ray-ray-worker-type-k6r4d is not being updated and passes config check (can_update=True).
ray,ray:2022-02-18 14:59:52,966	DEBUG resource_demand_scheduler.py:189 -- Cluster resources: [{'CPU': 1.0, 'node:10.244.0.59': 1.0, 'memory': 375809638.0, 'object_store_memory': 135085670.0}, {'memory': 375809638.0, 'object_store_memory': 132774297.0, 'node:10.244.0.60': 1.0, 'CPU': 1.0}, {'memory': 375809638.0, 'node:10.244.1.86': 1.0, 'CPU': 1.0, 'object_store_memory': 133639372.0}]
ray,ray:2022-02-18 14:59:52,966	DEBUG resource_demand_scheduler.py:190 -- Node counts: defaultdict(<class 'int'>, {'rayHeadType': 1, 'rayWorkerType': 2})
ray,ray:2022-02-18 14:59:52,966	DEBUG resource_demand_scheduler.py:201 -- Placement group demands: []
ray,ray:2022-02-18 14:59:52,966	DEBUG resource_demand_scheduler.py:247 -- Resource demands: []
ray,ray:2022-02-18 14:59:52,966	DEBUG resource_demand_scheduler.py:248 -- Unfulfilled demands: []
ray,ray:2022-02-18 14:59:52,966	DEBUG resource_demand_scheduler.py:252 -- Final unfulfilled: []
ray,ray:2022-02-18 14:59:53,030	DEBUG resource_demand_scheduler.py:271 -- Node requests: {}
ray,ray:2022-02-18 14:59:53,092	DEBUG gcs_utils.py:253 -- internal_kv_put b'__autoscaling_status' b'{"load_metrics_report": {"usage": {"CPU": [0.0, 3.0], "object_store_memory": [0.0, 401499339.0], "node:10.244.0.59": [0.0, 1.0], "memory": [0.0, 1127428914.0], "node:10.244.0.60": [0.0, 1.0], "node:10.244.1.86": [0.0, 1.0]}, "resource_demand": [], "pg_demand": [], "request_demand": [], "node_types": [[{"node:10.244.0.59": 1.0, "CPU": 1.0, "memory": 375809638.0, "object_store_memory": 135085670.0}, 1], [{"memory": 375809638.0, "CPU": 1.0, "object_store_memory": 132774297.0, "node:10.244.0.60": 1.0}, 1], [{"memory": 375809638.0, "CPU": 1.0, "object_store_memory": 133639372.0, "node:10.244.1.86": 1.0}, 1]], "head_ip": null}, "time": 1645225192.4319715, "monitor_pid": 58, "autoscaler_report": {"active_nodes": {"rayHeadType": 1, "rayWorkerType": 2}, "pending_nodes": [], "pending_launches": {}, "failed_nodes": []}}' True None

After attempting to delete the resource with kubectl delete raycluster <raycluster-name>, I see that the resource has been marked for deletion but the command hangs.

apiVersion: cluster.ray.io/v1
kind: RayCluster
metadata:
  annotations:
    kopf.zalando.org/last-handled-configuration: |
      {"spec":{"headPodType":"rayHeadType","headStartRayCommands":["ray stop","ulimit -n 65536; ray start --head --port=6379 --no-monitor --dashboard-host 0.0.0.0"],"idleTimeoutMinutes":5,"maxWorkers":3,"podTypes":[{"maxWorkers":0,"minWorkers":0,"name":"rayHeadType","podConfig":{"apiVersion":"v1","kind":"Pod","metadata":{"generateName":"ray-head-type-"},"spec":{"containers":[{"args":["trap : TERM INT; sleep infinity & wait;"],"command":["/bin/bash","-c","--"],"env":[{"name":"RAY_gcs_server_rpc_server_thread_num","value":"1"}],"image":"rayproject/ray:latest","imagePullPolicy":"Always","name":"ray-node","ports":[{"containerPort":6379,"protocol":"TCP"},{"containerPort":10001,"protocol":"TCP"},{"containerPort":8265,"protocol":"TCP"},{"containerPort":8000,"protocol":"TCP"}],"resources":{"limits":{"cpu":1,"memory":"512Mi"},"requests":{"cpu":1,"memory":"512Mi"}},"volumeMounts":[{"mountPath":"/dev/shm","name":"dshm"}]}],"restartPolicy":"Never","volumes":[{"emptyDir":{"medium":"Memory"},"name":"dshm"}]}}},{"maxWorkers":3,"minWorkers":2,"name":"rayWorkerType","podConfig":{"apiVersion":"v1","kind":"Pod","metadata":{"generateName":"ray-worker-type-"},"spec":{"containers":[{"args":["trap : TERM INT; sleep infinity & wait;"],"command":["/bin/bash","-c","--"],"env":[{"name":"RAY_gcs_server_rpc_server_thread_num","value":"1"}],"image":"rayproject/ray:latest","imagePullPolicy":"Always","name":"ray-node","ports":[{"containerPort":6379,"protocol":"TCP"},{"containerPort":10001,"protocol":"TCP"},{"containerPort":8265,"protocol":"TCP"},{"containerPort":8000,"protocol":"TCP"}],"resources":{"limits":{"cpu":1,"memory":"512Mi"},"requests":{"cpu":1,"memory":"512Mi"}},"volumeMounts":[{"mountPath":"/dev/shm","name":"dshm"}]}],"restartPolicy":"Never","volumes":[{"emptyDir":{"medium":"Memory"},"name":"dshm"}]}}}],"upscalingSpeed":1,"workerStartRayCommands":["ray stop","ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379"]},"metadata":{"labels":{"app.kubernetes.io/managed-by":"Helm"},"annotations":{"meta.helm.sh/release-name":"ray","meta.helm.sh/release-namespace":"ray"}},"status":{"autoscalerRetries":0}}
    meta.helm.sh/release-name: ray
    meta.helm.sh/release-namespace: ray
  creationTimestamp: "2022-02-18T07:43:26Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2022-02-18T23:01:10Z"
  finalizers:
  - kopf.zalando.org/KopfFinalizerMarker
  generation: 2
  labels:
    app.kubernetes.io/managed-by: Helm
  name: ray
  namespace: ray
  resourceVersion: "80922414"
  uid: e9782156-db7f-4795-97eb-680b8b149bd5
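
As a side note, the hang of kubectl delete itself is only the client waiting for the object to disappear; passing --wait=false returns immediately, but the RayCluster stays stuck in a terminating state until the finalizer is removed. A sketch, using the names from the manifest above:

kubectl -n ray delete raycluster ray --wait=false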

I also see that the Ray operator is still monitoring the resource.

======== Autoscaler status: 2022-02-18 15:01:33.084575 ========
Node status
---------------------------------------------------------------
Healthy:
 1 rayHeadType
 2 rayWorkerType
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/3.0 CPU
 0.00/1.050 GiB memory
 0.00/0.374 GiB object_store_memory

Demands:
 (no resource demands)
ray,ray:2022-02-18 15:01:33,173	DEBUG gcs_utils.py:253 -- internal_kv_put b'__autoscaling_status_legacy' b"Cluster status: 2 nodes\n - MostDelayedHeartbeats: {'10.244.0.59': 0.18732261657714844, '10.244.0.60': 0.1872880458831787, '10.244.1.86': 0.18725895881652832}\n - NodeIdleSeconds: Min=55064 Mean=55067 Max=55075\n - ResourceUsage: 0.0/3.0 CPU, 0.0 GiB/1.05 GiB memory, 0.0 GiB/0.37 GiB object_store_memory\n - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0\nWorker node types:\n - rayWorkerType: 2" True None
ray,ray:2022-02-18 15:01:33,174	DEBUG legacy_info_string.py:24 -- Cluster status: 2 nodes
 - MostDelayedHeartbeats: {'10.244.0.59': 0.18732261657714844, '10.244.0.60': 0.1872880458831787, '10.244.1.86': 0.18725895881652832}
 - NodeIdleSeconds: Min=55064 Mean=55067 Max=55075
 - ResourceUsage: 0.0/3.0 CPU, 0.0 GiB/1.05 GiB memory, 0.0 GiB/0.37 GiB object_store_memory
 - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0
Worker node types:
 - rayWorkerType: 2
ray,ray:2022-02-18 15:01:33,262	DEBUG autoscaler.py:1148 -- ray-ray-worker-type-jjv77 is not being updated and passes config check (can_update=True).
ray,ray:2022-02-18 15:01:33,293	DEBUG autoscaler.py:1148 -- ray-ray-worker-type-k6r4d is not being updated and passes config check (can_update=True).
ray,ray:2022-02-18 15:01:33,320	DEBUG autoscaler.py:1148 -- ray-ray-worker-type-jjv77 is not being updated and passes config check (can_update=True).
ray,ray:2022-02-18 15:01:33,336	DEBUG autoscaler.py:1148 -- ray-ray-worker-type-k6r4d is not being updated and passes config check (can_update=True).
ray,ray:2022-02-18 15:01:33,444	DEBUG resource_demand_scheduler.py:189 -- Cluster resources: [{'node:10.244.0.59': 1.0, 'memory': 375809638.0, 'CPU': 1.0, 'object_store_memory': 135085670.0}, {'object_store_memory': 132774297.0, 'node:10.244.0.60': 1.0, 'CPU': 1.0, 'memory': 375809638.0}, {'object_store_memory': 133639372.0, 'memory': 375809638.0, 'node:10.244.1.86': 1.0, 'CPU': 1.0}]
ray,ray:2022-02-18 15:01:33,444	DEBUG resource_demand_scheduler.py:190 -- Node counts: defaultdict(<class 'int'>, {'rayHeadType': 1, 'rayWorkerType': 2})
ray,ray:2022-02-18 15:01:33,445	DEBUG resource_demand_scheduler.py:201 -- Placement group demands: []
ray,ray:2022-02-18 15:01:33,445	DEBUG resource_demand_scheduler.py:247 -- Resource demands: []
ray,ray:2022-02-18 15:01:33,445	DEBUG resource_demand_scheduler.py:248 -- Unfulfilled demands: []
ray,ray:2022-02-18 15:01:33,445	DEBUG resource_demand_scheduler.py:252 -- Final unfulfilled: []
ray,ray:2022-02-18 15:01:33,499	DEBUG resource_demand_scheduler.py:271 -- Node requests: {}
ray,ray:2022-02-18 15:01:33,552	DEBUG gcs_utils.py:253 -- internal_kv_put b'__autoscaling_status' b'{"load_metrics_report": {"usage": {"memory": [0.0, 1127428914.0], "object_store_memory": [0.0, 401499339.0], "node:10.244.0.59": [0.0, 1.0], "CPU": [0.0, 3.0], "node:10.244.0.60": [0.0, 1.0], "node:10.244.1.86": [0.0, 1.0]}, "resource_demand": [], "pg_demand": [], "request_demand": [], "node_types": [[{"memory": 375809638.0, "CPU": 1.0, "node:10.244.0.59": 1.0, "object_store_memory": 135085670.0}, 1], [{"memory": 375809638.0, "CPU": 1.0, "object_store_memory": 132774297.0, "node:10.244.0.60": 1.0}, 1], [{"memory": 375809638.0, "CPU": 1.0, "object_store_memory": 133639372.0, "node:10.244.1.86": 1.0}, 1]], "head_ip": null}, "time": 1645225292.8986323, "monitor_pid": 58, "autoscaler_report": {"active_nodes": {"rayHeadType": 1, "rayWorkerType": 2}, "pending_nodes": [], "pending_launches": {}, "failed_nodes": []}}' True None
ray,ray:2022-02-18 15:01:38,558	DEBUG gcs_utils.py:238 -- internal_kv_get b'autoscaler_resource_request' None
ray,ray:2022-02-18 15:01:38,705	INFO autoscaler.py:304 --
======== Autoscaler status: 2022-02-18 15:01:38.705050 ========
Node status
---------------------------------------------------------------
Healthy:
 1 rayHeadType
 2 rayWorkerType
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/3.0 CPU
 0.00/1.050 GiB memory
 0.00/0.374 GiB object_store_memory

Demands:
 (no resource demands)
ray,ray:2022-02-18 15:01:38,738	DEBUG gcs_utils.py:253 -- internal_kv_put b'__autoscaling_status_legacy' b"Cluster status: 2 nodes\n - MostDelayedHeartbeats: {'10.244.0.59': 0.14638352394104004, '10.244.0.60': 0.1463487148284912, '10.244.1.86': 0.14631962776184082}\n - NodeIdleSeconds: Min=55069 Mean=55073 Max=55080\n - ResourceUsage: 0.0/3.0 CPU, 0.0 GiB/1.05 GiB memory, 0.0 GiB/0.37 GiB object_store_memory\n - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0\nWorker node types:\n - rayWorkerType: 2" True None
ray,ray:2022-02-18 15:01:38,739	DEBUG legacy_info_string.py:24 -- Cluster status: 2 nodes
 - MostDelayedHeartbeats: {'10.244.0.59': 0.14638352394104004, '10.244.0.60': 0.1463487148284912, '10.244.1.86': 0.14631962776184082}
 - NodeIdleSeconds: Min=55069 Mean=55073 Max=55080
 - ResourceUsage: 0.0/3.0 CPU, 0.0 GiB/1.05 GiB memory, 0.0 GiB/0.37 GiB object_store_memory
 - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0
Worker node types:
 - rayWorkerType: 2
ray,ray:2022-02-18 15:01:38,821	DEBUG autoscaler.py:1148 -- ray-ray-worker-type-jjv77 is not being updated and passes config check (can_update=True).
ray,ray:2022-02-18 15:01:38,845	DEBUG autoscaler.py:1148 -- ray-ray-worker-type-k6r4d is not being updated and passes config check (can_update=True).
ray,ray:2022-02-18 15:01:38,872	DEBUG autoscaler.py:1148 -- ray-ray-worker-type-jjv77 is not being updated and passes config check (can_update=True).
ray,ray:2022-02-18 15:01:38,889	DEBUG autoscaler.py:1148 -- ray-ray-worker-type-k6r4d is not being updated and passes config check (can_update=True).
ray,ray:2022-02-18 15:01:38,996	DEBUG resource_demand_scheduler.py:189 -- Cluster resources: [{'node:10.244.0.59': 1.0, 'CPU': 1.0, 'object_store_memory': 135085670.0, 'memory': 375809638.0}, {'memory': 375809638.0, 'node:10.244.0.60': 1.0, 'CPU': 1.0, 'object_store_memory': 132774297.0}, {'object_store_memory': 133639372.0, 'memory': 375809638.0, 'CPU': 1.0, 'node:10.244.1.86': 1.0}]
ray,ray:2022-02-18 15:01:38,996	DEBUG resource_demand_scheduler.py:190 -- Node counts: defaultdict(<class 'int'>, {'rayHeadType': 1, 'rayWorkerType': 2})
ray,ray:2022-02-18 15:01:38,996	DEBUG resource_demand_scheduler.py:201 -- Placement group demands: []
ray,ray:2022-02-18 15:01:38,996	DEBUG resource_demand_scheduler.py:247 -- Resource demands: []
ray,ray:2022-02-18 15:01:38,996	DEBUG resource_demand_scheduler.py:248 -- Unfulfilled demands: []
ray,ray:2022-02-18 15:01:38,996	DEBUG resource_demand_scheduler.py:252 -- Final unfulfilled: []
ray,ray:2022-02-18 15:01:39,045	DEBUG resource_demand_scheduler.py:271 -- Node requests: {}
ray,ray:2022-02-18 15:01:39,095	DEBUG gcs_utils.py:253 -- internal_kv_put b'__autoscaling_status' b'{"load_metrics_report": {"usage": {"memory": [0.0, 1127428914.0], "CPU": [0.0, 3.0], "object_store_memory": [0.0, 401499339.0], "node:10.244.0.59": [0.0, 1.0], "node:10.244.0.60": [0.0, 1.0], "node:10.244.1.86": [0.0, 1.0]}, "resource_demand": [], "pg_demand": [], "request_demand": [], "node_types": [[{"memory": 375809638.0, "CPU": 1.0, "node:10.244.0.59": 1.0, "object_store_memory": 135085670.0}, 1], [{"memory": 375809638.0, "CPU": 1.0, "object_store_memory": 132774297.0, "node:10.244.0.60": 1.0}, 1], [{"memory": 375809638.0, "CPU": 1.0, "object_store_memory": 133639372.0, "node:10.244.1.86": 1.0}, 1]], "head_ip": null}, "time": 1645225298.5599062, "monitor_pid": 58, "autoscaler_report": {"active_nodes": {"rayHeadType": 1, "rayWorkerType": 2}, "pending_nodes": [], "pending_launches": {}, "failed_nodes": []}}' True None
ray,ray:2022-02-18 15:01:44,102	DEBUG gcs_utils.py:238 -- internal_kv_get b'autoscaler_resource_request' None
ray,ray:2022-02-18 15:01:44,262	INFO autoscaler.py:304 --
======== Autoscaler status: 2022-02-18 15:01:44.261951 ========
Node status
---------------------------------------------------------------
Healthy:
 1 rayHeadType
 2 rayWorkerType
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/3.0 CPU
 0.00/1.050 GiB memory
 0.00/0.374 GiB object_store_memory

Demands:
 (no resource demands)
ray,ray:2022-02-18 15:01:44,296	DEBUG gcs_utils.py:253 -- internal_kv_put b'__autoscaling_status_legacy' b"Cluster status: 2 nodes\n - MostDelayedHeartbeats: {'10.244.0.59': 0.15955543518066406, '10.244.0.60': 0.15950655937194824, '10.244.1.86': 0.15946292877197266}\n - NodeIdleSeconds: Min=55075 Mean=55078 Max=55086\n - ResourceUsage: 0.0/3.0 CPU, 0.0 GiB/1.05 GiB memory, 0.0 GiB/0.37 GiB object_store_memory\n - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0\nWorker node types:\n - rayWorkerType: 2" True None
ray,ray:2022-02-18 15:01:44,297	DEBUG legacy_info_string.py:24 -- Cluster status: 2 nodes
 - MostDelayedHeartbeats: {'10.244.0.59': 0.15955543518066406, '10.244.0.60': 0.15950655937194824, '10.244.1.86': 0.15946292877197266}
 - NodeIdleSeconds: Min=55075 Mean=55078 Max=55086
 - ResourceUsage: 0.0/3.0 CPU, 0.0 GiB/1.05 GiB memory, 0.0 GiB/0.37 GiB object_store_memory
 - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0
Worker node types:
 - rayWorkerType: 2
ray,ray:2022-02-18 15:01:44,376	DEBUG autoscaler.py:1148 -- ray-ray-worker-type-jjv77 is not being updated and passes config check (can_update=True).
ray,ray:2022-02-18 15:01:44,401	DEBUG autoscaler.py:1148 -- ray-ray-worker-type-k6r4d is not being updated and passes config check (can_update=True).
ray,ray:2022-02-18 15:01:44,425	DEBUG autoscaler.py:1148 -- ray-ray-worker-type-jjv77 is not being updated and passes config check (can_update=True).
ray,ray:2022-02-18 15:01:44,443	DEBUG autoscaler.py:1148 -- ray-ray-worker-type-k6r4d is not being updated and passes config check (can_update=True).
ray,ray:2022-02-18 15:01:44,543	DEBUG resource_demand_scheduler.py:189 -- Cluster resources: [{'node:10.244.0.59': 1.0, 'CPU': 1.0, 'object_store_memory': 135085670.0, 'memory': 375809638.0}, {'node:10.244.0.60': 1.0, 'object_store_memory': 132774297.0, 'memory': 375809638.0, 'CPU': 1.0}, {'memory': 375809638.0, 'CPU': 1.0, 'object_store_memory': 133639372.0, 'node:10.244.1.86': 1.0}]
ray,ray:2022-02-18 15:01:44,543	DEBUG resource_demand_scheduler.py:190 -- Node counts: defaultdict(<class 'int'>, {'rayHeadType': 1, 'rayWorkerType': 2})
ray,ray:2022-02-18 15:01:44,543	DEBUG resource_demand_scheduler.py:201 -- Placement group demands: []
ray,ray:2022-02-18 15:01:44,543	DEBUG resource_demand_scheduler.py:247 -- Resource demands: []
ray,ray:2022-02-18 15:01:44,543	DEBUG resource_demand_scheduler.py:248 -- Unfulfilled demands: []
ray,ray:2022-02-18 15:01:44,543	DEBUG resource_demand_scheduler.py:252 -- Final unfulfilled: []
ray,ray:2022-02-18 15:01:44,591	DEBUG resource_demand_scheduler.py:271 -- Node requests: {}
ray,ray:2022-02-18 15:01:44,643	DEBUG gcs_utils.py:253 -- internal_kv_put b'__autoscaling_status' b'{"load_metrics_report": {"usage": {"node:10.244.0.59": [0.0, 1.0], "CPU": [0.0, 3.0], "memory": [0.0, 1127428914.0], "object_store_memory": [0.0, 401499339.0], "node:10.244.0.60": [0.0, 1.0], "node:10.244.1.86": [0.0, 1.0]}, "resource_demand": [], "pg_demand": [], "request_demand": [], "node_types": [[{"node:10.244.0.59": 1.0, "CPU": 1.0, "memory": 375809638.0, "object_store_memory": 135085670.0}, 1], [{"memory": 375809638.0, "CPU": 1.0, "object_store_memory": 132774297.0, "node:10.244.0.60": 1.0}, 1], [{"memory": 375809638.0, "CPU": 1.0, "object_store_memory": 133639372.0, "node:10.244.1.86": 1.0}, 1]], "head_ip": null}, "time": 1645225304.1039507, "monitor_pid": 58, "autoscaler_report": {"active_nodes": {"rayHeadType": 1, "rayWorkerType": 2}, "pending_nodes": [], "pending_launches": {}, "failed_nodes": []}}' True None
ray,ray:2022-02-18 15:01:49,650	DEBUG gcs_utils.py:238 -- internal_kv_get b'autoscaler_resource_request' None
ray,ray:2022-02-18 15:01:49,793	INFO autoscaler.py:304 --
======== Autoscaler status: 2022-02-18 15:01:49.793830 ========
Node status
---------------------------------------------------------------
Healthy:
 1 rayHeadType
 2 rayWorkerType
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/3.0 CPU
 0.00/1.050 GiB memory
 0.00/0.374 GiB object_store_memory

Demands:
 (no resource demands)
ray,ray:2022-02-18 15:01:49,826	DEBUG gcs_utils.py:253 -- internal_kv_put b'__autoscaling_status_legacy' b"Cluster status: 2 nodes\n - MostDelayedHeartbeats: {'10.244.0.59': 0.14335179328918457, '10.244.0.60': 0.14330291748046875, '10.244.1.86': 0.14326024055480957}\n - NodeIdleSeconds: Min=55080 Mean=55084 Max=55091\n - ResourceUsage: 0.0/3.0 CPU, 0.0 GiB/1.05 GiB memory, 0.0 GiB/0.37 GiB object_store_memory\n - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0\nWorker node types:\n - rayWorkerType: 2" True None
ray,ray:2022-02-18 15:01:49,826	DEBUG legacy_info_string.py:24 -- Cluster status: 2 nodes
 - MostDelayedHeartbeats: {'10.244.0.59': 0.14335179328918457, '10.244.0.60': 0.14330291748046875, '10.244.1.86': 0.14326024055480957}
 - NodeIdleSeconds: Min=55080 Mean=55084 Max=55091
 - ResourceUsage: 0.0/3.0 CPU, 0.0 GiB/1.05 GiB memory, 0.0 GiB/0.37 GiB object_store_memory
 - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0
Worker node types:
 - rayWorkerType: 2
ray,ray:2022-02-18 15:01:49,900	DEBUG autoscaler.py:1148 -- ray-ray-worker-type-jjv77 is not being updated and passes config check (can_update=True).
ray,ray:2022-02-18 15:01:49,924	DEBUG autoscaler.py:1148 -- ray-ray-worker-type-k6r4d is not being updated and passes config check (can_update=True).
ray,ray:2022-02-18 15:01:49,964	DEBUG autoscaler.py:1148 -- ray-ray-worker-type-jjv77 is not being updated and passes config check (can_update=True).
ray,ray:2022-02-18 15:01:49,981	DEBUG autoscaler.py:1148 -- ray-ray-worker-type-k6r4d is not being updated and passes config check (can_update=True).
ray,ray:2022-02-18 15:01:50,092	DEBUG resource_demand_scheduler.py:189 -- Cluster resources: [{'memory': 375809638.0, 'node:10.244.0.59': 1.0, 'object_store_memory': 135085670.0, 'CPU': 1.0}, {'CPU': 1.0, 'node:10.244.0.60': 1.0, 'memory': 375809638.0, 'object_store_memory': 132774297.0}, {'CPU': 1.0, 'object_store_memory': 133639372.0, 'node:10.244.1.86': 1.0, 'memory': 375809638.0}]
ray,ray:2022-02-18 15:01:50,092	DEBUG resource_demand_scheduler.py:190 -- Node counts: defaultdict(<class 'int'>, {'rayHeadType': 1, 'rayWorkerType': 2})
ray,ray:2022-02-18 15:01:50,092	DEBUG resource_demand_scheduler.py:201 -- Placement group demands: []
ray,ray:2022-02-18 15:01:50,092	DEBUG resource_demand_scheduler.py:247 -- Resource demands: []
ray,ray:2022-02-18 15:01:50,092	DEBUG resource_demand_scheduler.py:248 -- Unfulfilled demands: []
ray,ray:2022-02-18 15:01:50,092	DEBUG resource_demand_scheduler.py:252 -- Final unfulfilled: []
ray,ray:2022-02-18 15:01:50,141	DEBUG resource_demand_scheduler.py:271 -- Node requests: {}
ray,ray:2022-02-18 15:01:50,189	DEBUG gcs_utils.py:253 -- internal_kv_put b'__autoscaling_status' b'{"load_metrics_report": {"usage": {"CPU": [0.0, 3.0], "node:10.244.0.59": [0.0, 1.0], "object_store_memory": [0.0, 401499339.0], "memory": [0.0, 1127428914.0], "node:10.244.0.60": [0.0, 1.0], "node:10.244.1.86": [0.0, 1.0]}, "resource_demand": [], "pg_demand": [], "request_demand": [], "node_types": [[{"node:10.244.0.59": 1.0, "CPU": 1.0, "memory": 375809638.0, "object_store_memory": 135085670.0}, 1], [{"memory": 375809638.0, "CPU": 1.0, "object_store_memory": 132774297.0, "node:10.244.0.60": 1.0}, 1], [{"memory": 375809638.0, "CPU": 1.0, "object_store_memory": 133639372.0, "node:10.244.1.86": 1.0}, 1]], "head_ip": null}, "time": 1645225309.65186, "monitor_pid": 58, "autoscaler_report": {"active_nodes": {"rayHeadType": 1, "rayWorkerType": 2}, "pending_nodes": [], "pending_launches": {}, "failed_nodes": []}}' True None

I would, however, expect to see something like this:

2022-02-16 16:57:26,326 VINFO scripts.py:853 -- Send termination request to `"/home/ray/anaconda3/lib/python3.9/site-packages/ray/core/src/ray/thirdparty/redis/src/redis-server *:50343" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""` (via SIGTERM)
2022-02-16 16:57:26,328 VINFO scripts.py:853 -- Send termination request to `/home/ray/anaconda3/lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name=/tmp/ray/session_2022-02-16_16-41-20_595437_116/sockets/raylet --store_socket_name=/tmp/ray/session_2022-02-16_16-41-20_595437_116/sockets/plasma_store --object_manager_port=0 --min_worker_port=10002 --max_worker_port=19999 --node_manager_port=0 --node_ip_address=10.1.0.34 --redis_address=10.1.0.34 --redis_port=6379 --maximum_startup_concurrency=1 --static_resource_list=node:10.1.0.34,1.0,memory,367001600,object_store_memory,137668608 "--python_worker_command=/home/ray/anaconda3/bin/python /home/ray/anaconda3/lib/python3.9/site-packages/ray/workers/setup_worker.py /home/ray/anaconda3/lib/python3.9/site-packages/ray/workers/default_worker.py --node-ip-address=10.1.0.34 --node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --object-store-name=/tmp/ray/session_2022-02-16_16-41-20_595437_116/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-02-16_16-41-20_595437_116/sockets/raylet --redis-address=10.1.0.34:6379 --temp-dir=/tmp/ray --metrics-agent-port=45522 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 RAY_WORKER_DYNAMIC_OPTION_PLACEHOLDER --redis-password=5241590000000000" --java_worker_command= "--cpp_worker_command=/home/ray/anaconda3/lib/python3.9/site-packages/ray/cpp/default_worker --ray_plasma_store_socket_name=/tmp/ray/session_2022-02-16_16-41-20_595437_116/sockets/plasma_store --ray_raylet_socket_name=/tmp/ray/session_2022-02-16_16-41-20_595437_116/sockets/raylet --ray_node_manager_port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --ray_address=10.1.0.34:6379 --ray_redis_password=5241590000000000 --ray_session_dir=/tmp/ray/session_2022-02-16_16-41-20_595437_116 --ray_logs_dir=/tmp/ray/session_2022-02-16_16-41-20_595437_116/logs --ray_node_ip_address=10.1.0.34 RAY_WORKER_DYNAMIC_OPTION_PLACEHOLDER" --native_library_path=/home/ray/anaconda3/lib/python3.9/site-packages/ray/cpp/lib --redis_password=5241590000000000 --temp_dir=/tmp/ray --session_dir=/tmp/ray/session_2022-02-16_16-41-20_595437_116 --log_dir=/tmp/ray/session_2022-02-16_16-41-20_595437_116/logs --resource_dir=/tmp/ray/session_2022-02-16_16-41-20_595437_116/runtime_resources --metrics-agent-port=45522 --metrics_export_port=43650 --object_store_memory=137668608 --plasma_directory=/dev/shm --ray-debugger-external=0 "--agent_command=/home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.9/site-packages/ray/dashboard/agent.py --node-ip-address=10.1.0.34 --redis-address=10.1.0.34:6379 --metrics-export-port=43650 --dashboard-agent-port=45522 --listen-port=0 --node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --object-store-name=/tmp/ray/session_2022-02-16_16-41-20_595437_116/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-02-16_16-41-20_595437_116/sockets/raylet --temp-dir=/tmp/ray --session-dir=/tmp/ray/session_2022-02-16_16-41-20_595437_116 --runtime-env-dir=/tmp/ray/session_2022-02-16_16-41-20_595437_116/runtime_resources --log-dir=/tmp/ray/session_2022-02-16_16-41-20_595437_116/logs --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --redis-password=5241590000000000"` (via SIGTERM)

Versions / Dependencies

Kubernetes: v1.21
Ray/Python versions: from the rayproject/ray:latest image

Reproduction script

Running helm install ray <path-to-helm-chart> and leaving the Ray cluster idle for a while reproduces the issue in my case; see the details above and the sketch below.
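
A minimal sketch of those reproduction steps, keeping the chart path as a placeholder and using the release and namespace names that appear in the manifests above:

# install the chart into the ray namespace
helm -n ray install ray <path-to-helm-chart> --create-namespace
# wait for the head and worker pods to come up, then leave the cluster idle for a while
kubectl -n ray get pods
# this command hangs while the operator is still running
kubectl -n ray delete raycluster ray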

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

2 reactions
DmitriGekhtman commented, Jun 21, 2022

@AnanthCaspex that’s reasonable.

I’ll drop the experimental label on KubeRay from the Ray docs. Indeed, KubeRay has been performing stably for large-scale internal use-cases at Microsoft and ByteDance for ~1.5 years. KubeRay was labeled “experimental” when we (the core Ray team) were just starting to get familiar with it.

Some features (namely autoscaling) are indeed experimental with KubeRay, but that’s already addressed explicitly in the KubeRay docs.

1 reaction
DmitriGekhtman commented, Mar 1, 2022

I’d actually recommend taking a look into the KubeRay project, as that will form the basis for Ray’s preferred K8s support in the future. https://github.com/ray-project/kuberay


