[Bug] Failure to delete namespace containing RayCluster with kopf finalizer
Search before asking
- I searched the issues and found no similar issues.
Ray Component
Ray Clusters
What happened + What you expected to happen
It is not possible to directly delete a namespace containing a Ray cluster and its associated operator when using namespaced operators. After the namespace is deleted, the pods are removed, but the RayCluster object remains and the namespace stays stuck in the “Terminating” state.
We recently switched to using namespaced Ray operators. We host the Ray clusters on GKE and use Ray version 1.9.1. When launching a new Ray cluster, we create a namespace containing a RayCluster object and its associated operator, based on this namespaced operator template.
Deleting the namespace works if the RayCluster is first patched to remove the finalizer, using an approach such as this one, and then deleted before the namespace is deleted.
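The patch-then-delete sequence described above can be sketched roughly as follows. This is a minimal sketch, not the approach linked above: it assumes `kubectl` is on the PATH and that the CRD is addressable as `rayclusters.cluster.ray.io` (an assumption derived from the `cluster.ray.io/v1` API version shown in this issue). The function names are hypothetical. Note that clearing the finalizer skips whatever cleanup the operator would otherwise perform.

```python
import json
import subprocess

# Assumption: resource name of the RayCluster CRD in API group cluster.ray.io.
RESOURCE = "rayclusters.cluster.ray.io"


def finalizer_patch_command(name, namespace, resource=RESOURCE):
    """Build the kubectl argv that clears metadata.finalizers with a merge patch."""
    patch = json.dumps({"metadata": {"finalizers": None}})
    return ["kubectl", "patch", resource, name, "-n", namespace,
            "--type", "merge", "-p", patch]


def delete_command(name, namespace, resource=RESOURCE):
    """Build the kubectl argv that deletes the RayCluster object.

    --ignore-not-found covers the case where the object was already marked
    for deletion and disappears as soon as the finalizer is removed.
    """
    return ["kubectl", "delete", resource, name, "-n", namespace,
            "--ignore-not-found"]


def force_delete_raycluster(name, namespace, run=subprocess.run):
    """Patch out the finalizer, then delete; `run` can be swapped for a dry run."""
    for argv in (finalizer_patch_command(name, namespace),
                 delete_command(name, namespace)):
        run(argv, check=True)
```

With the names from the output in this issue, a call would look like `force_delete_raycluster("olivier-test", "merlin-olivier-test")`, followed by deleting the namespace itself.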
We did not have this problem in the past when using a cluster-level Ray operator (common to the whole Kubernetes cluster).
Versions / Dependencies
Python 3.7.9, Ray 1.9.1, Ubuntu 18.04
Reproduction script
N/A, see description
An example configuration of a RayCluster k8s resource with the finalizer is as follows (cropped output):
Name:         olivier-test
Namespace:    merlin-olivier-test
Labels:       app=ray
              maxAgeHours=8
              owner=olivier.labreche
              workspace=olivier-test
Annotations:  kopf.zalando.org/last-handled-configuration:
                {"spec":{"headPodType":"head-node","headServicePorts":[{"name":"client","port":10001,"targetPort":10001},{"name":"dashboard","port":8265,"...
API Version:  cluster.ray.io/v1
Kind:         RayCluster
Metadata:
  Creation Timestamp:             2022-01-18T21:33:20Z
  Deletion Grace Period Seconds:  0
  Deletion Timestamp:             2022-01-18T21:52:51Z
  Finalizers:
    kopf.zalando.org/KopfFinalizerMarker
  Generation:  2
  Managed Fields:
    API Version:  cluster.ray.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .:
          f:app:
          f:maxAgeHours:
          f:owner:
          f:workspace:
      f:spec:
        .:
        f:headPodType:
        f:headServicePorts:
        f:headStartRayCommands:
        f:idleTimeoutMinutes:
        f:maxWorkers:
        f:podTypes:
        f:upscalingSpeed:
        f:workerStartRayCommands:
      f:status:
        .:
        f:autoscalerRetries:
        f:phase:
    Manager:      OpenAPI-Generator
    Operation:    Update
    Time:         2022-01-18T21:33:22Z
    API Version:  cluster.ray.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kopf.zalando.org/last-handled-configuration:
        f:finalizers:
      f:status:
        f:kopf:
          .:
          f:progress:
    Manager:         kopf
    Operation:       Update
    Time:            2022-01-18T21:34:44Z
  Resource Version:  149236356
  UID:               a2da6e0f-7f1f-4da8-81ea-01b2a70569a5
<cropped output>
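The output above shows a deletion timestamp alongside a remaining finalizer, which is the signature of an object stuck in deletion. Below is a minimal sketch for detecting that state. It assumes the object has been loaded as a dict, for example from `kubectl get ... -o json`, where the fields are named `metadata.deletionTimestamp` and `metadata.finalizers`. The helper name `stuck_finalizers` is hypothetical.

```python
def stuck_finalizers(resource):
    """Return the finalizers still blocking deletion of a Kubernetes object.

    An object is only considered stuck if deletion has been requested
    (deletionTimestamp is set) and at least one finalizer remains.
    """
    meta = resource.get("metadata", {})
    if not meta.get("deletionTimestamp"):
        return []  # Deletion was never requested, so nothing is blocked.
    return list(meta.get("finalizers") or [])


# Values taken from the cropped output in this issue.
example = {
    "kind": "RayCluster",
    "metadata": {
        "name": "olivier-test",
        "namespace": "merlin-olivier-test",
        "deletionTimestamp": "2022-01-18T21:52:51Z",
        "finalizers": ["kopf.zalando.org/KopfFinalizerMarker"],
    },
}
blocking = stuck_finalizers(example)
```

Here `blocking` contains the kopf finalizer marker, so the RayCluster cannot be removed until that finalizer is cleared.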
Anything else
This problem always occurs unless the finalizer is first patched out.
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
Issue Analytics
- Created 2 years ago
- Comments:7 (3 by maintainers)
Top GitHub Comments
Same cause as https://github.com/ray-project/ray/issues/18212.
If the operator is deleted before the RayCluster object, the RayCluster object is stuck with a finalizer that must be removed either manually or by restarting the operator.
The work-around is to delete all RayCluster objects before deleting the namespace.
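The ordering in this work-around can be sketched as below. This is an illustrative sketch under assumptions, not an official procedure: it assumes `kubectl` is available, that the operator in the namespace is still running so it can process the finalizer, and that the CRD is addressable as `rayclusters.cluster.ray.io`. The function names are hypothetical.

```python
import subprocess

# Assumption: resource name of the RayCluster CRD in API group cluster.ray.io.
RESOURCE = "rayclusters.cluster.ray.io"


def teardown_commands(namespace, resource=RESOURCE):
    """Return kubectl argv lists in the order they should be executed."""
    return [
        # 1. Delete every RayCluster while the operator is still alive,
        #    and wait until the objects are actually gone.
        ["kubectl", "delete", resource, "--all", "-n", namespace, "--wait=true"],
        # 2. Only then delete the namespace, which also removes the operator.
        ["kubectl", "delete", "namespace", namespace],
    ]


def teardown_namespace(namespace, run=subprocess.run):
    """Execute the teardown in order, stopping at the first failing command."""
    for argv in teardown_commands(namespace):
        run(argv, check=True)
```

Because the first command waits for the RayCluster objects to disappear, the namespace deletion starts only after the operator has had the chance to release its finalizer.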
Actually @DmitriGekhtman, the reason we are patching out the finalizer is that it otherwise prevents deleting the RayCluster if a pod is in the ImagePullBackOff state.