[Bug] Failure to delete namespace containing RayCluster with kopf finalizer

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Ray Clusters

What happened + What you expected to happen

When using namespaced operators, it is not possible to directly delete a namespace that contains a Ray cluster and its associated operator. After the namespace is deleted, the pods are removed, but the RayCluster object remains and the namespace stays stuck in the “Terminating” state.

We recently switched to using namespaced Ray operators. We are using GKE to host the Ray clusters and Ray version 1.9.1. When launching a new Ray cluster, we create a namespace containing a RayCluster object and associated operator, based on this namespaced operator template.

Deleting the namespace works if the RayCluster is first patched to remove the finalizer (using an approach such as this one) and then deleted before the namespace itself is deleted.

We did not have this problem in the past when using a cluster-level Ray operator (common to the whole Kubernetes cluster).

Versions / Dependencies

  • Python 3.7.9
  • Ray 1.9.1
  • Ubuntu 18.04

Reproduction script

N/A, see description

An example configuration of a RayCluster k8s resource with finalizer is as follows (cropped output):

Name:         olivier-test
Namespace:    merlin-olivier-test
Labels:       app=ray
              maxAgeHours=8
              owner=olivier.labreche
              workspace=olivier-test
Annotations:  kopf.zalando.org/last-handled-configuration:
                {"spec":{"headPodType":"head-node","headServicePorts":[{"name":"client","port":10001,"targetPort":10001},{"name":"dashboard","port":8265,"...
API Version:  cluster.ray.io/v1
Kind:         RayCluster
Metadata:
  Creation Timestamp:             2022-01-18T21:33:20Z
  Deletion Grace Period Seconds:  0
  Deletion Timestamp:             2022-01-18T21:52:51Z
  Finalizers:
    kopf.zalando.org/KopfFinalizerMarker
  Generation:  2
  Managed Fields:
    API Version:  cluster.ray.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .:
          f:app:
          f:maxAgeHours:
          f:owner:
          f:workspace:
      f:spec:
        .:
        f:headPodType:
        f:headServicePorts:
        f:headStartRayCommands:
        f:idleTimeoutMinutes:
        f:maxWorkers:
        f:podTypes:
        f:upscalingSpeed:
        f:workerStartRayCommands:
      f:status:
        .:
        f:autoscalerRetries:
        f:phase:
    Manager:      OpenAPI-Generator
    Operation:    Update
    Time:         2022-01-18T21:33:22Z
    API Version:  cluster.ray.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kopf.zalando.org/last-handled-configuration:
        f:finalizers:
      f:status:
        f:kopf:
          .:
          f:progress:
    Manager:         kopf
    Operation:       Update
    Time:            2022-01-18T21:34:44Z
  Resource Version:  149236356
  UID:               a2da6e0f-7f1f-4da8-81ea-01b2a70569a5

<cropped output>

Anything else

This problem always occurs if the finalizer is not patched out first.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

1 reaction
DmitriGekhtman commented, Jan 21, 2022

Same cause as https://github.com/ray-project/ray/issues/18212.

If the operator is deleted before the RayCluster object, the RayCluster object is left stuck with a finalizer that must be removed either manually or by restarting the operator.

The workaround is to delete all RayCluster objects before deleting the namespace.
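To check whether any RayCluster is in the stuck state described here, one option is to look for objects that have both a deletion timestamp and a non-empty finalizer list. The sketch below assumes the input is the JSON printed by something like `kubectl get rayclusters.cluster.ray.io -A -o json`; the function name `stuck_objects` is hypothetical.

```python
import json
from typing import List


def stuck_objects(kubectl_json: str) -> List[str]:
    """Return "namespace/name" for objects marked for deletion but held by finalizers.

    Expects the JSON list document produced by `kubectl get ... -o json`.
    """
    items = json.loads(kubectl_json).get("items", [])
    stuck = []
    for item in items:
        meta = item.get("metadata", {})
        # deletionTimestamp is set once deletion has been requested; the object
        # stays around for as long as metadata.finalizers is non-empty.
        if meta.get("deletionTimestamp") and meta.get("finalizers"):
            stuck.append(f"{meta.get('namespace')}/{meta.get('name')}")
    return stuck


if __name__ == "__main__":
    # Sample document built from the metadata shown in this issue.
    sample = json.dumps({
        "items": [{
            "metadata": {
                "name": "olivier-test",
                "namespace": "merlin-olivier-test",
                "deletionTimestamp": "2022-01-18T21:52:51Z",
                "finalizers": ["kopf.zalando.org/KopfFinalizerMarker"],
            }
        }]
    })
    print(stuck_objects(sample))
```

Any object reported this way needs either a running operator to clear its finalizer or a manual patch.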

0 reactions
olivierlabreche commented, Feb 17, 2022

That sounds like a bug. If the operator is up, it should be able to remove the finalizer. Could you share operator logs from when the operator fails to remove the finalizer?

Actually @DmitriGekhtman the reason why we are patching out the finalizer is that it will otherwise prevent deleting the RayCluster if a pod is in ImagePullBackOff state.
