Operator pod panics and restarts when shutting down the node on which the endpoint is scheduled
Environment info
```
[root@api.ns.cp.fyre.ibm.com ~]# oc version
Client Version: 4.7.13
Server Version: 4.7.13
Kubernetes Version: v1.20.0+df9c838
[root@api.ns.cp.fyre.ibm.com ~]# noobaa version
INFO[0000] CLI version: 5.9.0
INFO[0000] noobaa-image: noobaa/noobaa-core:master-20210719
INFO[0000] operator-image: noobaa/noobaa-operator:5.9.0
[root@api.ns.cp.fyre.ibm.com ~]#
```
Actual behavior
- Operator pod panics and restarts when the node on which the endpoint is scheduled is shut down
Expected behavior
- No panic should appear in the operator logs, and the operator pod should not restart
Steps to reproduce
- Install noobaa and start a copy-object operation into a bucket (see the sketch after these steps)
- While the copy operation is running, shut down the node on which noobaa is installed (only the endpoint pod was scheduled on that node, no other noobaa pods)
- Start the node back up
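A minimal sketch of the copy operation in the first step, assuming the AWS SDK for Go v1 pointed at the NooBaa S3 endpoint; the endpoint URL, credentials, and object names below are hypothetical placeholders (`first.bucket` is only NooBaa's conventional default bucket name):

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/credentials"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	// Endpoint and credentials are hypothetical; in practice they come
	// from the NooBaa S3 route/service and `noobaa status` output.
	sess := session.Must(session.NewSession(&aws.Config{
		Endpoint:         aws.String("https://s3.example.com"),
		Region:           aws.String("us-east-1"),
		Credentials:      credentials.NewStaticCredentials("ACCESS_KEY", "SECRET_KEY", ""),
		S3ForcePathStyle: aws.Bool(true),
	}))
	svc := s3.New(sess)

	// Server-side copy of an existing object into the target bucket.
	_, err := svc.CopyObject(&s3.CopyObjectInput{
		Bucket:     aws.String("first.bucket"),
		CopySource: aws.String("source-bucket/large-object"),
		Key:        aws.String("copied-object"),
	})
	if err != nil {
		log.Fatalf("copy failed: %v", err)
	}
}
```

Keeping a copy like this in flight (or looping it over many objects) while the node shuts down is what exercises the endpoint during the outage.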
Info:
```
[root@api.ns.cp.fyre.ibm.com ~]# oc get pod -o wide
NAME                                               READY   STATUS        RESTARTS   AGE   IP             NODE
noobaa-core-0                                      1/1     Running       0          73m   10.254.3.167   master1.ns.cp.fyre.ibm.com
noobaa-db-pg-0                                     1/1     Running       0          60m   10.254.4.17    master0.ns.cp.fyre.ibm.com
noobaa-default-backing-store-noobaa-pod-62daf8d7   0/1     Terminating   0          38m                  master2.ns.cp.fyre.ibm.com
noobaa-endpoint-565dbbd667-gfzt2                   1/1     Running       0          74m   10.254.4.14    master0.ns.cp.fyre.ibm.com
noobaa-operator-6d54447bc5-hr7sb                   1/1     Running       1          19h   10.254.3.136   master1.ns.cp.fyre.ibm.com
[root@api.ns.cp.fyre.ibm.com ~]#
```
More information - Screenshots / Logs / Other output

@Igor and I discussed a solution: instead of panicking immediately when encountering an unknown error, the operator would return a temporary error and the reconcile would requeue. If the error recurs several times, the operator would then panic. @nimrod-becker WDYT?
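A minimal sketch of that idea, assuming a controller-runtime style Reconcile loop; the `maxUnknownErrRetries` threshold, the counter, and the helper stubs are hypothetical names, not actual noobaa-operator code:

```go
package controllers

import (
	"context"
	"fmt"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// maxUnknownErrRetries is a hypothetical threshold: only after this many
// consecutive unknown errors does the operator fall back to panicking.
const maxUnknownErrRetries = 5

// Reconciler is a stand-in for the operator's real reconciler.
type Reconciler struct {
	unknownErrCount int // consecutive unknown-error counter (hypothetical)
}

func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	err := r.runReconcilePhases(ctx, req) // placeholder for the real reconcile work
	if err == nil {
		r.unknownErrCount = 0 // success resets the counter
		return ctrl.Result{}, nil
	}
	if isKnownError(err) {
		// Expected/recoverable errors keep their current handling: requeue.
		return ctrl.Result{RequeueAfter: 10 * time.Second}, nil
	}
	// Unknown error: instead of panicking immediately, treat it as temporary
	// and requeue; only panic once it has recurred enough times to rule out
	// a transient condition such as a node shutdown.
	r.unknownErrCount++
	if r.unknownErrCount >= maxUnknownErrRetries {
		panic(fmt.Sprintf("unknown error persisted across %d reconciles: %v", r.unknownErrCount, err))
	}
	return ctrl.Result{}, fmt.Errorf("temporary (attempt %d): %w", r.unknownErrCount, err)
}

// Stubs so the sketch compiles; the real operator has its own versions.
func (r *Reconciler) runReconcilePhases(ctx context.Context, req ctrl.Request) error { return nil }
func isKnownError(err error) bool                                                    { return false }
```

Returning a non-nil error makes controller-runtime requeue the request with exponential backoff, so the panic would only fire once the error has clearly stopped being transient.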
AFAIU it’s not a recurring panic, and after the operator restarted it did not happen again. @nehasharma5 am I right?
If so, I think we should keep the panic and not change it. The panic is there to avoid silent failures when encountering unknown errors. If we see that this specific error repeats in many cases, maybe we can ignore it specifically (roughly as sketched below), but I wouldn't remove the panic entirely. @igorpick @nimrod-becker WDYT?
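If it came to ignoring this one error specifically, it could look roughly like this; the `ErrEndpointNodeDown` sentinel and `handleUnknownError` helper are assumed names for illustration, not noobaa-operator identifiers:

```go
package controllers

import (
	"errors"
	"fmt"
	"log"
)

// ErrEndpointNodeDown is a hypothetical sentinel for the one error
// observed to be transient when an endpoint node shuts down.
var ErrEndpointNodeDown = errors.New("endpoint node is shutting down")

func handleUnknownError(err error) {
	// Whitelist only the known transient error; the next reconcile
	// retries once the node is back.
	if errors.Is(err, ErrEndpointNodeDown) {
		log.Printf("ignoring known transient error: %v", err)
		return
	}
	// Everything else stays a loud failure so genuinely unknown
	// errors are never silently swallowed.
	panic(fmt.Sprintf("unknown error during reconcile: %v", err))
}
```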