Backend SC caused DB pod restarts; the pod never came to the Running state
Environment info
- NooBaa Version: master-20210627
- Platform: OCP 4.6.16
Actual behavior
- Upgrading to master-20210627 caused the DB pod to crash
Expected behavior
1. DB pod shouldn't crash
Steps to reproduce
- Old code - master-20210622
- Upgraded to master-20210627 in order to retain accounts and buckets
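For reference, an upgrade like this is typically applied by pointing the NooBaa custom resource at the new core image. The commands below are only a sketch, not the exact procedure used here; they assume the default CR name "noobaa" in the "noobaa" namespace and that the operator rolls the core, DB and endpoint pods after the patch:

# Sketch: point the NooBaa CR at the new core image and watch the rollout
oc patch noobaa noobaa -n noobaa --type merge -p '{"spec":{"image":"noobaa/noobaa-core:master-20210627"}}'
oc get pods -n noobaa -w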
[root@ocp-akshat-1-inf akshat]# oc logs noobaa-db-pg-0 |more
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start....2021-06-28 05:27:42.521 UTC [25] PANIC: could not read file "global/pg_control": Input/output error
stopped waiting
pg_ctl: could not start server
Examine the log output.
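The PANIC on global/pg_control points at the storage layer rather than at PostgreSQL itself: pg_control is a small file the server must read at startup, so an Input/output error there almost always means the volume backing the data directory is unhealthy. A minimal way to check this from a debug copy of the pod is sketched below; it assumes the sclorg image's default data directory /var/lib/pgsql/data/userdata, so adjust the path if the PVC is mounted elsewhere:

# Sketch: try to stat and read pg_control directly; an I/O error here confirms a bad mount, not a DB bug
oc debug pod/noobaa-db-pg-0 -n noobaa -- ls -l /var/lib/pgsql/data/userdata/global/pg_control
oc debug pod/noobaa-db-pg-0 -n noobaa -- dd if=/var/lib/pgsql/data/userdata/global/pg_control of=/dev/null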
[root@ocp-akshat-1-inf akshat]# oc logs noobaa-db-pg-0 |less -R
[root@ocp-akshat-1-inf akshat]# podn
NAME                                               READY   STATUS        RESTARTS   AGE
noobaa-core-0                                      1/1     Running       0          22m
noobaa-db-pg-0                                     0/1     Error         2          22m
noobaa-default-backing-store-noobaa-pod-b0a5d78b   0/1     Terminating   0          4d23h
noobaa-endpoint-6886745f66-rdd4m                   1/1     Running       0          22m
noobaa-operator-57d449689c-zb56f                   1/1     Running       0          22m
[root@ocp-akshat-1-inf akshat]# oc logs noobaa-db-pg-0 -p |less -R
[root@ocp-akshat-1-inf akshat]# podn
NAME                                               READY   STATUS             RESTARTS   AGE
noobaa-core-0                                      1/1     Running            0          22m
noobaa-db-pg-0                                     0/1     CrashLoopBackOff   2          22m
noobaa-default-backing-store-noobaa-pod-b0a5d78b   0/1     Terminating        0          4d23h
noobaa-endpoint-6886745f66-rdd4m                   1/1     Running            0          23m
noobaa-operator-57d449689c-zb56f                   1/1     Running            0          23m
QoS Class: Guaranteed
Node-Selectors: <none>
Tolerations: node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 21m default-scheduler Successfully assigned noobaa/noobaa-db-pg-0 to worker2.ocp-akshat-1.cp.fyre.ibm.com
Normal AddedInterface 21m multus Add eth0 [10.254.17.98/22]
Normal Pulling 21m kubelet Pulling image "noobaa/noobaa-core:master-20210627"
Normal Pulled 20m kubelet Successfully pulled image "noobaa/noobaa-core:master-20210627" in 30.126640313s
Normal Created 20m kubelet Created container init
Normal Started 20m kubelet Started container init
Warning Failed 2m34s (x4 over 3m14s) kubelet Error: failed to resolve symlink "/var/lib/kubelet/pods/ce44b338-0155-430c-97d7-5408c230e0b4/volumes/kubernetes.io~csi/pvc-d1c22d45-5f3b-4684-8f4c-48880815f451/mount": lstat /var/mnt/fs1: stale NFS file handle
Normal Pulled 104s (x6 over 20m) kubelet Container image "centos/postgresql-12-centos7" already present on machine
Normal Created 103s (x2 over 20m) kubelet Created container db
Normal Started 103s (x2 over 20m) kubelet Started container db
Warning BackOff 11s (x11 over 2m21s) kubelet Back-off restarting failed container
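The "stale NFS file handle" on /var/mnt/fs1 in the events above is the real failure: the kubelet cannot resolve the CSI mount for the DB PVC because the Spectrum Scale filesystem on that node has gone stale. This can be confirmed directly on the node; the commands below are a sketch, with the node name and volume ID taken from the events:

# Sketch: inspect the stale mount from a node debug shell
oc debug node/worker2.ocp-akshat-1.cp.fyre.ibm.com
chroot /host
mount | grep pvc-d1c22d45-5f3b-4684-8f4c-48880815f451
ls /var/mnt/fs1    # expected to fail with "Stale file handle" while the backend filesystem is down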
More information - Screenshots / Logs / Other output

@nimrod-becker Here is the list of PVCs:
NAME                                               STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                     AGE
db-noobaa-db-pg-0                                  Bound    pvc-0f3ae11a-971c-4480-9398-d3f37fb145a8   50Gi       RWO            ibm-spectrum-scale-csi-fileset   6d21h
gpfs-vol-pvc-new1                                  Bound    gpfs-pv-3                                  250Gi      RWX                                             4d20h
gpfs-vol-pvc-new11                                 Bound    gpfs-pv-31                                 250Gi      RWX                                             3d16h
noobaa-default-backing-store-noobaa-pvc-1ff51808   Bound    pvc-c164a9ac-1855-4788-8219-a2f2ab8ce831   50Gi       RWO            ibm-spectrum-scale-csi-fileset   6d21h
[root@api.ns.cp.fyre.ibm.com ~]#
gpfs-vol-pvc-new1 is used for the endpoint pod.
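The PVC-to-pod mapping can be double-checked with a jsonpath query rather than reading each pod spec by hand (a sketch; it prints one line per pod followed by the claims that pod mounts):

# Sketch: list which PVCs each pod in the namespace mounts
oc get pods -n noobaa -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.volumes[*].persistentVolumeClaim.claimName}{"\n"}{end}'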
We can close this bug based on the above finding.
Summary: The NooBaa DB pod requires the backend NSDs to be up. In this Fyre environment not all NSDs were coming up, so the NooBaa DB pod stayed in a crash loop. Once the NSDs were brought up manually, the NooBaa DB pod reached the Running state and I/O could continue.
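In practical terms, recovery on the Spectrum Scale side looks roughly like the sketch below; it assumes the filesystem is named fs1 (matching the /var/mnt/fs1 mount seen in the events) and that the mm* commands are run on a Scale node with admin privileges:

# Sketch: check cluster, NSD and disk state
mmgetstate -a
mmlsnsd
mmlsdisk fs1
# Once the NSD servers are back, start any disks marked down and remount the filesystem everywhere
mmchdisk fs1 start -a
mmmount fs1 -a
# Finally, let Kubernetes recreate the DB pod so it gets a fresh, healthy mount
oc delete pod noobaa-db-pg-0 -n noobaa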