
noobaa-db-pg pod doesn't migrate when the node has Kubelet service stopped, says PVC can't be moved


Environment info

  • NooBaa Version: RC code of ODF 4.9.0 (CLI version 5.9.0)
  • Platform: OpenShift 4.9.5 (Kubernetes v1.22.0-rc.0)

noobaa status
INFO[0000] CLI version: 5.9.0
INFO[0000] noobaa-image: quay.io/rhceph-dev/mcg-core@sha256:6ce2ddee7aff6a0e768fce523a77c998e1e48e25d227f93843d195d65ebb81b9
INFO[0000] operator-image: quay.io/rhceph-dev/mcg-operator@sha256:cc293c7fe0fdfe3812f9d1af30b6f9c59e97d00c4727c4463a5b9d3429f4278e
INFO[0000] noobaa-db-image: registry.redhat.io/rhel8/postgresql-12@sha256:b3e5b7bc6acd6422f928242d026171bcbed40ab644a2524c84e8ccb4b1ac48ff
INFO[0000] Namespace: openshift-storage

oc version
Client Version: 4.9.5
Server Version: 4.9.5
Kubernetes Version: v1.22.0-rc.0+a44d0f0

Actual behavior

After the kubelet is stopped on the node hosting noobaa-db-pg-0, the pod is rescheduled to another worker but stays stuck in the Init state: its PVC cannot be detached from the original node (Multi-Attach error), so it never attaches on the new node.

Expected behavior

The noobaa-db-pg pod should migrate to another worker node, with its PVC detached from the failed node and reattached on the new one.

Steps to reproduce

Step 1: Configured MetalLB on the cluster (which shouldn't matter for this problem, but is noted for completeness). The NooBaa core/db pods and the 3 endpoint pods are running on the respective worker nodes as shown below:

NAME                                               READY   STATUS    RESTARTS      AGE    IP              NODE                                  NOMINATED NODE   READINESS GATES
noobaa-core-0                                      1/1     Running   0             20d    10.254.14.77    worker1.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
noobaa-db-pg-0                                     1/1     Running   3 (17d ago)   20d    10.254.18.0     worker0.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
noobaa-default-backing-store-noobaa-pod-a1bf952a   1/1     Running   0             20d    10.254.18.4     worker0.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
noobaa-endpoint-bfffdd599-7jzdf                    1/1     Running   0             3d1h   10.254.20.43    worker2.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
noobaa-endpoint-bfffdd599-gxz5h                    1/1     Running   0             3d4h   10.254.15.112   worker1.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
noobaa-endpoint-bfffdd599-mbfrj                    1/1     Running   0             3d4h   10.254.17.208   worker0.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
noobaa-operator-5c46775cdd-vplhr                   1/1     Running   0             31d    10.254.16.22    worker0.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
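
The listing above (and those in the later steps) can presumably be reproduced with a wide pod listing against the openshift-storage namespace; this is a minimal sketch, not a command copied from the original report:

oc get pods -n openshift-storage -o wide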

Step 2: Stopped the kubelet service on the node where the noobaa-db-pg pod is running (worker0):

[core@worker0 ~]$ sudo systemctl stop kubelet
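
A quick way to confirm the stop has taken effect (a sketch, not part of the original report) is to watch the node flip to NotReady and to check the unit status on the node itself:

# from a machine with cluster access: worker0 should go NotReady after a short delay
oc get nodes -w

# on worker0 itself: the kubelet unit should show inactive (dead)
sudo systemctl status kubelet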

Step 3: The noobaa-db-pg pod starts migrating from worker0 to worker2, the noobaa operator restarts, and the noobaa endpoint from worker0 goes into the Pending state, as expected:

NAME                                               READY   STATUS              RESTARTS   AGE    IP              NODE                                  NOMINATED NODE   READINESS GATES
noobaa-core-0                                      1/1     Running             0          20d    10.254.14.77    worker1.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
noobaa-db-pg-0                                     0/1     Init:0/2            0          6s     <none>          worker2.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
noobaa-endpoint-bfffdd599-7jzdf                    1/1     Running             0          3d1h   10.254.20.43    worker2.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
noobaa-endpoint-bfffdd599-gxz5h                    1/1     Running             0          3d4h   10.254.15.112   worker1.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
noobaa-endpoint-bfffdd599-wlktz                    0/1     Pending             0          6s     <none>          <none>                                <none>           <none>
noobaa-operator-5c46775cdd-9mgxt                   0/1     ContainerCreating   0          6s     <none>          worker2.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>

Step 4: The noobaa-db-pg pod continues to sit in the Init state on worker2:

NAME                                               READY   STATUS     RESTARTS   AGE     IP              NODE                                  NOMINATED NODE   READINESS GATES
noobaa-core-0                                      1/1     Running    0          20d     10.254.14.77    worker1.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
noobaa-db-pg-0                                     0/1     Init:0/2   0          7m52s   <none>          worker2.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
noobaa-endpoint-bfffdd599-7jzdf                    1/1     Running    0          3d1h    10.254.20.43    worker2.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
noobaa-endpoint-bfffdd599-gxz5h                    1/1     Running    0          3d4h    10.254.15.112   worker1.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
noobaa-endpoint-bfffdd599-wlktz                    0/1     Pending    0          7m52s   <none>          <none>                                <none>           <none>
noobaa-operator-5c46775cdd-9mgxt                   1/1     Running    0          7m52s   10.254.20.72    worker2.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
Step 5: Describing noobaa-db-pg-0 shows that the PVC is still attached to the worker0 node and can't be attached to the worker2 node.
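
The events below presumably come from describing the pod, e.g. with something like:

oc describe pod noobaa-db-pg-0 -n openshift-storage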
  
  Events:
  Type     Reason              Age                    From                     Message
  ----     ------              ----                   ----                     -------
  Normal   Scheduled           11m                    default-scheduler        Successfully assigned openshift-storage/noobaa-db-pg-0 to worker2.rkomandu-ta.cp.fyre.ibm.com
  Warning  FailedAttachVolume  11m                    attachdetach-controller  Multi-Attach error for volume "pvc-3e03cdb0-a374-4aed-bc3f-6e6f9ba74bca" Volume is already exclusively attached to one node and can't be attached to another
  Warning  FailedMount         2m42s (x4 over 9m31s)  kubelet                  Unable to attach or mount volumes: unmounted volumes=[db], unattached volumes=[db kube-api-access-89bwb noobaa-postgres-initdb-sh-volume noobaa-postgres-config-volume]: timed out waiting for the condition
  Warning  FailedMount         25s                    kubelet                  Unable to attach or mount volumes: unmounted volumes=[db], unattached volumes=[noobaa-postgres-initdb-sh-volume noobaa-postgres-config-volume db kube-api-access-89bwb]: timed out waiting for the condition
  Warning  FailedAttachVolume  16s (x10 over 4m31s)   attachdetach-controller  AttachVolume.Attach failed for volume "pvc-3e03cdb0-a374-4aed-bc3f-6e6f9ba74bca" : rpc error: code = Internal desc = ControllerPublishVolume : Error in getting filesystem Name for filesystem ID of 0D790B0A:61B0F1B9. Error [Get "https://ibm-spectrum-scale-gui.ibm-spectrum-scale:443/scalemgmt/v2/filesystems?filter=uuid=0D790B0A:61B0F1B9": context deadline exceeded (Client.Timeout exceeded while awaiting headers)]
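
One way to see which node is still holding the volume is to inspect the VolumeAttachment objects for the PV named in the events. This is a sketch; the PVC name db-noobaa-db-pg-0 is an assumption based on the usual StatefulSet claim naming, not something stated in the report:

# confirm which PV backs the db PVC (PVC name assumed)
oc get pvc db-noobaa-db-pg-0 -n openshift-storage

# list the attachment objects for that PV; the NODE column shows where it is still attached
oc get volumeattachments | grep pvc-3e03cdb0-a374-4aed-bc3f-6e6f9ba74bca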


This looks like a problem to me. Is there a way to get this resolved?

The downstream impact for the HPO team is that, while the database pod is stuck in the Init state, the HPO admin can't create any new accounts, exports, etc.

Temporary workaround:
On worker0, restarted the kubelet service that had been stopped earlier; the noobaa-db-pg pod then completed its move to worker2 without any problem. I understand that the pod movement is tied to the kubelet service; a sketch of the workaround follows below.
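
For completeness, a sketch of what that workaround amounts to (based on the description above, not a verbatim transcript):

# on worker0: bring the kubelet back so the volume can be detached cleanly
sudo systemctl start kubelet

# then watch the db pod complete its move to worker2
oc get pods -n openshift-storage -o wide -w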

Could you take a look at this defect and provide your thoughts/comments?

  



More information - Screenshots / Logs / Other output

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments:19 (7 by maintainers)

Top GitHub Comments

1 reaction
baum commented, Jan 18, 2022

@rkomandu thank you for this report!

Looks like a CSI issue, let me explain why.

According to the provided error: Multi-Attach error for volume “pvc-3e03cdb0-a374-4aed-bc3f-6e6f9ba74bca” Volume is already exclusively attached to one node and can’t be attached to another

Usually, Multi-Attach error upon k8s node failure indicates an issue with the volume provisioning - CSI driver. The storage provisioner is expected to react to node failure and detach the volume from the failing node since the pod using the volume was deleted.

Another clue is the following error: AttachVolume.Attach failed for volume “pvc-3e03cdb0-a374-4aed-bc3f-6e6f9ba74bca” : rpc error: code = Internal desc = ControllerPublishVolume : Error in getting filesystem Name for filesystem ID of 0D790B0A:61B0F1B9. Error [Get “https://ibm-spectrum-scale-gui.ibm-spectrum-scale:443/scalemgmt/v2/filesystems?filter=uuid=0D790B0A:61B0F1B9”: context deadline exceeded (Client.Timeout exceeded while awaiting headers)]

This is an indication that the kube-controller-manager (attachdetach-controller) fails to talk to the ibm-spectrum-scale CSI driver.
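
A minimal sketch of how one might check that the Spectrum Scale CSI pieces and the GUI endpoint named in the error are reachable; the CSI namespace below is an assumption, adjust it to the actual install:

# CSI driver pods (namespace is an assumption)
oc get pods -n ibm-spectrum-scale-csi

# the GUI service the driver calls, per the URL in the error message
oc get svc ibm-spectrum-scale-gui -n ibm-spectrum-scale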

Could you get input/feedback about this issue from the ibm-spectrum-scale CSI driver team?

Hope this helps, let me know if you need any additional info.

Best regards.

0 reactions
rkomandu commented, Feb 7, 2022

OK @nimrod-becker, based on the discussion with the CSI team and your post above, closing this defect for now.

Closed, as it is now in the CSI team's court.
