
All Persistent Volumes fail permanently after NAS reboot


Whenever I reboot the OS on the NAS that hosts my iSCSI democratic-csi volumes, all containers that rely on those volumes consistently fail, even after the NAS comes back online, with the following error:

  Warning  FailedMount  37s               kubelet            MountVolume.MountDevice failed for volume "pvc-da280e70-9bcb-41ba-bbbd-cbf973580c6e" : rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Warning  FailedMount  34s               kubelet            Unable to attach or mount volumes: unmounted volumes=[config], unattached volumes=[config media transcode kube-api-access-2c2w7 backup]: timed out waiting for the condition
  Warning  FailedMount  5s (x6 over 37s)  kubelet            MountVolume.MountDevice failed for volume "pvc-da280e70-9bcb-41ba-bbbd-cbf973580c6e" : rpc error: code = Aborted desc = operation locked due to in progress operation(s): ["volume_id_pvc-da280e70-9bcb-41ba-bbbd-cbf973580c6e"]
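
For reference, events like these can be pulled for the whole namespace with plain kubectl (nothing specific to democratic-csi; media is the namespace used in the scale command below):

# List recent events in the media namespace, newest last, keeping only the
# mount failures.
kubectl -n media get events --sort-by=.lastTimestamp | grep FailedMount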

I have tried suspending all pods with kubectl scale -n media deploy/plex --replicas 0 to ensure that nothing is using the volume during the reboot.

Unfortunately I know almost nothing about iSCSI, so it’s entirely possible this is 100% my fault. What is the proper process with iSCSI for rebooting either the NAS or the nodes using PVs on the NAS, so that this lockup doesn’t happen? Is there an iscsiadm command I can use to remove this deadlock and let the new container access the PV?

My democratic-csi config is:

---
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: csi-iscsi
  namespace: storage
spec:
  interval: 5m
  chart:
    spec:
      chart: democratic-csi
      version: 0.13.4
      sourceRef:
        kind: HelmRepository
        name: democratic-csi-charts
        namespace: flux-system
      interval: 5m
  values:
    csiDriver:
      name: "org.democratic-csi.iscsi"

    storageClasses:
    - name: tank-iscsi-csi
      defaultClass: true
      reclaimPolicy: Delete
      ## For testing
      # reclaimPolicy: Retain
      volumeBindingMode: Immediate
      allowVolumeExpansion: true
      parameters:
        fsType: ext4

    driver:
      image: docker.io/democraticcsi/democratic-csi:v1.7.6
      imagePullPolicy: IfNotPresent
      config:
        driver: zfs-generic-iscsi
      existingConfigSecret: zfs-generic-iscsi-config

And the driver config is:

apiVersion: v1
kind: Secret
metadata:
    name: zfs-generic-iscsi-config
    namespace: storage
stringData:
    driver-config-file.yaml: |
        driver: zfs-generic-iscsi
        sshConnection:
            host: ${UIHARU_IP}
            port: 22
            username: root
            privateKey: |
                -----BEGIN OPENSSH PRIVATE KEY-----
                ...
                -----END OPENSSH PRIVATE KEY-----
        zfs:
            datasetParentName: sltank/k8s/iscsiv
            detachedSnapshotsDatasetParentName: sltank/k8s/iscsis
        iscsi:
            shareStrategy: "targetCli"
            shareStrategyTargetCli:
                basename: "iqn.2016-04.com.open-iscsi:a6b73d4196"
                tpg:
                    attributes:
                        authentication: 0
                        generate_node_acls: 1
                        cache_dynamic_acls: 1
                        demo_mode_write_protect: 0
            targetPortal: "${UIHARU_IP}"

Not sure what other info is important, but I’d be happy to provide anything else that might help troubleshoot the issue.

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 11 (6 by maintainers)

Top GitHub Comments

1 reaction
travisghansen commented, Sep 5, 2022

Yeah, that’s a dangerous situation (which is why when iSCSI goes down the volumes go into read-only mode). Two nodes using the same block device simultaneously is not something you want happening. I would use something like kured (https://github.com/weaveworks/kured) or similar to simply trigger all your nodes to cycle so the workloads shift around and everything comes up clean.
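
As a rough illustration of what kured automates, cycling a single node by hand looks something like the following (the node name is made up, and the flags assume a reasonably recent kubectl):

# Drain the node so its workloads reschedule elsewhere, reboot it, then let it
# accept workloads again once it reports Ready.
kubectl drain k8s-node-1 --ignore-daemonsets --delete-emptydir-data
ssh k8s-node-1 sudo reboot
kubectl uncordon k8s-node-1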

1 reaction
travisghansen commented, Aug 31, 2022

Ah, this is a tricky one, and I’m glad you opened this. There are a couple of issues at play here:

  • democratic-csi ensures that no two (possibly conflicting) operations happen at the same time, which it does by taking an in-memory lock
  • iSCSI as a protocol generally does not handle this situation well, and recovery actually requires all your pods using iSCSI volumes to restart

The first can be remedied by deleting all the democratic-csi pods and just letting them restart (see the sketch below). The latter requires you to handle each workload on a case-by-case basis.
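
A minimal sketch of clearing that in-memory lock by restarting the driver pods; the label selector is an assumption based on common chart labels and may differ in your install:

# Delete the democratic-csi pods in the storage namespace; their controllers
# recreate them, and the new pods start with a clean lock state.
kubectl -n storage delete pods -l app.kubernetes.io/name=democratic-csi
# If the selector doesn't match anything, list the pods and delete them by name:
# kubectl -n storage get pods
# kubectl -n storage delete pod <pod-name>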

Essentially, if the NAS goes down and comes back up, the iSCSI sessions on the node (assuming they recover) go read-only. The only way to remedy that (via k8s) is to restart the pods as appropriate… and even then, in some cases that may not be enough and would require forcing the workload to a new node. I’ll do some research on possible ways to go to the CLI of the nodes directly and get them back into a read-write state manually, without any other intervention at the k8s layer.
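
For reference, generic node-side iSCSI handling (not specific to democratic-csi, and not a confirmed fix for the read-only case) looks roughly like this:

# On the affected node, as root. Plain open-iscsi commands, shown as a sketch
# only; whether this is enough to leave read-only mode depends on the
# filesystem and kernel state.
iscsiadm -m session                  # list active sessions and their targets
iscsiadm -m session --rescan         # rescan all sessions once the NAS is back
# If a filesystem came back mounted read-only, a remount may help (find the
# mountpoint with findmnt or mount | grep iscsi first):
mount -o remount,rw /path/to/mountpoint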

