Mounting several volumes at once fails very frequently
We’re encountering problems when trying to mount multiple EFS volumes at once. The mount process gets stuck; when trying to debug the RPC there are occasional nfs: server 127.0.0.1 not responding, timed out
errors in the log (not sure if those are related – mount.efs should retry AFAIK). The stunnel processes serving the mount RPC connections seem to just be waiting for a connection, but nothing happens.
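For context, here is a minimal sketch of the kind of concurrent mount that triggers it for us. The file system IDs, mount points, and the plain mount invocation are illustrative only – in practice the mounts are issued by the EFS CSI driver, not a script like this:

```python
# Hypothetical reproduction sketch: mount several EFS file systems at once.
import subprocess
from concurrent.futures import ThreadPoolExecutor

FS_IDS = ["fs-11111111", "fs-22222222", "fs-33333333"]  # hypothetical IDs

def mount_one(fs_id: str) -> int:
    target = f"/mnt/{fs_id}"
    subprocess.run(["mkdir", "-p", target], check=True)
    try:
        # mount.efs with TLS spawns a per-mount stunnel; with several mounts
        # started at once, some of them hang here on CentOS 8
        proc = subprocess.run(
            ["mount", "-t", "efs", "-o", "tls", f"{fs_id}:/", target],
            timeout=120,
        )
        return proc.returncode
    except subprocess.TimeoutExpired:
        return -1  # the mount got stuck

with ThreadPoolExecutor(max_workers=len(FS_IDS)) as pool:
    print(list(pool.map(mount_one, FS_IDS)))
```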
This problem has been observed only on CentOS 8 (or CentOS Stream 8) running stunnel-5.56-5.el8_3 and openssl-libs-1.1.1k-5.el8_5. When trying Amazon Linux 2 with stunnel-4.56-6.amzn2.0.3 and openssl-libs-1.0.2k-19.amzn2.2.0.10, everything works OK. I suspected this was a race in stunnel, so I recompiled stunnel-5.56-5 and installed it on Amazon Linux 2, but the issue is again not reproducible there – so it’s not stunnel (or not stunnel by itself).
The issue also seems to be quite timing-sensitive. Increasing the log level or changing stunnel options seems to affect the probability of the problem showing up. I’ve tried removing the PID file creation (since issue #112 looked quite similar), but it doesn’t seem to help – I can still see the pending mounts. I also suspected issue #105, but even with that fixed (I hope – PR #119) the mounts still get stuck.
I wonder if the problem in issue #114 is related: we mostly encounter this through the efs-csi-driver on Kubernetes clusters when creating and removing multiple EFS volumes in the cluster in one shot.
I’m curious whether somebody has more insight or has encountered the problem: it looks like a combination of multiple factors causes this, and I have failed to find any interesting debugging clues.
Thanks. kvifern@ provided a workaround that launches a unit to monitor the file system in https://github.com/kubernetes-sigs/aws-efs-csi-driver/issues/616#issuecomment-1072965716.
Killing the stunnel process will make the watchdog relaunch a new stunnel, which will reconnect to the server. You can try that and verify whether it works. Meanwhile, we are actively looking into this kind of issue right now.
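For anyone wanting to try this without the full unit, here is a rough sketch of the idea, assuming a TLS mount whose stunnel was launched by efs-utils (so its command line references a config under /var/run/efs). The mount point, state directory, health check, and process matching are assumptions; the actual unit in the linked comment may differ:

```python
# Sketch: probe the mount point, and if it hangs, SIGTERM the efs-utils
# stunnel processes so the watchdog can respawn them.
import os
import signal
import subprocess

MOUNT_POINT = "/mnt/efs"     # hypothetical mount point
STATE_DIR = "/var/run/efs"   # assumed efs-utils state dir

def mount_is_responsive(path: str, timeout: int = 10) -> bool:
    # stat the mount point in a child process so a hung NFS mount
    # cannot block this script itself
    try:
        subprocess.run(["stat", "-t", path], timeout=timeout,
                       stdout=subprocess.DEVNULL, check=True)
        return True
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        return False

def kill_efs_stunnels() -> None:
    # find stunnel processes whose command line points at an efs-utils
    # config; the watchdog is expected to relaunch them afterwards
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/cmdline", "rb") as f:
                cmdline = f.read().decode(errors="replace")
        except OSError:
            continue
        if "stunnel" in cmdline and STATE_DIR in cmdline:
            os.kill(int(pid), signal.SIGTERM)

if not mount_is_responsive(MOUNT_POINT):
    kill_efs_stunnels()
```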
Closing as we’ve resolved this issue with the v1.34.4 release.