
HPO 659: Mixed read/write with MD5 checking halts all I/O on all MetalLB IPs and does not recover unless the NooBaa pods are restarted

See original GitHub issue

Environment info

  • NooBaa Version:

    [root@c83f1-app1 ~]# noobaa status
    INFO[0000] CLI version: 5.9.2
    INFO[0000] noobaa-image: noobaa/noobaa-core:nsfs_backport_5.9-20220331
    INFO[0000] operator-image: quay.io/rhceph-dev/odf4-mcg-rhel8-operator@sha256:01a31a47a43f01c333981056526317dfec70d1072dbd335c8386e0b3f63ef052
    INFO[0000] noobaa-db-image: quay.io/rhceph-dev/rhel8-postgresql-12@sha256:98990a28bec6aa05b70411ea5bd9c332939aea02d9d61eedf7422a32cfa0be54

  • Platform:

    [root@c83f1-app1 ~]# oc get csv
    NAME                  DISPLAY                       VERSION   REPLACES              PHASE
    mcg-operator.v4.9.5   NooBaa Operator               4.9.5     mcg-operator.v4.9.4   Succeeded
    ocs-operator.v4.9.5   OpenShift Container Storage   4.9.5     ocs-operator.v4.9.4   Succeeded
    odf-operator.v4.9.5   OpenShift Data Foundation     4.9.5     odf-operator.v4.9.4   Succeeded

Actual behavior

This is not the same as issue 6930.

In the issue I am opening here, it is true that the node remained in the Ready state, so I do not expect any IP failover. This defect is not about MetalLB IPs failing to fail over. In this defect, I/O was running against MetalLB IP 172.20.100.31, which belongs to node master1. On node master0, in the CNSA Spectrum Scale core pod (namespace ibm-spectrum-scale), mmshutdown was issued for just that node. The other nodes remained active with the filesystem mounted. Master0 has MetalLB IP 172.20.100.30, and no I/O was going to that IP.

What was observed after mmshutdown on master0 was that all I/O going to 172.20.100.31 stopped. Because of issue 6930 there was no failover; that is fine and expected. What is not expected is for all I/O to stop.

sh-4.4# date; mmshutdown
Tue Apr  5 17:10:28 UTC 2022
Tue Apr  5 17:10:28 UTC 2022: mmshutdown: Starting force unmount of GPFS file systems
Tue Apr  5 17:10:34 UTC 2022: mmshutdown: Shutting down GPFS daemons
Shutting down!
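
For reference, the cluster-wide daemon state after the shutdown can be confirmed from any surviving Scale core pod. A minimal sketch using standard GPFS commands (this exact output was not captured here):

sh-4.4# mmgetstate -a        # master0 should report "down", master1/master2 "active"
sh-4.4# mmlsmount all -L     # the filesystem should still be mounted on the surviving nodes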

When mmshutdown was issued, the only error in the NooBaa endpoint pods was "Stale file handle".

Logs showing the stale file handle:

[pod/noobaa-endpoint-7fdb5b75fd-t99nd/endpoint] Apr-5 17:12:26.930 [Endpoint/14] [ERROR] core.endpoint.s3.s3_rest:: S3 ERROR <?xml version="1.0" encoding="UTF-8"?><Error><Code>InternalError</Code><Message>We encountered an internal error. Please try again.</Message><Resource>/s5001b85</Resource><RequestId>l1mefzt6-3wj6yz-8x</RequestId></Error> PUT /s5001b85 {"host":"172.20.100.30","accept-encoding":"identity","user-agent":"aws-cli/2.3.2 Python/3.8.8 Linux/4.18.0-240.el8.x86_64 exe/x86_64.rhel.8 prompt/off command/s3.mb","x-amz-date":"20220405T171225Z","x-amz-content-sha256":"61d056dc66f1882c0f4053be381523a7a28d384abde04fcf5b0021c716bb0ea1","authorization":"AWS4-HMAC-SHA256 Credential=QzhyXj9wVDH9DvnK97L9/20220405/us-east-1/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=6d9fa5c22501bfed4f312ac47621b6cec691bf1cf8f719e8250fcdc0f61522f1","content-length":"154"} Error: Stale file handle
[pod/noobaa-endpoint-7fdb5b75fd-t99nd/endpoint] Apr-5 17:12:27.589 [Endpoint/14]    [L0] core.sdk.bucketspace_nb:: could not create underlying directory - nsfs, deleting bucket [Error: Stale file handle] { code: 'Unknown system error -116' }
[pod/noobaa-endpoint-7fdb5b75fd-t99nd/endpoint] Apr-5 17:12:27.793 [Endpoint/14] [ERROR] core.endpoint.s3.s3_rest:: S3 ERROR <?xml version="1.0" encoding="UTF-8"?><Error><Code>InternalError</Code><Message>We encountered an internal error. Please try again.</Message><Resource>/s5001b85</Resource><RequestId>l1meg1if-7apxvy-1bas</RequestId></Error> PUT /s5001b85 {"host":"172.20.100.30","accept-encoding":"identity","user-agent":"aws-cli/2.3.2 Python/3.8.8 Linux/4.18.0-240.el8.x86_64 exe/x86_64.rhel.8 prompt/off command/s3.mb","x-amz-date":"20220405T171227Z","x-amz-content-sha256":"61d056dc66f1882c0f4053be381523a7a28d384abde04fcf5b0021c716bb0ea1","authorization":"AWS4-HMAC-SHA256 Credential=QzhyXj9wVDH9DvnK97L9/20220405/us-east-1/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=bb34c5aa7ed665ae73a2d172ab33d4056c2611ad2c09331ece80089cff46df05","content-length":"154"} Error: Stale file handle
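
A quick way to pull these errors from the endpoint pods is something like the following sketch; the openshift-storage namespace, the noobaa-endpoint deployment name, and the noobaa-s3=noobaa label are assumptions, so verify them on the cluster first:

oc -n openshift-storage get pods -l noobaa-s3=noobaa --show-labels
oc -n openshift-storage logs deploy/noobaa-endpoint --tail=500 | grep -i "stale file handle"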

This error is a bit odd because it appears on the endpoint pod for master0. Master0's MetalLB IP was 172.20.100.30, while the Cosbench workload was only set up against 172.20.100.31.

An additional observation is that the s3 list command still works, but writes do not:

[root@c83f1-dan4 RW_workloads]# date; s5001_2_31 ls
Tue Apr  5 18:33:40 EDT 2022
2022-04-05 18:33:43 s5001b100
2022-04-05 18:33:43 s5001b63
2022-04-05 18:33:43 s5001b62
2022-04-05 18:33:43 s5001b61


[root@c83f1-dan4 RW_workloads]# date;  s5001_2_31 cp alias_commands s3://s5001b1
Tue Apr  5 18:35:36 EDT 2022
^C^Cfatal error:
^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "concurrent/futures/thread.py", line 40, in _python_exit
  File "threading.py", line 1011, in join
  File "threading.py", line 1027, in _wait_for_tstate_lock
KeyboardInterrupt
[root@c83f1-dan4 RW_workloads]# date
Tue Apr  5 18:36:43 EDT 2022
[root@c83f1-dan4 RW_workloads]#
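
The same hanging PUT can be bounded with timeout so it fails cleanly instead of needing Ctrl-C. A sketch, assuming the s5001_2_31 alias wraps aws s3 with --endpoint-url pointing at 172.20.100.31 (scheme and credentials are assumptions, not shown here):

date
timeout 60 aws --endpoint-url http://172.20.100.31 s3 cp alias_commands s3://s5001b1
echo "exit=$?"        # 124 from timeout means the PUT never completed within 60 seconds
date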

All subsequent PUTs to 172.20.100.31 and 172.20.100.32 get a timeout (if I don't Ctrl-C them), and the endpoint pods record "Error: Semaphore Timeout". From .31 and .32 we can still do GETs and we can read from the NooBaa database. If we rsh into the endpoint pods for 172.20.100.31 and 172.20.100.32, we see that Spectrum Scale is still mounted in the correct place and we can write to it manually with touch. So the IPs .31 and .32 are still alive and the NooBaa DB is still online. It also tells us that the Spectrum Scale filesystem is still mounted and writable. The timeouts on the subsequent PUTs tell us that the connection is accepted but a response never comes back.
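
The manual checks described above look roughly like this (sketch; the endpoint pod name, the openshift-storage namespace, and the Scale mount path are placeholders to be looked up on the cluster):

oc -n openshift-storage rsh <endpoint-pod-for-172.20.100.31>
# inside the pod:
mount | grep -i gpfs                                            # Spectrum Scale is still mounted where expected
touch <scale-mount-path>/probe && rm <scale-mount-path>/probe   # a manual write still succeeds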

The endpoint pods never restarted and they still have their labels.

Also, in the Scale core pod we run mmhealth node show -N all and see that everything is HEALTHY, except of course the one node on which we ran mmshutdown.

  sh-4.4# mmhealth node show

Node name:      master1-daemon
Node status:    HEALTHY
Status Change:  1 day ago

Component      Status        Status Change     Reasons & Notices
----------------------------------------------------------------
GPFS           HEALTHY       1 day ago         -
GUI            HEALTHY       1 day ago         -
NETWORK        HEALTHY       9 days ago        -
FILESYSTEM     HEALTHY       9 days ago        -
NOOBAA         HEALTHY       26 min. ago       -
PERFMON        HEALTHY       1 day ago         -
THRESHOLD      HEALTHY       9 days ago        -
sh-4.4# set -o vi
sh-4.4# mmhealth node show -N all

Node name:      master0-daemon
Node status:    FAILED
Status Change:  36 min. ago

Component      Status        Status Change     Reasons & Notices
--------------------------------------------------------------------------------
GPFS           FAILED        36 min. ago       gpfs_down, quorum_down
NETWORK        HEALTHY       1 day ago         -
FILESYSTEM     DEPEND        36 min. ago       unmounted_fs_check(remote-sample)
PERFMON        HEALTHY       1 day ago         -
THRESHOLD      HEALTHY       1 day ago         -

Node name:      master1-daemon
Node status:    HEALTHY
Status Change:  1 day ago

Component      Status        Status Change     Reasons & Notices
----------------------------------------------------------------
GPFS           HEALTHY       1 day ago         -
GUI            HEALTHY       1 day ago         -
NETWORK        HEALTHY       9 days ago        -
FILESYSTEM     HEALTHY       9 days ago        -
NOOBAA         HEALTHY       27 min. ago       -
PERFMON        HEALTHY       1 day ago         -
THRESHOLD      HEALTHY       9 days ago        -

Node name:      master2-daemon
Node status:    HEALTHY
Status Change:  1 day ago

Component       Status        Status Change     Reasons & Notices
-----------------------------------------------------------------
CALLHOME        HEALTHY       9 days ago        -
GPFS            HEALTHY       1 day ago         -
NETWORK         HEALTHY       9 days ago        -
FILESYSTEM      HEALTHY       9 days ago        -
GUI             HEALTHY       3 days ago        -
HEALTHCHECK     HEALTHY       9 days ago        -
PERFMON         HEALTHY       9 days ago        -
THRESHOLD       HEALTHY       9 days ago        -

Something is obviously hung in the PUT path, but the logs and NooBaa health do not point to anything. When we issue mmstartup, the PUTs still fail. The only way to recover is to delete the NooBaa endpoint pods and let new ones be created.
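
Spelled out, the recovery workaround looks like this sketch (namespace and label selector are assumptions, confirm them with oc get pods --show-labels):

oc -n openshift-storage delete pods -l noobaa-s3=noobaa          # the deployment recreates the endpoint pods
# or, equivalently:
oc -n openshift-storage rollout restart deployment/noobaa-endpoint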

I have been able to recreate this very easily, so if required I can set it up on my test stand.

Expected behavior

  1. When doing mmshutdown on one node, it should not impact cluster-wide I/O capability; it should not be an outage. If an outage is indeed expected, then mmstartup should restore I/O capability.

Steps to reproduce

  1. Start a Cosbench run (I can provide the XML if needed). Once I/O is running, issue mmshutdown from within one CNSA IBM Spectrum Scale core pod (see the rough walk-through below).
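
A rough walk-through of that step (the core pod name is a placeholder, look it up first):

oc -n ibm-spectrum-scale get pods                    # find the Scale core pod on the target node
oc -n ibm-spectrum-scale rsh <core-pod-on-master0>
sh-4.4# date; mmshutdown                             # stop GPFS on this node only
# later, bring the daemon back (this did not recover the PUTs in my testing):
sh-4.4# mmstartup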

More information - Screenshots / Logs / Other output

Must-gather and noobaa diagnose output are in https://ibm.ent.box.com/folder/145794528783?s=uueh7fp424vxs2bt4ndrnvh7uusgu6tocd
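
For reference, those bundles can be regenerated with the standard tooling (sketch; the openshift-storage namespace is assumed):

noobaa diagnose -n openshift-storage       # NooBaa diagnostics bundle
oc adm must-gather                         # OpenShift must-gather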

This issue started as HPO https://github.ibm.com/IBMSpectrumScale/hpo-core/issues/659. Screeners determined that it was a NooBaa issue. I have also Slacked the CNSA team for input but have not heard back.

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments:51 (30 by maintainers)

Top GitHub Comments

2 reactions
MonicaLemay commented, May 13, 2022

I tested the patch today and it works. I no longer see the issue. Thank you.

1 reaction
MonicaLemay commented, Apr 19, 2022

When files are 100-200 MB in size, the condition does not occur; this defect does not happen.
