AMQP inter-module communications are lost by transient workload API error

While testing #1541, I hit a transient module disconnection that never recovered automatically, even though the ModuleClient.ConnectionStateChanged event handler reported Disconnected_Retrying. The connection on the Edge Hub side recovered after some minutes, but the connections on the app module side did not. I think the AMQP SAS token refresh loop died because the workload API failure was not handled correctly.

Expected Behavior

The AMQP connection recovers automatically (with SAS token refresh) once the transient error is resolved.

Current Behavior

The AMQP connection never recovers automatically. We have to recreate the whole ModuleClient, as Edge Hub does (which requires every module to implement local buffering so that in-flight telemetry messages are not lost).
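
A minimal sketch of that workaround, written in Python rather than against the .NET SDK. It assumes a hypothetical client object whose send() raises ConnectionError while the link is down; BufferingSender and FakeClient are illustrative names, not SDK types. Messages stay queued until a send succeeds, and the client is recreated after each failure:

```python
from collections import deque


class FakeClient:
    """Hypothetical client: send() fails while state["broken"] is set."""

    def __init__(self, state):
        self.state = state

    def send(self, message):
        if self.state["broken"]:
            raise ConnectionError("link is down")
        self.state["delivered"].append(message)


class BufferingSender:
    """Queues messages locally and recreates the client when a send fails."""

    def __init__(self, client_factory):
        self._factory = client_factory
        self._client = client_factory()
        self._pending = deque()

    def send(self, message):
        # Always enqueue first so a failed send cannot lose the message.
        self._pending.append(message)
        return self.flush()

    def flush(self):
        # Deliver in order; stop at the first failure and keep the rest queued.
        while self._pending:
            try:
                self._client.send(self._pending[0])
            except ConnectionError:
                self._client = self._factory()  # recreate the whole client
                return False
            self._pending.popleft()
        return True
```

A real module would also need a bound on the queue and probably persistence across restarts; both are left out here.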

Steps to Reproduce

See “Additional Info” for details.

  1. Run modules that use AMQP connections.
  2. Run a module that exhausts the machine’s memory to cause an out-of-memory condition in dockerd.
  3. Wait for the AMQP SAS token refresh, hoping the OOM killer does not kill the memory-exhausting module first.

Context (Environment)

Output of iotedge check

Configuration checks
--------------------
√ config.yaml is well-formed - OK
√ config.yaml has well-formed connection string - OK
√ container engine is installed and functional - OK
√ config.yaml has correct hostname - OK
√ config.yaml has correct URIs for daemon mgmt endpoint - OK
√ latest security daemon - OK
√ host time is close to real time - OK
√ container time is close to host time - OK
? DNS server - Warning
    Container engine is not configured with DNS server setting, which may impact connectivity to IoT Hub.
    Please see https://aka.ms/iotedge-prod-checklist-dns for best practices.
    You can ignore this warning if you are setting DNS server per module in the Edge deployment.
? production readiness: certificates - Warning
    Device is using self-signed, automatically generated certs.
    Please see https://aka.ms/iotedge-prod-checklist-certs for best practices.
? production readiness: certificates expiry - Warning
    Device CA certificate in /var/lib/iotedge/hsm/certs/device_ca_aliasazxVrrdEVxd7kvKvne1pOEyuSHF8EXSowNDhMzl30jI_.cert.pem will expire soon (2019-08-13 05:26:01 UTC)
√ production readiness: container engine - OK
? production readiness: logs policy - Warning
    Container engine is not configured to rotate module logs which may cause it run out of disk space.
    Please see https://aka.ms/iotedge-prod-checklist-logs for best practices.
    You can ignore this warning if you are setting log policy per module in the Edge deployment.

Connectivity checks
-------------------
√ host can connect to and perform TLS handshake with IoT Hub AMQP port - OK
√ host can connect to and perform TLS handshake with IoT Hub HTTPS / WebSockets port - OK
√ host can connect to and perform TLS handshake with IoT Hub MQTT port - OK
√ container on the default network can connect to IoT Hub AMQP port - OK
√ container on the default network can connect to IoT Hub HTTPS / WebSockets port - OK
√ container on the default network can connect to IoT Hub MQTT port - OK
√ container on the IoT Edge module network can connect to IoT Hub AMQP port - OK
√ container on the IoT Edge module network can connect to IoT Hub HTTPS / WebSockets port - OK
√ container on the IoT Edge module network can connect to IoT Hub MQTT port - OK
√ Edge Hub can bind to ports on host - OK

19 check(s) succeeded.
4 check(s) raised warnings. Re-run with --verbose for more details.

Device (Host) Operating System

Architecture

amd64

Container Operating System

Debian 9

Runtime Versions

iotedged

iotedged 1.0.8 (208b2204fd30e856d00b280112422130c104b9f0)

Edge Agent

1.0.8 (1.0.8.1 was used as log header)

Edge Hub

1.0.8 (1.0.8.1 was used as log header)

Docker

Docker version 3.0.5, build ba9934d4

Logs

iotedged logs

Note that timestamps are UTC+09:00, so please read “Sep 2 01:22:45” as “Sep 1 16:22:45 (UTC)”.
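
For reference, the conversion in the note above can be reproduced with a few lines of Python. This is only a sketch: the helper name syslog_to_utc is made up for illustration, and the year is an assumption taken from the iotedged log line, since syslog timestamps do not carry one.

```python
from datetime import datetime, timedelta, timezone

# The syslog timestamps are local time at UTC+09:00.
LOCAL_TZ = timezone(timedelta(hours=9))


def syslog_to_utc(stamp, year=2019):
    """Convert a syslog timestamp such as 'Sep  2 01:22:45' to UTC."""
    # Syslog omits the year, so it is supplied by the caller (assumed 2019).
    local = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    return local.replace(tzinfo=LOCAL_TZ).astimezone(timezone.utc)


print(syslog_to_utc("Sep  2 01:22:45").isoformat())
```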

Sep  2 01:22:45 hostname dockerd[3070]: 8540, 0x1, 0xd446a0)\n\t/usr/local/go/src/runtime/malloc.go:939 +0x76e fp=0x7fffab337e30 sp=0x7fffab337d90 pc=0x40cb3e\nruntime.newobject(0x818540, 0x4000)\n\t/usr/local/go/src/runtime/malloc.go:1068 +0x38 fp=0x7fffab337e60 sp=0x7fffab337e30 pc=0x40cf48\nruntime.malg(0xdfeb00008000, 0xd28630)\n\t/usr/local/go/src/runtime/proc.go:3220 +0x31 fp=0x7fffab337ea0 sp=0x7fffab337e60 pc=0x435f41\nruntime.mpreinit(...)\n\t/usr/local/go/src/runtime/os_linux.go:311\nruntime.mcommoninit(0xd212c0)\n\t/usr/local/go/src/runtime/proc.go:618 +0xc2 fp=0x7fffab337ed8 sp=0x7fffab337ea0 pc=0x42f8f2\nruntime.schedinit()\n\t/usr/local/go/src/runtime/proc.go:540 +0x74 fp=0x7fffab337f30 sp=0x7fffab337ed8 pc=0x42f584\nruntime.rt0_go(0x7fffab338098, 0xb, 0x7fffab338098, 0x400370, 0x6df5e3, 0x0, 0xb00000000, 0x7fffab338098, 0x4572e0, 0x400370, ...)\n\t/usr/local/go/src/runtime/asm_amd64.s:195 +0x11a fp=0x7fffab337f38 sp=0x7fffab337f30 pc=0x45740a\n: unknown"
Sep  2 01:22:45 hostname iotedged[1084]: 2019-09-01T16:22:45Z [INFO] - [work] - - - [2019-09-01 16:22:45.751104275 UTC] "POST /modules/%24edgeHub/genid/637003371645686672/sign?api-version=2019-01-30 HTTP/1.1" 404 Not Found 30 "-" "-" auth_id(-)

edge-agent logs

(omitted)

edge-hub logs

<4> 2019-09-01 16:22:45.984 +00:00 [WRN] [Microsoft.Azure.Devices.Edge.Hub.CloudProxy.CloudConnection] - Error creating cloud connection for client LongRunDevice2/gwapp
Microsoft.Azure.Devices.Edge.Util.Edged.WorkloadCommunicationException- Message:Error calling SignAsync: Module not found, StatusCode:404, at:   at Microsoft.Azure.Devices.Edge.Util.Edged.Version_2019_01_30.WorkloadClient.HandleException(Exception ex, String operation) in /home/vsts/work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/edged/version_2019_01_30/WorkloadClient.cs:line 106
   at Microsoft.Azure.Devices.Edge.Util.Edged.WorkloadClientVersioned.Execute[T](Func`1 func, String operation) in /home/vsts/work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/edged/WorkloadClientVersioned.cs:line 66
   at Microsoft.Azure.Devices.Edge.Util.Edged.Version_2019_01_30.WorkloadClient.SignAsync(String keyId, String algorithm, String data) in /home/vsts/work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/edged/version_2019_01_30/WorkloadClient.cs:line 96
   at Microsoft.Azure.Devices.Edge.Util.ClientTokenProvider.GetTokenAsync(Option`1 ttl) in /home/vsts/work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/ClientTokenProvider.cs:line 53
   at Microsoft.Azure.Devices.Client.AuthenticationWithTokenRefresh.GetTokenAsync(String iotHub)
   at Microsoft.Azure.Devices.Client.IotHubConnectionString.Microsoft.Azure.Amqp.ICbsTokenProvider.GetTokenAsync(Uri namespaceAddress, String appliesTo, String[] requiredClaims)
   at Microsoft.Azure.Amqp.TaskHelpers.EndAsyncResult(IAsyncResult asyncResult)
   at Microsoft.Azure.Amqp.IteratorAsyncResult`1.StepCallback(IAsyncResult result)
--- End of stack trace from previous location where exception was thrown ---
   at Microsoft.Azure.Amqp.AsyncResult.End[TAsyncResult](IAsyncResult result)
   at Microsoft.Azure.Amqp.AmqpCbsLink.<>c__DisplayClass4_0.<SendTokenAsync>b__1(IAsyncResult a)
   at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)
--- End of stack trace from previous location where exception was thrown ---
   at Microsoft.Azure.Devices.Client.Transport.Amqp.AmqpAuthenticationRefresher.InitLoopAsync(TimeSpan timeout)
   at Microsoft.Azure.Devices.Client.Transport.Amqp.AmqpConnectionHolder.AuthenticationRefresherCreator(DeviceIdentity deviceIdentity, TimeSpan timeout)
   at Microsoft.Azure.Devices.Client.Transport.Amqp.AmqpUnit.OpenAsync(TimeSpan timeout)
   at Microsoft.Azure.Devices.Client.Transport.Amqp.AmqpTransportHandler.OpenAsync(CancellationToken cancellationToken)
   at Microsoft.Azure.Devices.Client.Transport.ProtocolRoutingDelegatingHandler.OpenAsync(CancellationToken cancellationToken)
   at Microsoft.Azure.Devices.Client.Transport.ErrorDelegatingHandler.<>c__DisplayClass22_0.<<ExecuteWithErrorHandlingAsync>b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at Microsoft.Azure.Devices.Client.Transport.ErrorDelegatingHandler.ExecuteWithErrorHandlingAsync[T](Func`1 asyncOperation)
   at Microsoft.Azure.Devices.Client.Transport.RetryDelegatingHandler.<>c__DisplayClass32_0.<<OpenAsyncInternal>b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at Microsoft.Azure.Devices.Client.Transport.RetryDelegatingHandler.EnsureOpenedAsync(CancellationToken cancellationToken)
   at Microsoft.Azure.Devices.Client.InternalClient.OpenAsync()
   at Microsoft.Azure.Devices.Edge.Hub.CloudProxy.ModuleClientWrapper.OpenAsync() in /home/vsts/work/1/s/edge-hub/src/Microsoft.Azure.Devices.Edge.Hub.CloudProxy/ModuleClientWrapper.cs:line 48
   at Microsoft.Azure.Devices.Edge.Hub.CloudProxy.ConnectivityAwareClient.<>c__DisplayClass28_0.<<InvokeFunc>b__0>d.MoveNext() in /home/vsts/work/1/s/edge-hub/src/Microsoft.Azure.Devices.Edge.Hub.CloudProxy/ConnectivityAwareClient.cs:line 174
--- End of stack trace from previous location where exception was thrown ---
   at Microsoft.Azure.Devices.Edge.Hub.CloudProxy.ConnectivityAwareClient.InvokeFunc[T](Func`1 func, String operation, Boolean useForConnectivityCheck) in /home/vsts/work/1/s/edge-hub/src/Microsoft.Azure.Devices.Edge.Hub.CloudProxy/ConnectivityAwareClient.cs:line 134
   at Microsoft.Azure.Devices.Edge.Hub.CloudProxy.ConnectivityAwareClient.InvokeFunc[T](Func`1 func, String operation, Boolean useForConnectivityCheck) in /home/vsts/work/1/s/edge-hub/src/Microsoft.Azure.Devices.Edge.Hub.CloudProxy/ConnectivityAwareClient.cs:line 162
   at Microsoft.Azure.Devices.Edge.Hub.CloudProxy.CloudConnection.ConnectToIoTHub(ITokenProvider newTokenProvider) in /home/vsts/work/1/s/edge-hub/src/Microsoft.Azure.Devices.Edge.Hub.CloudProxy/CloudConnection.cs:line 127
   at Microsoft.Azure.Devices.Edge.Hub.CloudProxy.CloudConnection.CreateNewCloudProxyAsync(ITokenProvider newTokenProvider) in /home/vsts/work/1/s/edge-hub/src/Microsoft.Azure.Devices.Edge.Hub.CloudProxy/CloudConnection.cs:line 102
   at Microsoft.Azure.Devices.Edge.Hub.CloudProxy.CloudConnection.Create(IIdentity identity, Action`2 connectionStatusChangedHandler, ITransportSettings[] transportSettings, IMessageConverterProvider messageConverterProvider, IClientProvider clientProvider, ICloudListener cloudListener, ITokenProvider tokenProvider, TimeSpan idleTimeout, Boolean closeOnIdleTimeout, TimeSpan operationTimeout, String productInfo) in /home/vsts/work/1/s/edge-hub/src/Microsoft.Azure.Devices.Edge.Hub.CloudProxy/CloudConnection.cs:line 91
   at Microsoft.Azure.Devices.Edge.Hub.CloudProxy.CloudConnectionProvider.<>c__DisplayClass16_1.<<Connect>b__2>d.MoveNext() in /home/vsts/work/1/s/edge-hub/src/Microsoft.Azure.Devices.Edge.Hub.CloudProxy/CloudConnectionProvider.cs:line 139
--- End of stack trace from previous location where exception was thrown ---
   at Microsoft.Azure.Devices.Edge.Hub.CloudProxy.CloudConnectionProvider.Connect(IIdentity identity, Action`2 connectionStatusChangedHandler) in /home/vsts/work/1/s/edge-hub/src/Microsoft.Azure.Devices.Edge.Hub.CloudProxy/CloudConnectionProvider.cs:line 133

Additional Information

Hypothesis

After investigating the logs and the IoT Edge / .NET Device SDK sources, I formed the following hypothesis:

  1. Because of some bug, the edge device’s memory was exhausted (fact).
  2. Because memory was low, dockerd (a part of the IoT Edge runtime) could not allocate any more memory (fact, from syslog).
  3. The AMQP SAS token for inter-module communication expired (fact, from the Edge Hub log).
  4. AmqpAuthenticationRefresher tried to refresh the SAS token via IotHubConnectionString.Microsoft.Azure.Amqp.ICbsTokenProvider.GetTokenAsync().
    1. It called the token acquisition logic for IoT Edge (Microsoft.Azure.Devices.Edge.Util.ClientTokenProvider.GetTokenAsync()).
    2. ClientTokenProvider called the edgelet (iotedged) workload API (sign API) to generate a SAS token.
    3. The edgelet tried to authenticate the caller (my IoT Edge module).
      1. The edgelet called the Docker API in dockerd to list the known module (Docker container) processes and obtain the PIDs allowed to call the sign API.
      2. dockerd returned an error because it was out of memory.
      3. Authentication failed.
    4. The edgelet returned a 404 error because of the authentication failure.
    5. ClientTokenProvider threw WorkloadCommunicationException.
  5. AmqpAuthenticationRefresher did not catch WorkloadCommunicationException (it only catches AmqpException), so the token refresh loop terminated.
  6. The AMQP connection transitioned to the Disconnected_Retrying state because of the (transient) authentication failure and (possibly) started retrying connection recovery. But recovery can never complete, because the SAS token refresh is no longer running.
  7. After a while, the edge device’s low-memory condition was resolved, but the AMQP connection did not recover because the SAS token refresh had stopped.
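
The failure mode in steps 5 to 7 can be illustrated with a small asyncio simulation. This is a sketch of the hypothesis only, not the SDK’s code: AmqpError and WorkloadCommunicationError are stand-in exception classes, and refresh_loop, make_provider and run are hypothetical helpers. The loop handles one exception type; any other type escapes, ends the task, and no further tokens are requested even after the provider starts working again.

```python
import asyncio


class AmqpError(Exception):
    """Stand-in for the one exception type the loop handles."""


class WorkloadCommunicationError(Exception):
    """Stand-in for the custom error raised by the token provider."""


async def refresh_loop(get_token, interval, log):
    # Only AmqpError is caught; anything else ends the loop for good.
    while True:
        try:
            log.append(await get_token())
        except AmqpError:
            log.append("retry")
        await asyncio.sleep(interval)


def make_provider(outcomes):
    # Hypothetical provider: replays scripted outcomes, then always succeeds.
    pending = list(outcomes)

    async def get_token():
        outcome = pending.pop(0) if pending else "token"
        if isinstance(outcome, Exception):
            raise outcome
        return outcome

    return get_token


async def run(outcomes, ticks=10):
    log = []
    task = asyncio.create_task(refresh_loop(make_provider(outcomes), 0, log))
    for _ in range(ticks):
        await asyncio.sleep(0)  # let the loop run a few iterations
    died = task.done()
    if died:
        task.exception()  # retrieve the error so asyncio does not warn
    else:
        task.cancel()
        try:
            await task
        except asyncio.CancelledError:
            pass
    return log, died


# The handled error is survived; the unhandled one kills the loop after "t1".
print(asyncio.run(run(["t1", AmqpError(), "t2"])))
print(asyncio.run(run(["t1", WorkloadCommunicationError("404"), "t2"])))
```

In the second run the provider would have returned "t2" on the next call, but nothing calls it any more, which matches the observation that the connection stayed down after memory pressure was gone.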

Additional Background

As described above, I believed this was an exception-handling issue in the SDK and filed an issue against the Device SDK, but I was advised that it should be treated as an IoT Edge issue.

Issue Analytics

  • State:closed
  • Created 4 years ago
  • Comments:8 (3 by maintainers)

Top GitHub Comments

1 reaction
arsing commented, Jan 21, 2020

IIRC at the time I looked, the version we were using had AmqpException and the one in their master was IotHubCommunicationException (or vice versa), which is why I mentioned both.

0 reactions
damonbarry commented, Jan 21, 2020

I’m looking at AmqpAuthenticationRefresher.RefreshLoopAsync, and I only see it catching IotHubCommunicationException, not AmqpException (maybe the code has changed, or I’m missing something…). Given that, we should probably not be throwing our own custom exceptions out of our plug-in to the SDK. IotHubCommunicationException, though the name doesn’t really fit in this case, is defined by the SDK, so we should probably wrap our error into that before returning to the SDK.
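
The wrapping suggested in this comment could look roughly like the following, sketched in Python rather than C#. Both exception classes are stand-ins for the .NET types named above, and sign_via_workload_api and get_token are hypothetical functions, not the real plug-in code. The idea is that the plug-in translates its own error into the type the SDK’s refresh loop already handles, keeping the original as the cause:

```python
class IotHubCommunicationError(Exception):
    """Stand-in for the SDK-defined exception the refresh loop catches."""


class WorkloadCommunicationError(Exception):
    """Stand-in for the plug-in's own custom exception."""


def sign_via_workload_api(fail):
    # Hypothetical call to the workload (sign) API; fails on demand.
    if fail:
        raise WorkloadCommunicationError(
            "Error calling SignAsync: Module not found, StatusCode:404"
        )
    return "signature"


def get_token(fail=False):
    # Translate the custom error at the plug-in boundary so the caller
    # only ever sees the exception type it knows how to handle.
    try:
        return "SharedAccessSignature sig=" + sign_via_workload_api(fail)
    except WorkloadCommunicationError as err:
        raise IotHubCommunicationError(str(err)) from err
```

Chaining with `from err` keeps the original workload API error available for logging while still presenting the SDK-defined type to the caller.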
