AMQP inter-module communications are lost by transient workload API error
See original GitHub issueWhile testing #1541, I faced transient module disconnection, which was never recovered automatically despite the ModuleClient.ConnectionStateChanged
event handler reported that Disconnect_Retrying
. The connection in edge hub side recovered after minutes, but in app module sides were not recovered. I think the AMQP SAS token refresh loop was died because workload API failure had not been handled correctly.
Expected Behavior
AMQP connection is recovered automatically after the transient error is fixed (with SAS token refresh)
Current Behavior
AMQP connection is never recovered automatically. We need to recreate whole ModuleClient
as edge hub does (it requires implementing local-buffering to prevent in-progress telemetry message lost for each modules)
Steps to Reproduce
See “Additional Info” for details.
- Run modules with AMQP connection.
- Run some module which exhausts machine memory to cause out of memory in
dockerd
. - Wait for AMQP SAS token refresh with praying OOM killer does not kill the module which exhausts memory.
Context (Environment)
Output of iotedge check
Click here
Configuration checks
--------------------
√ config.yaml is well-formed - OK
√ config.yaml has well-formed connection string - OK
√ container engine is installed and functional - OK
√ config.yaml has correct hostname - OK
√ config.yaml has correct URIs for daemon mgmt endpoint - OK
√ latest security daemon - OK
√ host time is close to real time - OK
√ container time is close to host time - OK
? DNS server - Warning
Container engine is not configured with DNS server setting, which may impact connectivity to IoT Hub.
Please see https://aka.ms/iotedge-prod-checklist-dns for best practices.
You can ignore this warning if you are setting DNS server per module in the Edge deployment.
? production readiness: certificates - Warning
Device is using self-signed, automatically generated certs.
Please see https://aka.ms/iotedge-prod-checklist-certs for best practices.
? production readiness: certificates expiry - Warning
Device CA certificate in /var/lib/iotedge/hsm/certs/device_ca_aliasazxVrrdEVxd7kvKvne1pOEyuSHF8EXSowNDhMzl30jI_.cert.pem will expire soon (2019-08-13 05:26:01 UTC)
√ production readiness: container engine - OK
? production readiness: logs policy - Warning
Container engine is not configured to rotate module logs which may cause it run out of disk space.
Please see https://aka.ms/iotedge-prod-checklist-logs for best practices.
You can ignore this warning if you are setting log policy per module in the Edge deployment.
Connectivity checks
-------------------
√ host can connect to and perform TLS handshake with IoT Hub AMQP port - OK
√ host can connect to and perform TLS handshake with IoT Hub HTTPS / WebSockets port - OK
√ host can connect to and perform TLS handshake with IoT Hub MQTT port - OK
√ container on the default network can connect to IoT Hub AMQP port - OK
√ container on the default network can connect to IoT Hub HTTPS / WebSockets port - OK
√ container on the default network can connect to IoT Hub MQTT port - OK
√ container on the IoT Edge module network can connect to IoT Hub AMQP port - OK
√ container on the IoT Edge module network can connect to IoT Hub HTTPS / WebSockets port - OK
√ container on the IoT Edge module network can connect to IoT Hub MQTT port - OK
√ Edge Hub can bind to ports on host - OK
19 check(s) succeeded.
4 check(s) raised warnings. Re-run with --verbose for more details.
Device (Host) Operating System
Architecture
amd64
Container Operating System
Debian 9
Runtime Versions
iotedged
iotedged 1.0.8 (208b2204fd30e856d00b280112422130c104b9f0)
Edge Agent
1.0.8 (1.0.8.1 was used as log header)
Edge Hub
1.0.8 (1.0.8.1 was used as log header)
Docker
Docker version 3.0.5, build ba9934d4
Logs
iotedged logs
Note that timestamps are UTC+09:00, so please read “Sep 2 01:22:45” as “Sep 1 16:22:45 (UTC)”.
Sep 2 01:22:45 hostname dockerd[3070]: 8540, 0x1, 0xd446a0)\n\t/usr/local/go/src/runtime/malloc.go:939 +0x76e fp=0x7fffab337e30 sp=0x7fffab337d90 pc=0x40cb3e\nruntime.newobject(0x818540, 0x4000)\n\t/usr/local/go/src/runtime/malloc.go:1068 +0x38 fp=0x7fffab337e60 sp=0x7fffab337e30 pc=0x40cf48\nruntime.malg(0xdfeb00008000, 0xd28630)\n\t/usr/local/go/src/runtime/proc.go:3220 +0x31 fp=0x7fffab337ea0 sp=0x7fffab337e60 pc=0x435f41\nruntime.mpreinit(...)\n\t/usr/local/go/src/runtime/os_linux.go:311\nruntime.mcommoninit(0xd212c0)\n\t/usr/local/go/src/runtime/proc.go:618 +0xc2 fp=0x7fffab337ed8 sp=0x7fffab337ea0 pc=0x42f8f2\nruntime.schedinit()\n\t/usr/local/go/src/runtime/proc.go:540 +0x74 fp=0x7fffab337f30 sp=0x7fffab337ed8 pc=0x42f584\nruntime.rt0_go(0x7fffab338098, 0xb, 0x7fffab338098, 0x400370, 0x6df5e3, 0x0, 0xb00000000, 0x7fffab338098, 0x4572e0, 0x400370, ...)\n\t/usr/local/go/src/runtime/asm_amd64.s:195 +0x11a fp=0x7fffab337f38 sp=0x7fffab337f30 pc=0x45740a\n: unknown"
Sep 2 01:22:45 hostname iotedged[1084]: 2019-09-01T16:22:45Z [INFO] - [work] - - - [2019-09-01 16:22:45.751104275 UTC] "POST /modules/%24edgeHub/genid/637003371645686672/sign?api-version=2019-01-30 HTTP/1.1" 404 Not Found 30 "-" "-" auth_id(-)
edge-agent logs
(omitted)
edge-hub logs
\u003c4\u003e 2019-09-01 16:22:45.984 +00:00 [WRN] [Microsoft.Azure.Devices.Edge.Hub.CloudProxy.CloudConnection] - Error creating cloud connection for client LongRunDevice2/gwapp
Microsoft.Azure.Devices.Edge.Util.Edged.WorkloadCommunicationException- Message:Error calling SignAsync: Module not found, StatusCode:404, at: at Microsoft.Azure.Devices.Edge.Util.Edged.Version_2019_01_30.WorkloadClient.HandleException(Exception ex, String operation) in /home/vsts/work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/edged/version_2019_01_30/WorkloadClient.cs:line 106
at Microsoft.Azure.Devices.Edge.Util.Edged.WorkloadClientVersioned.Execute[T](Func`1 func, String operation) in /home/vsts/work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/edged/WorkloadClientVersioned.cs:line 66
at Microsoft.Azure.Devices.Edge.Util.Edged.Version_2019_01_30.WorkloadClient.SignAsync(String keyId, String algorithm, String data) in /home/vsts/work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/edged/version_2019_01_30/WorkloadClient.cs:line 96
at Microsoft.Azure.Devices.Edge.Util.ClientTokenProvider.GetTokenAsync(Option`1 ttl) in /home/vsts/work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/ClientTokenProvider.cs:line 53
at Microsoft.Azure.Devices.Client.AuthenticationWithTokenRefresh.GetTokenAsync(String iotHub)
at Microsoft.Azure.Devices.Client.IotHubConnectionString.Microsoft.Azure.Amqp.ICbsTokenProvider.GetTokenAsync(Uri namespaceAddress, String appliesTo, String[] requiredClaims)
at Microsoft.Azure.Amqp.TaskHelpers.EndAsyncResult(IAsyncResult asyncResult)
at Microsoft.Azure.Amqp.IteratorAsyncResult`1.StepCallback(IAsyncResult result)
--- End of stack trace from previous location where exception was thrown ---
at Microsoft.Azure.Amqp.AsyncResult.End[TAsyncResult](IAsyncResult result)
at Microsoft.Azure.Amqp.AmqpCbsLink.\u003c\u003ec__DisplayClass4_0.\u003cSendTokenAsync\u003eb__1(IAsyncResult a)
at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)
--- End of stack trace from previous location where exception was thrown ---
at Microsoft.Azure.Devices.Client.Transport.Amqp.AmqpAuthenticationRefresher.InitLoopAsync(TimeSpan timeout)
at Microsoft.Azure.Devices.Client.Transport.Amqp.AmqpConnectionHolder.AuthenticationRefresherCreator(DeviceIdentity deviceIdentity, TimeSpan timeout)
at Microsoft.Azure.Devices.Client.Transport.Amqp.AmqpUnit.OpenAsync(TimeSpan timeout)
at Microsoft.Azure.Devices.Client.Transport.Amqp.AmqpTransportHandler.OpenAsync(CancellationToken cancellationToken)
at Microsoft.Azure.Devices.Client.Transport.ProtocolRoutingDelegatingHandler.OpenAsync(CancellationToken cancellationToken)
at Microsoft.Azure.Devices.Client.Transport.ErrorDelegatingHandler.\u003c\u003ec__DisplayClass22_0.\u003c\u003cExecuteWithErrorHandlingAsync\u003eb__0\u003ed.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at Microsoft.Azure.Devices.Client.Transport.ErrorDelegatingHandler.ExecuteWithErrorHandlingAsync[T](Func`1 asyncOperation)
at Microsoft.Azure.Devices.Client.Transport.RetryDelegatingHandler.\u003c\u003ec__DisplayClass32_0.\u003c\u003cOpenAsyncInternal\u003eb__0\u003ed.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at Microsoft.Azure.Devices.Client.Transport.RetryDelegatingHandler.EnsureOpenedAsync(CancellationToken cancellationToken)
at Microsoft.Azure.Devices.Client.InternalClient.OpenAsync()
at Microsoft.Azure.Devices.Edge.Hub.CloudProxy.ModuleClientWrapper.OpenAsync() in /home/vsts/work/1/s/edge-hub/src/Microsoft.Azure.Devices.Edge.Hub.CloudProxy/ModuleClientWrapper.cs:line 48
at Microsoft.Azure.Devices.Edge.Hub.CloudProxy.ConnectivityAwareClient.\u003c\u003ec__DisplayClass28_0.\u003c\u003cInvokeFunc\u003eb__0\u003ed.MoveNext() in /home/vsts/work/1/s/edge-hub/src/Microsoft.Azure.Devices.Edge.Hub.CloudProxy/ConnectivityAwareClient.cs:line 174
--- End of stack trace from previous location where exception was thrown ---
at Microsoft.Azure.Devices.Edge.Hub.CloudProxy.ConnectivityAwareClient.InvokeFunc[T](Func`1 func, String operation, Boolean useForConnectivityCheck) in /home/vsts/work/1/s/edge-hub/src/Microsoft.Azure.Devices.Edge.Hub.CloudProxy/ConnectivityAwareClient.cs:line 134
at Microsoft.Azure.Devices.Edge.Hub.CloudProxy.ConnectivityAwareClient.InvokeFunc[T](Func`1 func, String operation, Boolean useForConnectivityCheck) in /home/vsts/work/1/s/edge-hub/src/Microsoft.Azure.Devices.Edge.Hub.CloudProxy/ConnectivityAwareClient.cs:line 162
at Microsoft.Azure.Devices.Edge.Hub.CloudProxy.CloudConnection.ConnectToIoTHub(ITokenProvider newTokenProvider) in /home/vsts/work/1/s/edge-hub/src/Microsoft.Azure.Devices.Edge.Hub.CloudProxy/CloudConnection.cs:line 127
at Microsoft.Azure.Devices.Edge.Hub.CloudProxy.CloudConnection.CreateNewCloudProxyAsync(ITokenProvider newTokenProvider) in /home/vsts/work/1/s/edge-hub/src/Microsoft.Azure.Devices.Edge.Hub.CloudProxy/CloudConnection.cs:line 102
at Microsoft.Azure.Devices.Edge.Hub.CloudProxy.CloudConnection.Create(IIdentity identity, Action`2 connectionStatusChangedHandler, ITransportSettings[] transportSettings, IMessageConverterProvider messageConverterProvider, IClientProvider clientProvider, ICloudListener cloudListener, ITokenProvider tokenProvider, TimeSpan idleTimeout, Boolean closeOnIdleTimeout, TimeSpan operationTimeout, String productInfo) in /home/vsts/work/1/s/edge-hub/src/Microsoft.Azure.Devices.Edge.Hub.CloudProxy/CloudConnection.cs:line 91
at Microsoft.Azure.Devices.Edge.Hub.CloudProxy.CloudConnectionProvider.\u003c\u003ec__DisplayClass16_1.\u003c\u003cConnect\u003eb__2\u003ed.MoveNext() in /home/vsts/work/1/s/edge-hub/src/Microsoft.Azure.Devices.Edge.Hub.CloudProxy/CloudConnectionProvider.cs:line 139
--- End of stack trace from previous location where exception was thrown ---
at Microsoft.Azure.Devices.Edge.Hub.CloudProxy.CloudConnectionProvider.Connect(IIdentity identity, Action`2 connectionStatusChangedHandler) in /home/vsts/work/1/s/edge-hub/src/Microsoft.Azure.Devices.Edge.Hub.CloudProxy/CloudConnectionProvider.cs:line 133
Additional Information
Hypothesis
After investigating logs and IoT Edge / .NET Device SDK sources, I made following hypothesis:
- By some bugs, the edge device’s memory was exhausted (this is fact).
- Due to low memory,
dockerd
(a part of IoT Edge runtime) failed to allocate any additional memories (fact, from syslog). - AMQP SAS token for inter module communication was expired (fact, from edgehub’s log).
AmqpAuthenticationRefresher
tried to refresh SAS token viaIotHubConnectionString.Microsoft.Azure.Amqp.ICbsTokenProvider.GetTokenAsync()
- It called token acquisition logic for IoT Edge (‘Microsoft.Azure.Devices.Edge.Util.ClientTokenProvider.GetTokenAsync()`)
- The
ClientTokenProvider
callededgelet
(iotedged
) workload API (sign
API) to generate sas token. - The
edgelet
tried to authenticate caller (my IoT Edge module).- The
edgelet
called docker API indockerd
to list known module (docker container) process to get PIDs which were allowed to callsign
API. - The
dockerd
returned error because of out of memory error. - The authentication process was failed.
- The
- The
edgelet
returned404
error due to authentication failure. - The
ClientTokenprovider
threwWorkloadCommunicationException
AmqpAuthenticationRefresher
did not catchWorkloadCommunicationException
(it only catchesAmqpException
), then the token refresh loop was gone.- AMQP connection transited to
Disconnected_Retrying
state because of (transient) authentication failure, and (possibly) started to retry connection recovery. But it will never complete because SAS token refresh was no longer run. - After a while, the edge device’s low memory was resolved, the AMQP connection was not recovered because SAS token refresh was stopped.
Additional Background
As above, I thought that this is SDK’s exception handling issue, then posted Device SDK issue, but I suggested that this issue should be IoT Edge issue.
Issue Analytics
- State:
- Created 4 years ago
- Comments:8 (3 by maintainers)
IIRC at the time I looked, the version we were using had
AmqpException
and the one in their master wasIotHubCommunicationException
(or vice versa), which is why I mentioned both.I’m looking at
AmqpAuthenticationRefresher.RefreshLoopAsync
, and I only see it catchingIotHubCommunicationException
, notAmqpException
(maybe the code has changed, or I’m missing something…). Given that, we should probably not be throwing our own custom exceptions out of our plug-in to the SDK.IotHubCommunicationException
, though the name doesn’t really fit in this case, is defined by the SDK, so we should probably wrap our error into that before returning to the SDK.