Module fails to restart because of transient network error
See original GitHub issueExpected Behavior
When an IoT Edge module disconnects it should be able to reconnect.
Current Behavior
I have an IoT Edge solution that runs on a Raspberry Pi and consists of 4 IoT Edge modules. Upon deployment of the solution to the Edge Device (Pi), everything was working fine. After a couple of hours, one of the modules was being restarted by the IoT Edge runtime, and failed to start again. All the other modules just keep on working.
When I look in the logs of the failing module, I can see that an exception is being thrown with the following stacktrace:
Unhandled Exception: Microsoft.Azure.Devices.Client.Exceptions.IotHubCommunicationException: Transient network error occurred, please retry. ---> System.Net.Internals.SocketExceptionFactory+ExtendedSocketException: No such device or address
at System.Net.Dns.InternalGetHostByName(String hostName)
at System.Net.Dns.ResolveCallback(Object context)
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw(Exception source)
at System.Net.Dns.HostResolutionEndHelper(IAsyncResult asyncResult)
at System.Net.Dns.EndGetHostAddresses(IAsyncResult asyncResult)
at System.Net.Dns.<>c.<GetHostAddressesAsync>b__25_1(IAsyncResult asyncResult)
at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)
--- End of stack trace from previous location where exception was thrown ---
at Microsoft.Azure.Devices.Client.Transport.Mqtt.MqttTransportHandler.OpenAsyncInternal(CancellationToken cancellationToken)
at Microsoft.Azure.Devices.Client.Transport.Mqtt.MqttTransportHandler.OpenAsync(CancellationToken cancellationToken)
at Microsoft.Azure.Devices.Client.Transport.ProtocolRoutingDelegatingHandler.OpenAsync(CancellationToken cancellationToken)
at Microsoft.Azure.Devices.Client.Transport.ErrorDelegatingHandler.<>c__DisplayClass22_0.<<ExecuteWithErrorHandlingAsync>b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at Microsoft.Azure.Devices.Client.Transport.ErrorDelegatingHandler.ExecuteWithErrorHandlingAsync[T](Func`1 asyncOperation)
--- End of inner exception stack trace ---
at Microsoft.Azure.Devices.Client.Transport.ErrorDelegatingHandler.ExecuteWithErrorHandlingAsync[T](Func`1 asyncOperation)
at Microsoft.Azure.Devices.Client.Transport.RetryDelegatingHandler.<>c__DisplayClass32_0.<<OpenAsyncInternal>b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at Microsoft.Azure.Devices.Client.Transport.RetryDelegatingHandler.EnsureOpenedAsync(CancellationToken cancellationToken)
at Microsoft.Azure.Devices.Client.InternalClient.OpenAsync()
at DiscoveryModule.Program.Init() in /app/Program.cs:line 210
at DiscoveryModule.Program.Main(String[] args) in /app/Program.cs:line 25
at DiscoveryModule.Program.<Main>(String[] args)
This exception occurs when the OpenAsync
method on the ModuleClient
is being called to open the connection to IoT Hub.
The module is using MQTT as a communication protocol; this is how the connection is being established:
var mqttSetting = new MqttTransportSettings(TransportType.Mqtt_Tcp_Only);
ITransportSettings[] settings = { mqttSetting };
// Open a connection to the Edge runtime
ModuleClient ioTHubModuleClient = await ModuleClient.CreateFromEnvironmentAsync(settings);
await ioTHubModuleClient.OpenAsync();
var currentTwin = await ioTHubModuleClient.GetTwinAsync();
My other modules that are running on the same device (and that keep working), are also using MQTT.
I’m using Microsoft.Azure.Devices.Client
1.20.3 in all my Edge Modules (both on the one that’s failing and in the other ones that are still working).
I also tried to restart the module manually via iotedge restart <modulename>
, but this didn’t help. I assume that the problem will be solved (temporarely?) when I reboot the device, but this is not something that can be done when the device is ‘in the field’.
The module that is failing has the following ‘createOptions’:
"createOptions": {
"HostConfig": {
"NetworkMode": "host"
},
"NetworkingConfig": {
"EndpointsConfig": {
"host": {}
}
}
}
(However, this module is also running on another device and this issue doesn’t exists on the other device).
Context (Environment)
Output of iotedge check
Click here
Configuration checks
--------------------
√ config.yaml is well-formed - OK
√ config.yaml has well-formed connection string - OK
√ container engine is installed and functional - OK
√ config.yaml has correct hostname - OK
√ config.yaml has correct URIs for daemon mgmt endpoint - OK
√ latest security daemon - OK
√ host time is close to real time - OK
√ container time is close to host time - OK
‼ DNS server - Warning
Container engine is not configured with DNS server setting, which may impact connectivity to IoT Hub.
Please see https://aka.ms/iotedge-prod-checklist-dns for best practices.
You can ignore this warning if you are setting DNS server per module in the Edge deployment.
‼ production readiness: certificates - Warning
Device is using self-signed, automatically generated certs.
Please see https://aka.ms/iotedge-prod-checklist-certs for best practices.
√ production readiness: certificates expiry - OK
√ production readiness: container engine - OK
‼ production readiness: logs policy - Warning
Container engine is not configured to rotate module logs which may cause it run out of disk space.
Please see https://aka.ms/iotedge-prod-checklist-logs for best practices.
You can ignore this warning if you are setting log policy per module in the Edge deployment.
Connectivity checks
-------------------
√ host can connect to and perform TLS handshake with IoT Hub AMQP port - OK
√ host can connect to and perform TLS handshake with IoT Hub HTTPS / WebSockets port - OK
√ host can connect to and perform TLS handshake with IoT Hub MQTT port - OK
√ container on the default network can connect to IoT Hub AMQP port - OK
√ container on the default network can connect to IoT Hub HTTPS / WebSockets port - OK
√ container on the default network can connect to IoT Hub MQTT port - OK
√ container on the IoT Edge module network can connect to IoT Hub AMQP port - OK
√ container on the IoT Edge module network can connect to IoT Hub HTTPS / WebSockets port - OK
√ container on the IoT Edge module network can connect to IoT Hub MQTT port - OK
√ Edge Hub can bind to ports on host - OK
20 check(s) succeeded.
3 check(s) raised warnings. Re-run with --verbose for more details.
Device Information
- Host OS : Raspbian gnu/linux 9 (stretch)
- Architecture: ARM32
- Container OS: Linux containers
Runtime Versions
- iotedged [run
iotedge version
]: iotedge 1.0.8 (208b2204fd30e856d00b280112422130c104b9f0) - Edge Agent [image tag (e.g. 1.0.0)]: 1.0.8.3
- Edge Hub [image tag (e.g. 1.0.0)]: 1.0.8.3
- Docker/Moby [run
docker version
]: 3.0.7
Logs
edge-agent logs
<6> 2019-11-12 16:12:42.456 +00:00 [INF] - Executing command: "Command Group: (\n [Stop module DiscoveryModule]\n [Start module DiscoveryModule]\n [Saving DiscoveryModule to store]\n)"
<6> 2019-11-12 16:12:42.456 +00:00 [INF] - Executing command: "Stop module DiscoveryModule"
<6> 2019-11-12 16:12:42.466 +00:00 [INF] - Executing command: "Start module DiscoveryModule"
<6> 2019-11-12 16:12:43.026 +00:00 [INF] - Executing command: "Saving DiscoveryModule to store"
<6> 2019-11-12 16:12:43.027 +00:00 [INF] - Plan execution ended for deployment 9
<6> 2019-11-12 16:12:48.412 +00:00 [INF] - Updated reported properties
<6> 2019-11-12 16:16:58.822 +00:00 [INF] - Module 'DiscoveryModule' scheduled to restart after 05m:00s (04m:55s left).
<6> 2019-11-12 16:16:58.995 +00:00 [INF] - Updated reported properties
<6> 2019-11-12 16:17:04.077 +00:00 [INF] - Module 'DiscoveryModule' scheduled to restart after 05m:00s (04m:49s left).
<6> 2019-11-12 16:17:09.186 +00:00 [INF] - Module 'DiscoveryModule' scheduled to restart after 05m:00s (04m:44s left).
<6> 2019-11-12 16:17:14.296 +00:00 [INF] - Module 'DiscoveryModule' scheduled to restart after 05m:00s (04m:39s left).
<6> 2019-11-12 16:17:19.408 +00:00 [INF] - Module 'DiscoveryModule' scheduled to restart after 05m:00s (04m:34s left).
<6> 2019-11-12 16:17:24.515 +00:00 [INF] - Module 'DiscoveryModule' scheduled to restart after 05m:00s (04m:29s left).
<6> 2019-11-12 16:17:29.633 +00:00 [INF] - Module 'DiscoveryModule' scheduled to restart after 05m:00s (04m:24s left).
<6> 2019-11-12 16:17:34.741 +00:00 [INF] - Module 'DiscoveryModule' scheduled to restart after 05m:00s (04m:19s left).
<6> 2019-11-12 16:17:39.848 +00:00 [INF] - Module 'DiscoveryModule' scheduled to restart after 05m:00s (04m:13s left).
<6> 2019-11-12 16:17:44.959 +00:00 [INF] - Module 'DiscoveryModule' scheduled to restart after 05m:00s (04m:08s left).
edge-hub logs
<6> 2019-11-12 15:23:53.408 +00:00 [INF] - Credentials for client DG-0000000-191025-001/ActionExecutor are not valid.
<4> 2019-11-12 15:23:53.409 +00:00 [WRN] - Reauthenticating client DG-0000000-191025-001/ActionExecutor failed, removing client connection
<6> 2019-11-12 15:23:53.409 +00:00 [INF] - Unable to re-authenticate DG-0000000-191025-001/ActionExecutor, dropping client connection.
<4> 2019-11-12 15:23:53.409 +00:00 [WRN] - Closing connection for device: DG-0000000-191025-001/ActionExecutor, Microsoft.Azure.Devices.Edge.Hub.Core.EdgeHubConnectionException: Connection closed for device DG-0000000-191025-001/ActionExecutor.,
<6> 2019-11-12 15:23:53.409 +00:00 [INF] - Disposing MessagingServiceClient for device Id DG-0000000-191025-001/ActionExecutor because of exception - Microsoft.Azure.Devices.Edge.Hub.Core.EdgeHubConnectionException: Connection closed for device DG-0000000-191025-001/ActionExecutor.
<6> 2019-11-12 15:23:53.410 +00:00 [INF] - Closing device proxy for device Id DG-0000000-191025-001/ActionExecutor
<6> 2019-11-12 15:23:53.410 +00:00 [INF] - Device connection removed for device DG-0000000-191025-001/ActionExecutor
<6> 2019-11-12 15:23:53.559 +00:00 [INF] - Client DG-0000000-191025-001/ActionExecutor in device scope authenticated locally.
<6> 2019-11-12 15:23:53.560 +00:00 [INF] - Successfully generated identity for clientId DG-0000000-191025-001/ActionExecutor and username dg-0000000-191025-001/DG-0000000-191025-001/ActionExecutor/?api-version=2018-06-30&DeviceClientType=.NET%2F1.20.3%20%28.NET%20Core%204.6.28008.02%3B%20Linux%204.19.66-v7%2B%20%231253%20SMP%20Thu%20Aug%2015%2011%3A49%3A46%20BST%202019%3B%20Arm%29
<6> 2019-11-12 15:23:53.560 +00:00 [INF] - ClientAuthenticated, DG-0000000-191025-001/ActionExecutor, 3bee4180
<6> 2019-11-12 15:23:53.563 +00:00 [INF] - New device connection for device DG-0000000-191025-001/ActionExecutor
<6> 2019-11-12 15:23:53.568 +00:00 [INF] - Bind device proxy for device DG-0000000-191025-001/ActionExecutor
<6> 2019-11-12 15:23:53.568 +00:00 [INF] - Binding message channel for device Id DG-0000000-191025-001/ActionExecutor
<6> 2019-11-12 15:23:53.569 +00:00 [INF] - Processing subscriptions TwinResponse, ModuleMessages, Methods for client DG-0000000-191025-001/ActionExecutor.
<6> 2019-11-12 15:23:53.586 +00:00 [INF] - Set subscriptions from session state for DG-0000000-191025-001/ActionExecutor
<6> 2019-11-12 15:23:53.586 +00:00 [INF] - Updated reported properties for DG-0000000-191025-001/$edgeHub
<6> 2019-11-12 15:23:53.589 +00:00 [INF] - Processing subscriptions TwinResponse, ModuleMessages, Methods for client DG-0000000-191025-001/ActionExecutor.
<6> 2019-11-12 15:23:53.591 +00:00 [INF] - Set subscriptions from session state for DG-0000000-191025-001/ActionExecutor
<6> 2019-11-12 15:23:53.593 +00:00 [INF] - Processing subscriptions TwinResponse, ModuleMessages, Methods for client DG-0000000-191025-001/ActionExecutor.
<6> 2019-11-12 15:23:53.596 +00:00 [INF] - Set subscriptions from session state for DG-0000000-191025-001/ActionExecutor
<6> 2019-11-12 15:23:58.725 +00:00 [INF] - Updated reported properties for DG-0000000-191025-001/$edgeHub
<6> 2019-11-12 15:28:53.416 +00:00 [INF] - Reauthenticating connected clients
<6> 2019-11-12 15:28:53.422 +00:00 [INF] - Error authenticating token for DG-0000000-191025-001/TelemetryRetriever because the token is expired or could not be parsed
<6> 2019-11-12 15:28:53.423 +00:00 [INF] - Credentials for client DG-0000000-191025-001/TelemetryRetriever are not valid.
<4> 2019-11-12 15:28:53.423 +00:00 [WRN] - Reauthenticating client DG-0000000-191025-001/TelemetryRetriever failed, removing client connection
<6> 2019-11-12 15:28:53.424 +00:00 [INF] - Unable to re-authenticate DG-0000000-191025-001/TelemetryRetriever, dropping client connection.
<4> 2019-11-12 15:28:53.424 +00:00 [WRN] - Closing connection for device: DG-0000000-191025-001/TelemetryRetriever, Microsoft.Azure.Devices.Edge.Hub.Core.EdgeHubConnectionException: Connection closed for device DG-0000000-191025-001/TelemetryRetriever.,
<6> 2019-11-12 15:28:53.424 +00:00 [INF] - Disposing MessagingServiceClient for device Id DG-0000000-191025-001/TelemetryRetriever because of exception - Microsoft.Azure.Devices.Edge.Hub.Core.EdgeHubConnectionException: Connection closed for device DG-0000000-191025-001/TelemetryRetriever.
<6> 2019-11-12 15:28:53.431 +00:00 [INF] - Closing device proxy for device Id DG-0000000-191025-001/TelemetryRetriever
<6> 2019-11-12 15:28:53.432 +00:00 [INF] - Device connection removed for device DG-0000000-191025-001/TelemetryRetriever
<6> 2019-11-12 15:28:53.587 +00:00 [INF] - Client DG-0000000-191025-001/TelemetryRetriever in device scope authenticated locally.
<6> 2019-11-12 15:28:53.588 +00:00 [INF] - Successfully generated identity for clientId DG-0000000-191025-001/TelemetryRetriever and username dg-0000000-191025-001/DG-0000000-191025-001/TelemetryRetriever/?api-version=2018-06-30&DeviceClientType=.NET%2F1.20.3%20%28.NET%20Core%204.6.28008.02%3B%20Linux%204.19.66-v7%2B%20%231253%20SMP%20Thu%20Aug%2015%2011%3A49%3A46%20BST%202019%3B%20Arm%29
<6> 2019-11-12 15:28:53.588 +00:00 [INF] - ClientAuthenticated, DG-0000000-191025-001/TelemetryRetriever, 190cc900
<6> 2019-11-12 15:28:53.591 +00:00 [INF] - New device connection for device DG-0000000-191025-001/TelemetryRetriever
<6> 2019-11-12 15:28:53.595 +00:00 [INF] - Bind device proxy for device DG-0000000-191025-001/TelemetryRetriever
<6> 2019-11-12 15:28:53.595 +00:00 [INF] - Binding message channel for device Id DG-0000000-191025-001/TelemetryRetriever
<6> 2019-11-12 15:28:53.597 +00:00 [INF] - Processing subscriptions TwinResponse, ModuleMessages, DesiredPropertyUpdates for client DG-0000000-191025-001/TelemetryRetriever.
<6> 2019-11-12 15:28:53.599 +00:00 [INF] - Set subscriptions from session state for DG-0000000-191025-001/TelemetryRetriever
<6> 2019-11-12 15:28:53.602 +00:00 [INF] - Processing subscriptions TwinResponse, ModuleMessages, DesiredPropertyUpdates for client DG-0000000-191025-001/TelemetryRetriever.
<6> 2019-11-12 15:28:53.604 +00:00 [INF] - Set subscriptions from session state for DG-0000000-191025-001/TelemetryRetriever
<6> 2019-11-12 15:28:53.606 +00:00 [INF] - Updated reported properties for DG-0000000-191025-001/$edgeHub
<6> 2019-11-12 15:28:53.606 +00:00 [INF] - Processing subscriptions TwinResponse, ModuleMessages, DesiredPropertyUpdates for client DG-0000000-191025-001/TelemetryRetriever.
<6> 2019-11-12 15:28:53.608 +00:00 [INF] - Set subscriptions from session state for DG-0000000-191025-001/TelemetryRetriever
<6> 2019-11-12 15:28:58.808 +00:00 [INF] - Updated reported properties for DG-0000000-191025-001/$edgeHub
<6> 2019-11-12 15:33:53.410 +00:00 [INF] - Reauthenticating connected clients
<6> 2019-11-12 15:38:53.407 +00:00 [INF] - Reauthenticating connected clients
<6> 2019-11-12 15:43:53.408 +00:00 [INF] - Reauthenticating connected clients
<6> 2019-11-12 15:44:11.827 +00:00 [INF] - Starting periodic operation Get EdgeHub config...
<6> 2019-11-12 15:44:11.894 +00:00 [INF] - Obtained edge hub config from module twin
<6> 2019-11-12 15:44:11.895 +00:00 [INF] - Successfully completed periodic operation Get EdgeHub config
<6> 2019-11-12 15:44:12.038 +00:00 [INF] - Updated reported properties for DG-0000000-191025-001/$edgeHub
<6> 2019-11-12 15:48:53.407 +00:00 [INF] - Reauthenticating connected clients
<6> 2019-11-12 15:49:54.537 +00:00 [INF] - Started task to cleanup processed and stale messages for endpoint iothub
<6> 2019-11-12 15:49:54.539 +00:00 [INF] - Cleaned up 0 messages from queue for endpoint iothub and 0 messages from message store.
<6> 2019-11-12 15:50:24.540 +00:00 [INF] - Started task to cleanup processed and stale messages for endpoint DG-0000000-191025-001/ActionExecutor/input1
<6> 2019-11-12 15:50:24.542 +00:00 [INF] - Cleaned up 0 messages from queue for endpoint DG-0000000-191025-001/ActionExecutor/input1 and 0 messages from message store.
<6> 2019-11-12 15:50:54.539 +00:00 [INF] - Started task to cleanup processed and stale messages for endpoint DG-0000000-191025-001/TelemetryRetriever/input1
<6> 2019-11-12 15:50:54.541 +00:00 [INF] - Cleaned up 0 messages from queue for endpoint DG-0000000-191025-001/TelemetryRetriever/input1 and 0 messages from message store.
<6> 2019-11-12 15:52:43.928 +00:00 [INF] - Starting refresh of device scope identities cache
<6> 2019-11-12 15:53:53.407 +00:00 [INF] - Reauthenticating connected clients
<6> 2019-11-12 15:58:53.406 +00:00 [INF] - Reauthenticating connected clients
<6> 2019-11-12 16:03:53.407 +00:00 [INF] - Reauthenticating connected clients
<6> 2019-11-12 16:08:53.406 +00:00 [INF] - Reauthenticating connected clients
<6> 2019-11-12 16:13:53.407 +00:00 [INF] - Reauthenticating connected clients
<6> 2019-11-12 16:18:53.407 +00:00 [INF] - Reauthenticating connected clients
Issue Analytics
- State:
- Created 4 years ago
- Comments:6 (4 by maintainers)
I think the change to the /etc/hosts file fixed your issue because the connection failed from the module to edgeHub. The change to daemon.json to add the DNS is to fix issues from edgeHub to the cloud. Because your module is running on the host network it needs to be able to resolve the GatewayHostname, you can check which value it has in the Env for IOTEDGE_GATEWAYHOSTNAME using:
docker inspect <moduleName>
To get daemon logs:
sudo journalctl -u iotedge -f
We will evaluate if iotedge check should validate /etc/hosts and /etc/hostname
I will close this issue, please reopen if needed.