Thread leakage and unhandled exception due to constant disconnects
Context
- OS and version used: Ubuntu 22.04, Docker (python:3.10-slim-buster)
- Python version: CPython 3.10.5
- pip version: 22.2.2
- list of installed packages:
Package Version
--------------------------- ---------
aiofiles 0.8.0
aiohttp 3.8.1
aioprocessing 2.0.0
aiosignal 1.2.0
anyio 3.6.1
asgiref 3.5.2
async-timeout 4.0.2
asyncinotify 2.0.2
attrs 22.1.0
azure-core 1.25.0
azure-iot-device 2.11.0
azure-storage-blob 12.12.0
azure-storage-file-datalake 12.7.0
bcrypt 3.1.7
certifi 2022.6.15
cffi 1.15.1
charset-normalizer 2.1.1
click 8.1.3
croniter 1.0.1
cryptography 37.0.4
debugpy 1.6.0
deprecation 2.1.0
elastic-apm 6.9.1
fastapi 0.75.0
frozenlist 1.3.1
h11 0.13.0
idna 3.3
isodate 0.6.1
janus 1.0.0
jsonschema 3.2.0
msrest 0.7.1
multidict 6.0.2
natsort 8.1.0
numpy 1.23.2
oauthlib 3.2.0
packaging 21.3
paho-mqtt 1.6.1
pandas 1.4.2
pip 22.2.2
pyarrow 7.0.0
pycparser 2.21
pydantic 1.9.2
pyparsing 3.0.9
pyrsistent 0.18.1
PySocks 1.7.1
python-dateutil 2.8.2
pytz 2022.2.1
requests 2.28.1
requests-oauthlib 1.3.1
requests-unixsocket 0.3.0
setuptools 65.2.0
six 1.16.0
sniffio 1.2.0
sortedcontainers 2.3.0
starlette 0.17.1
typing_extensions 4.3.0
urllib3 1.26.11
uvicorn 0.17.5
uvloop 0.16.0
wheel 0.37.1
yarl 1.8.1
- cloned repo: N/A
Description of the issue
Since 18-08-2022, the IoT Hub started intermittently returning 500 (Internal Server Error) when using the file upload functionality with file notifications. After restarting the Python applications over the weekend, the HTTP server (uvicorn+uvloop) stopped responding entirely after running for another day.
After checking the logs, I noticed two things:
- The IoT library kept disconnecting, and it appears to create a new thread on every disconnect.
- An unhandled exception in a separate thread stopped the Python application.
I'm unsure how the IoT connection got into this bad state, and I couldn't replicate the issue with sample code within a reasonable amount of time. I've included the relevant logs, which hopefully are enough to diagnose and mitigate the issues.
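Neither symptom is visible from the application itself without extra instrumentation. As a minimal sketch (standard library only; the 60-second interval and the logging choices are assumptions, not part of the original application), something like the following makes both the thread growth and a background-thread crash observable before the process stops responding:

import logging
import threading


def install_thread_diagnostics(interval_s: float = 60) -> None:
    """Sketch: log unhandled thread exceptions and the live thread count."""

    def log_thread_exception(args):
        # Called for any exception that escapes a non-main thread (Python 3.8+),
        # instead of only printing "Exception in thread ..." to stderr.
        logging.error(
            "Unhandled exception in thread %s",
            args.thread.name if args.thread else "<unknown>",
            exc_info=(args.exc_type, args.exc_value, args.exc_traceback),
        )

    threading.excepthook = log_thread_exception

    def report_thread_count():
        # Periodically log how many threads are alive to spot leakage
        # (the tracebacks below show thread names like Thread-3995/3996).
        logging.info("live threads: %d", threading.active_count())
        timer = threading.Timer(interval_s, report_thread_count)
        timer.daemon = True
        timer.start()

    report_thread_count()

Calling install_thread_diagnostics() once at startup would be enough; it does not change the behaviour of the repro below.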
Code sample exhibiting the issue
I've included sample code that mimics the behaviour of the Python application, except for the HTTP server.
Dockerfile
FROM ubuntu:22.04 AS builder
RUN apt-get update && apt-get install -y --no-install-recommends build-essential curl ca-certificates
RUN apt-get install -y --no-install-recommends python3.10 python3.10-dev python3.10-distutils
RUN curl -sS https://bootstrap.pypa.io/get-pip.py | python3.10
WORKDIR /app
COPY ./requirements.txt /app
RUN pip3 install --user --no-cache-dir -r requirements.txt
FROM python:3.10-slim-buster
WORKDIR /app
COPY --from=builder /root/.local /root/.local
COPY ./src /app
EXPOSE 5000
# Make sure scripts in .local are usable
ENV PATH=/root/.local/bin:$PATH
CMD python3 -m app
requirements.txt
azure-iot-device==2.11.0
azure-storage-blob==12.12.0
azure-storage-file-datalake==12.7.0
bcrypt==3.1.7
fastapi==0.75.0
jsonschema==3.2.0
pandas==1.4.2
pyarrow==7.0.0
croniter==1.0.1
elastic-apm==6.9.1
sortedcontainers==2.3.0
uvicorn==0.17.5
aiohttp==3.8.1
aiofiles==0.8.0
aioprocessing==2.0.0
asyncinotify==2.0.2
uvloop==0.16.0
debugpy==1.6.0
src/__main__.py
# this file mimics the overall behaviour of the python application
# * it listens for device twin, then patches it
# * it sends 2 iothub messages every minute
# * it uploads (approximately) 12 files every 5 minutes
# * it doesn't include the http server
import logging
import os
import asyncio
from random import random
from time import time

from azure.core.exceptions import ResourceExistsError
from azure.iot.device.aio import IoTHubDeviceClient
from azure.iot.device import Message
from azure.storage.blob.aio import BlobClient

CONNECTION_STRING = os.getenv('CONNECTION_STRING')

logging.root.setLevel(logging.DEBUG)


async def process_device_twin(device_twin):
    await asyncio.sleep(random())


async def process_device_twin_path(patch):
    await asyncio.sleep(random())
    return {'foo': 'bar'}


iot_client: IoTHubDeviceClient = None


async def create_iothub_device():
    global iot_client
    iot_client = \
        IoTHubDeviceClient.create_from_connection_string(CONNECTION_STRING, websockets=True)
    await iot_client.connect()


async def create_iothub_device_twin_listener():
    device_twin = await iot_client.get_twin()
    await process_device_twin(device_twin['desired'])

    async def on_device_twin(desired_device_twin):
        reported = await process_device_twin_path(desired_device_twin)
        await iot_client.patch_twin_reported_properties(reported)

    iot_client.on_twin_desired_properties_patch_received = on_device_twin


async def send_messages():
    async def worker():
        while True:
            await asyncio.sleep(10 + (random() - 0.5))
            msg = Message('{"foo":"bar"}')
            msg.content_type = 'application/json'
            msg.content_encoding = 'utf-8'
            await iot_client.send_message(msg)

    workers = [worker() for _ in range(2)]
    await asyncio.wait(workers, return_when=asyncio.FIRST_EXCEPTION)


async def _upload_file(file):
    data = bytearray(os.urandom(8192))
    storage_info = await iot_client.get_storage_info_for_blob(file)
    url = "https://{}/{}/{}{}".format(
        storage_info["hostName"],
        storage_info["containerName"],
        storage_info["blobName"],
        storage_info["sasToken"]
    )
    correlation_id = storage_info["correlationId"]
    status_code = 200
    status_description = f'OK: {file}'
    try:
        async with BlobClient.from_blob_url(url) as blob_client:
            try:
                await blob_client.upload_blob(data, overwrite=False, metadata={'file_type': 'skip_file'})
            except ResourceExistsError:
                pass
    except Exception as ex:
        status_code = ex.status_code if hasattr(ex, 'status_code') else 500
        status_description = str(ex)
    finally:
        await iot_client.notify_blob_upload_status(correlation_id, status_code == 200, status_code, status_description)


async def upload_files():
    next_sleep_t = time()
    while True:
        # wake up every 5 minutes
        await asyncio.sleep(next_sleep_t - time())
        next_sleep_t += 300
        for i in range(12):
            retries = 0
            while True:
                try:
                    await _upload_file(f'testing/{int(next_sleep_t)}-{i}.data')
                    break
                except Exception:
                    await asyncio.sleep(random() * (1 + 2 ** retries))
                    retries += 1
            await asyncio.sleep(1 + (random() - 0.5))


async def main():
    await create_iothub_device()
    await create_iothub_device_twin_listener()
    iothub_messages = asyncio.ensure_future(send_messages())
    iothub_upload = asyncio.ensure_future(upload_files())
    await asyncio.wait([iothub_messages, iothub_upload], return_when=asyncio.FIRST_EXCEPTION)


asyncio.run(main())
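The repro has no fallback if the client never manages to reconnect. A minimal watchdog coroutine along the following lines could sit next to send_messages() and upload_files(); it relies only on the client's connected property, and the 15-minute threshold and the choice to fail the whole program are illustrative assumptions, not part of the original code:

async def connection_watchdog(max_disconnected_s: float = 900):
    # Sketch: give up if the client stays disconnected for too long, so the
    # container can restart the application instead of looping forever.
    disconnected_since = None
    while True:
        await asyncio.sleep(30)
        if iot_client.connected:
            disconnected_since = None
        elif disconnected_since is None:
            disconnected_since = time()
        elif time() - disconnected_since > max_disconnected_s:
            raise RuntimeError(
                f"IoT Hub client disconnected for more than {max_disconnected_s}s"
            )

Wrapped in asyncio.ensure_future() and added to the asyncio.wait(...) call in main(), the RuntimeError would end the program via return_when=asyncio.FIRST_EXCEPTION and let the container restart it.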
Console log of the issue
2022-08-20 12:47:58 [7] [ERROR] [azure.iot.device.common.handle_exceptions] Exception caught in background thread. Unable to handle.
2022-08-20 12:47:58 [7] [ERROR] [azure.iot.device.common.handle_exceptions] ['azure.iot.device.common.transport_exceptions.ConnectionDroppedError: Unexpected disconnection\n']
2022-08-20 12:48:58 [7] [ERROR] [azure.iot.device.common.handle_exceptions] Exception caught in background thread. Unable to handle.
2022-08-20 12:48:58 [7] [ERROR] [azure.iot.device.common.handle_exceptions] ['azure.iot.device.common.pipeline.pipeline_exceptions.OperationTimeout: Transport timeout on connection operation\n']
2022-08-20 12:48:58 [7] [WARNING] [azure.iot.device.common.pipeline.pipeline_stages_base] ReconnectStage: DisconnectEvent received while in unexpected state - CONNECTING, Connected: False
2022-08-20 12:50:58 [7] [WARNING] [azure.iot.device.common.pipeline.pipeline_stages_base] ReconnectStage: DisconnectEvent received while in unexpected state - CONNECTING, Connected: False
2022-08-20 12:53:58 [7] [WARNING] [azure.iot.device.common.pipeline.pipeline_stages_base] ReconnectStage: DisconnectEvent received while in unexpected state - CONNECTING, Connected: False
2022-08-20 12:54:58 [7] [ERROR] [azure.iot.device.common.handle_exceptions] Exception caught in background thread. Unable to handle.
2022-08-20 12:54:58 [7] [ERROR] [azure.iot.device.common.handle_exceptions] ['azure.iot.device.common.pipeline.pipeline_exceptions.OperationTimeout: Transport timeout on connection operation\n']
...
...
...
2022-08-21 00:00:28 [7] [WARNING] [azure.iot.device.common.pipeline.pipeline_stages_base] ReconnectStage: DisconnectEvent received while in unexpected state - CONNECTING, Connected: False
2022-08-21 00:01:31 [7] [WARNING] [azure.iot.device.common.pipeline.pipeline_stages_base] ReconnectStage: DisconnectEvent received while in unexpected state - CONNECTING, Connected: False
2022-08-21 00:02:38 [7] [ERROR] [azure.iot.device.common.handle_exceptions] Exception caught in background thread. Unable to handle.
2022-08-21 00:02:38 [7] [ERROR] [azure.iot.device.common.handle_exceptions] ['azure.iot.device.common.pipeline.pipeline_exceptions.OperationTimeout: Transport timeout on connection operation\n']
2022-08-21 00:02:38 [7] [WARNING] [azure.iot.device.common.pipeline.pipeline_stages_base] ReconnectStage: DisconnectEvent received while in unexpected state - CONNECTING, Connected: False
2022-08-21 00:04:38 [7] [WARNING] [azure.iot.device.common.pipeline.pipeline_stages_base] ReconnectStage: DisconnectEvent received while in unexpected state - CONNECTING, Connected: False
2022-08-21 00:05:38 [7] [WARNING] [azure.iot.device.common.pipeline.pipeline_stages_base] ReconnectStage: DisconnectEvent received while in unexpected state - CONNECTING, Connected: False
2022-08-21 00:06:48 [7] [ERROR] [azure.iot.device.common.handle_exceptions] Exception caught in background thread. Unable to handle.
2022-08-21 00:06:48 [7] [ERROR] [azure.iot.device.common.handle_exceptions] ['azure.iot.device.common.pipeline.pipeline_exceptions.OperationTimeout: Transport timeout on connection operation\n']
2022-08-21 00:06:48 [7] [WARNING] [azure.iot.device.common.pipeline.pipeline_stages_base] ReconnectStage: DisconnectEvent received while in unexpected state - CONNECTING, Connected: False
2022-08-21 00:08:48 [7] [WARNING] [azure.iot.device.common.pipeline.pipeline_stages_base] ReconnectStage: DisconnectEvent received while in unexpected state - CONNECTING, Connected: False
2022-08-21 00:09:48 [7] [WARNING] [azure.iot.device.common.pipeline.pipeline_stages_base] ReconnectStage: DisconnectEvent received while in unexpected state - CONNECTING, Connected: False
2022-08-21 00:10:56 [7] [WARNING] [azure.iot.device.common.pipeline.pipeline_stages_base] ReconnectStage: DisconnectEvent received while in unexpected state - CONNECTING, Connected: False
2022-08-21 00:12:29 [7] [WARNING] [azure.iot.device.common.pipeline.pipeline_stages_base] ReconnectStage: DisconnectEvent received while in unexpected state - CONNECTING, Connected: False
2022-08-21 00:12:29 [7] [ERROR] [azure.iot.device.common.handle_exceptions] Exception caught in background thread. Unable to handle.
2022-08-21 00:12:29 [7] [ERROR] [azure.iot.device.common.handle_exceptions] ['azure.iot.device.common.pipeline.pipeline_exceptions.OperationTimeout: Transport timeout on connection operation\n']
2022-08-21 00:13:36 [7] [ERROR] [azure.iot.device.common.handle_exceptions] Exception caught in background thread. Unable to handle.
2022-08-21 00:13:36 [7] [ERROR] [azure.iot.device.common.handle_exceptions] ['azure.iot.device.common.transport_exceptions.ConnectionDroppedError: transport disconnected\n']
2022-08-21 00:14:36 [7] [ERROR] [azure.iot.device.common.handle_exceptions] Exception caught in background thread. Unable to handle.
2022-08-21 00:14:36 [7] [ERROR] [azure.iot.device.common.handle_exceptions] ['azure.iot.device.common.transport_exceptions.ConnectionDroppedError: transport disconnected\n']
2022-08-21 00:16:36 [7] [ERROR] [azure.iot.device.common.handle_exceptions] Exception caught in background thread. Unable to handle.
2022-08-21 00:16:36 [7] [ERROR] [azure.iot.device.common.handle_exceptions] ['azure.iot.device.common.transport_exceptions.ConnectionDroppedError: transport disconnected\n']
2022-08-21 00:17:36 [7] [WARNING] [azure.iot.device.common.pipeline.pipeline_stages_base] ReconnectStage: DisconnectEvent received while in unexpected state - CONNECTING, Connected: False
Exception in thread Thread-3996 (_thread_main):
Traceback (most recent call last):
File "/usr/local/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/usr/local/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/root/.local/lib/python3.10/site-packages/paho/mqtt/client.py", line 3591, in _thread_main
self.loop_forever(retry_first_connection=True)
File "/root/.local/lib/python3.10/site-packages/paho/mqtt/client.py", line 1756, in loop_forever
rc = self._loop(timeout)
File "/root/.local/lib/python3.10/site-packages/paho/mqtt/client.py", line 1181, in _loop
rc = self.loop_write()
File "/root/.local/lib/python3.10/site-packages/paho/mqtt/client.py", line 1577, in loop_write
rc = self._packet_write()
File "/root/.local/lib/python3.10/site-packages/paho/mqtt/client.py", line 2464, in _packet_write
write_length = self._sock_send(
File "/root/.local/lib/python3.10/site-packages/paho/mqtt/client.py", line 649, in _sock_send
return self._sock.send(buf)
File "/root/.local/lib/python3.10/site-packages/paho/mqtt/client.py", line 3968, in send
return self._send_impl(data)
File "/root/.local/lib/python3.10/site-packages/paho/mqtt/client.py", line 3950, in _send_impl
length = self._socket.send(self._sendbuffer)
File "/usr/local/lib/python3.10/ssl.py", line 1208, in send
return super().send(data, flags)
OSError: [Errno 9] Bad file descriptor
Exception in thread Thread-3995:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/usr/local/lib/python3.10/threading.py", line 1378, in run
self.function(*self.args, **self.kwargs)
File "/root/.local/lib/python3.10/site-packages/azure/iot/device/common/pipeline/pipeline_thread.py", line 132, in wrapper
return future.result()
File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 446, in result
return self.__get_result()
File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 391, in __get_result
raise self._exception
File "/usr/local/lib/python3.10/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/root/.local/lib/python3.10/site-packages/azure/iot/device/common/pipeline/pipeline_thread.py", line 109, in thread_proc
return func(*args, **kwargs)
File "/root/.local/lib/python3.10/site-packages/azure/iot/device/common/pipeline/pipeline_stages_mqtt.py", line 80, in watchdog_function
this.transport.disconnect()
File "/root/.local/lib/python3.10/site-packages/azure/iot/device/common/mqtt_transport.py", line 449, in disconnect
raise err
azure.iot.device.common.transport_exceptions.NoConnectionError: The client is not currently connected.
Top GitHub Comments
@simebg
I would say that there isn’t exactly a way to map errors to being fatal or non-fatal - it’s more contextual. There are errors which are usually recoverable that in certain contexts indicate some kind of more serious failure has occurred - generally when you see the same error over and over again, like you see here. At this point in time the background exception handler is mostly just for reporting.
I am somewhat surprised that this new version completely addressed your issue, but I’m glad to hear it has been resolved (at least for now). If the problem re-emerges, please don’t hesitate to send those logs on over in a new issue.
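To turn the "same error over and over again" heuristic into safety code, a small guard like the one below could be fed from the retry loops around the send/upload calls. This is a hypothetical helper, not part of azure-iot-device, and the threshold of 5 is an arbitrary choice:

class RepeatedErrorGuard:
    """Sketch: escalate when the same exception type keeps recurring."""

    def __init__(self, max_repeats: int = 5):
        self.max_repeats = max_repeats
        self._last_type = None
        self._count = 0

    def record(self, exc: Exception) -> None:
        if type(exc) is self._last_type:
            self._count += 1
        else:
            self._last_type = type(exc)
            self._count = 1
        if self._count >= self.max_repeats:
            # Treat the repeated failure as fatal: re-raise so the caller
            # stops retrying instead of masking a dead connection.
            raise exc

    def reset(self) -> None:
        self._last_type = None
        self._count = 0

For example, guard.record(ex) could be called in the except branch of the upload retry loop in the repro above, with guard.reset() after each successful upload.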
@cartertinney
With the new version (2.12.0), the upload functionality appears to be much more stable. Apart from the occasional timeout exceptions, I haven't seen any other issues in the monitoring tools. I disabled the restart functionality on the background thread exception handler.
I do have a question regarding this though. Is there a list of fatal exceptions in the documentation? In general, I see that the library recovers by itself, however I would like to add some safety code if possible.
Regarding the debugging level, I’ll keep the sample code running and monitor it occasionally in case I see the issues pop up again.
In summary, I think it’s safe to close this issue. If I see these issues pop up again, I’ll open a new issue and send logs with debug level enabled.
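For reference, the "restart functionality on the background thread exception handler" mentioned above could look roughly like the sketch below. It assumes the installed 2.x client exposes the on_background_exception handler property (verify against your version); the error threshold and the hard exit are illustrative choices, not an SDK recommendation:

import logging
import os

_background_errors = 0


def _on_background_exception(exc):
    # Sketch: count background exceptions reported by the SDK and force a
    # restart (via the container orchestrator) once they keep piling up.
    global _background_errors
    _background_errors += 1
    logging.error("background exception #%d: %r", _background_errors, exc)
    if _background_errors >= 10:
        os._exit(1)


iot_client.on_background_exception = _on_background_exception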