
Thread leakage and unhandled exception due to constant disconnects


Context

  • OS and version used: Ubuntu 22.04, Docker (python:3.10-slim-buster)
  • Python version: CPython 3.10.5
  • pip version: 22.2.2
  • list of installed packages:
Package                     Version
--------------------------- ---------
aiofiles                    0.8.0
aiohttp                     3.8.1
aioprocessing               2.0.0
aiosignal                   1.2.0
anyio                       3.6.1
asgiref                     3.5.2
async-timeout               4.0.2
asyncinotify                2.0.2
attrs                       22.1.0
azure-core                  1.25.0
azure-iot-device            2.11.0
azure-storage-blob          12.12.0
azure-storage-file-datalake 12.7.0
bcrypt                      3.1.7
certifi                     2022.6.15
cffi                        1.15.1
charset-normalizer          2.1.1
click                       8.1.3
croniter                    1.0.1
cryptography                37.0.4
debugpy                     1.6.0
deprecation                 2.1.0
elastic-apm                 6.9.1
fastapi                     0.75.0
frozenlist                  1.3.1
h11                         0.13.0
idna                        3.3
isodate                     0.6.1
janus                       1.0.0
jsonschema                  3.2.0
msrest                      0.7.1
multidict                   6.0.2
natsort                     8.1.0
numpy                       1.23.2
oauthlib                    3.2.0
packaging                   21.3
paho-mqtt                   1.6.1
pandas                      1.4.2
pip                         22.2.2
pyarrow                     7.0.0
pycparser                   2.21
pydantic                    1.9.2
pyparsing                   3.0.9
pyrsistent                  0.18.1
PySocks                     1.7.1
python-dateutil             2.8.2
pytz                        2022.2.1
requests                    2.28.1
requests-oauthlib           1.3.1
requests-unixsocket         0.3.0
setuptools                  65.2.0
six                         1.16.0
sniffio                     1.2.0
sortedcontainers            2.3.0
starlette                   0.17.1
typing_extensions           4.3.0
urllib3                     1.26.11
uvicorn                     0.17.5
uvloop                      0.16.0
wheel                       0.37.1
yarl                        1.8.1
  • cloned repo: N/A

Description of the issue

Since 18-08-2022, IoT Hub has been intermittently returning 500 (Internal Server Error) when using the file upload functionality with file notifications. After restarting the Python applications over the weekend, the HTTP server (uvicorn+uvloop) stopped responding entirely after running for another day.

After checking the logs, I noticed two things:

  1. The IoT library kept disconnecting, and each disconnect seems to create a new thread.
  2. An unhandled error in a separate thread stopped the Python application.

I’m unsure how the IoT connection got into this bad state, and I couldn’t replicate the issue with sample code within a reasonable amount of time. I’ve included the relevant logs, which hopefully provide enough information to diagnose the issues.
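
For what it’s worth, a leak like this can be confirmed without waiting for the crash. Below is a minimal, hypothetical sketch (pure stdlib; monitor_threads is not part of the SDK) that can run alongside the client; a thread count dominated by a growing number of paho “Thread-NNNN (_thread_main)” entries points at leaked transport threads.

import asyncio
import logging
import threading

async def monitor_threads(interval: float = 60.0):
    # Periodically log the live thread count; a number that only ever grows,
    # mostly paho-mqtt "_thread_main" threads, confirms the leak.
    while True:
        names = [t.name for t in threading.enumerate()]
        logging.warning("live threads: %d (paho: %d)", len(names),
                        sum("_thread_main" in n for n in names))
        await asyncio.sleep(interval)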

Code sample exhibiting the issue

I’ve included sample code that mimics the behaviour of the Python application, with the exception of the HTTP server.

Dockerfile

FROM ubuntu:22.04 AS builder

RUN apt-get update && apt-get install -y --no-install-recommends build-essential curl ca-certificates
RUN apt-get install -y --no-install-recommends python3.10 python3.10-dev python3.10-distutils
RUN curl -sS https://bootstrap.pypa.io/get-pip.py | python3.10

WORKDIR /app

COPY ./requirements.txt /app
RUN pip3 install --user --no-cache-dir -r requirements.txt

FROM python:3.10-slim-buster

WORKDIR /app
COPY --from=builder /root/.local /root/.local

COPY ./src /app
EXPOSE 5000

# Make sure scripts in .local are usable
ENV PATH=/root/.local/bin:$PATH

CMD python3 -m app
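
For reference, a hypothetical build-and-run invocation (the image tag is arbitrary, and the connection string is a placeholder for a real device connection string):

docker build -t iothub-repro .
docker run --rm -e CONNECTION_STRING='HostName=...;DeviceId=...;SharedAccessKey=...' iothub-repro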

requirements.txt

azure-iot-device==2.11.0
azure-storage-blob==12.12.0
azure-storage-file-datalake==12.7.0
bcrypt==3.1.7
fastapi==0.75.0
jsonschema==3.2.0
pandas==1.4.2
pyarrow==7.0.0
croniter==1.0.1
elastic-apm==6.9.1
sortedcontainers==2.3.0
uvicorn==0.17.5
aiohttp==3.8.1
aiofiles==0.8.0
aioprocessing==2.0.0
asyncinotify==2.0.2
uvloop==0.16.0
debugpy==1.6.0

src/__main__.py

# this file mimics the overall behaviour of the python application
#   * it listens for the device twin, then patches it
#   * it sends 2 IoT Hub messages every minute
#   * it uploads (approximately) 12 files every 5 minutes
#   * it doesn't include the HTTP server

import logging
import os
import asyncio
from random import random
from time import time

from azure.core.exceptions import ResourceExistsError
from azure.iot.device.aio import IoTHubDeviceClient
from azure.iot.device import Message
from azure.storage.blob.aio import BlobClient

CONNECTION_STRING = os.getenv('CONNECTION_STRING')
# basicConfig attaches a handler; setting only the root logger's level
# would leave DEBUG records with no handler to emit them
logging.basicConfig(level=logging.DEBUG)

async def process_device_twin(device_twin):
    await asyncio.sleep(random())

async def process_device_twin_path(patch):
    await asyncio.sleep(random())
    return {'foo':'bar'}

iot_client: IoTHubDeviceClient | None = None

async def create_iothub_device():
    global iot_client

    iot_client = \
        IoTHubDeviceClient.create_from_connection_string(CONNECTION_STRING, websockets=True)

    await iot_client.connect()

async def create_iothub_device_twin_listener():
    device_twin = await iot_client.get_twin()
    await process_device_twin(device_twin['desired'])

    async def on_device_twin(desired_device_twin):
        reported = await process_device_twin_path(desired_device_twin)
        await iot_client.patch_twin_reported_properties(reported)
    
    iot_client.on_twin_desired_properties_patch_received = on_device_twin


async def send_messages():
    async def worker():
        while True:
            await asyncio.sleep(10 + (random() - 0.5))

            msg = Message('{"foo":"bar"}')
            msg.content_type = 'application/json'
            msg.content_encoding = 'utf-8'

            await iot_client.send_message(msg)

    # wrap the coroutines in Tasks; passing bare coroutines to asyncio.wait()
    # is deprecated since Python 3.8
    workers = [asyncio.create_task(worker()) for _ in range(2)]
    await asyncio.wait(workers, return_when=asyncio.FIRST_EXCEPTION)


async def _upload_file(file):
    data = bytearray(os.urandom(8192))
    storage_info = await iot_client.get_storage_info_for_blob(file)

    url = "https://{}/{}/{}{}".format(
        storage_info["hostName"],
        storage_info["containerName"],
        storage_info["blobName"],
        storage_info["sasToken"]
    )

    correlation_id = storage_info["correlationId"]
    status_code = 200
    status_description = f'OK: {file}'

    try:
        async with BlobClient.from_blob_url(url) as blob_client:
            try:
                await blob_client.upload_blob(data, overwrite=False, metadata={'file_type':'skip_file'})
            except ResourceExistsError:
                pass
    except Exception as ex:
        status_code = ex.status_code if hasattr(ex, 'status_code') else 500
        status_description = str(ex)
    finally:
        await iot_client.notify_blob_upload_status(correlation_id, status_code==200, status_code, status_description)

async def upload_files():
    next_sleep_t = time()

    while True:
        # wait until the next 5-minute tick
        await asyncio.sleep(next_sleep_t - time())
        next_sleep_t += 300

        for i in range(12):
            retries = 0

            while True:
                try:
                    await _upload_file(f'testing/{int(next_sleep_t)}-{i}.data')
                    break
                except Exception:  # a bare except would also swallow asyncio.CancelledError
                    await asyncio.sleep(random() * (1 + 2 ** retries))
                    retries += 1

            await asyncio.sleep(1 + (random() - 0.5))


async def main():
    await create_iothub_device()
    await create_iothub_device_twin_listener()

    iothub_messages = asyncio.ensure_future(send_messages())
    iothub_upload = asyncio.ensure_future(upload_files())

    await asyncio.wait([iothub_messages, iothub_upload], return_when=asyncio.FIRST_EXCEPTION)

asyncio.run(main())
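
One caveat on the sample: the client is never disconnected or shut down, even when a task fails. A minimal sketch of a cleaner teardown, assuming the shutdown() coroutine available on the 2.x async clients (the try/finally placement is illustrative, not the SDK's prescribed pattern):

async def main():
    await create_iothub_device()
    try:
        await create_iothub_device_twin_listener()
        tasks = [asyncio.create_task(send_messages()),
                 asyncio.create_task(upload_files())]
        await asyncio.wait(tasks, return_when=asyncio.FIRST_EXCEPTION)
    finally:
        # shutdown() disconnects and releases the client's background threads
        await iot_client.shutdown()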

Console log of the issue

2022-08-20 12:47:58 [7] [ERROR] [azure.iot.device.common.handle_exceptions] Exception caught in background thread.  Unable to handle.
2022-08-20 12:47:58 [7] [ERROR] [azure.iot.device.common.handle_exceptions] ['azure.iot.device.common.transport_exceptions.ConnectionDroppedError: Unexpected disconnection\n']
2022-08-20 12:48:58 [7] [ERROR] [azure.iot.device.common.handle_exceptions] Exception caught in background thread.  Unable to handle.
2022-08-20 12:48:58 [7] [ERROR] [azure.iot.device.common.handle_exceptions] ['azure.iot.device.common.pipeline.pipeline_exceptions.OperationTimeout: Transport timeout on connection operation\n']
2022-08-20 12:48:58 [7] [WARNING] [azure.iot.device.common.pipeline.pipeline_stages_base] ReconnectStage: DisconnectEvent received while in unexpected state - CONNECTING, Connected: False
2022-08-20 12:50:58 [7] [WARNING] [azure.iot.device.common.pipeline.pipeline_stages_base] ReconnectStage: DisconnectEvent received while in unexpected state - CONNECTING, Connected: False
2022-08-20 12:53:58 [7] [WARNING] [azure.iot.device.common.pipeline.pipeline_stages_base] ReconnectStage: DisconnectEvent received while in unexpected state - CONNECTING, Connected: False
2022-08-20 12:54:58 [7] [ERROR] [azure.iot.device.common.handle_exceptions] Exception caught in background thread.  Unable to handle.
2022-08-20 12:54:58 [7] [ERROR] [azure.iot.device.common.handle_exceptions] ['azure.iot.device.common.pipeline.pipeline_exceptions.OperationTimeout: Transport timeout on connection operation\n']
...
...
...
2022-08-21 00:00:28 [7] [WARNING] [azure.iot.device.common.pipeline.pipeline_stages_base] ReconnectStage: DisconnectEvent received while in unexpected state - CONNECTING, Connected: False
2022-08-21 00:01:31 [7] [WARNING] [azure.iot.device.common.pipeline.pipeline_stages_base] ReconnectStage: DisconnectEvent received while in unexpected state - CONNECTING, Connected: False
2022-08-21 00:02:38 [7] [ERROR] [azure.iot.device.common.handle_exceptions] Exception caught in background thread.  Unable to handle.
2022-08-21 00:02:38 [7] [ERROR] [azure.iot.device.common.handle_exceptions] ['azure.iot.device.common.pipeline.pipeline_exceptions.OperationTimeout: Transport timeout on connection operation\n']
2022-08-21 00:02:38 [7] [WARNING] [azure.iot.device.common.pipeline.pipeline_stages_base] ReconnectStage: DisconnectEvent received while in unexpected state - CONNECTING, Connected: False
2022-08-21 00:04:38 [7] [WARNING] [azure.iot.device.common.pipeline.pipeline_stages_base] ReconnectStage: DisconnectEvent received while in unexpected state - CONNECTING, Connected: False
2022-08-21 00:05:38 [7] [WARNING] [azure.iot.device.common.pipeline.pipeline_stages_base] ReconnectStage: DisconnectEvent received while in unexpected state - CONNECTING, Connected: False
2022-08-21 00:06:48 [7] [ERROR] [azure.iot.device.common.handle_exceptions] Exception caught in background thread.  Unable to handle.
2022-08-21 00:06:48 [7] [ERROR] [azure.iot.device.common.handle_exceptions] ['azure.iot.device.common.pipeline.pipeline_exceptions.OperationTimeout: Transport timeout on connection operation\n']
2022-08-21 00:06:48 [7] [WARNING] [azure.iot.device.common.pipeline.pipeline_stages_base] ReconnectStage: DisconnectEvent received while in unexpected state - CONNECTING, Connected: False
2022-08-21 00:08:48 [7] [WARNING] [azure.iot.device.common.pipeline.pipeline_stages_base] ReconnectStage: DisconnectEvent received while in unexpected state - CONNECTING, Connected: False
2022-08-21 00:09:48 [7] [WARNING] [azure.iot.device.common.pipeline.pipeline_stages_base] ReconnectStage: DisconnectEvent received while in unexpected state - CONNECTING, Connected: False
2022-08-21 00:10:56 [7] [WARNING] [azure.iot.device.common.pipeline.pipeline_stages_base] ReconnectStage: DisconnectEvent received while in unexpected state - CONNECTING, Connected: False
2022-08-21 00:12:29 [7] [WARNING] [azure.iot.device.common.pipeline.pipeline_stages_base] ReconnectStage: DisconnectEvent received while in unexpected state - CONNECTING, Connected: False
2022-08-21 00:12:29 [7] [ERROR] [azure.iot.device.common.handle_exceptions] Exception caught in background thread.  Unable to handle.
2022-08-21 00:12:29 [7] [ERROR] [azure.iot.device.common.handle_exceptions] ['azure.iot.device.common.pipeline.pipeline_exceptions.OperationTimeout: Transport timeout on connection operation\n']
2022-08-21 00:13:36 [7] [ERROR] [azure.iot.device.common.handle_exceptions] Exception caught in background thread.  Unable to handle.
2022-08-21 00:13:36 [7] [ERROR] [azure.iot.device.common.handle_exceptions] ['azure.iot.device.common.transport_exceptions.ConnectionDroppedError: transport disconnected\n']
2022-08-21 00:14:36 [7] [ERROR] [azure.iot.device.common.handle_exceptions] Exception caught in background thread.  Unable to handle.
2022-08-21 00:14:36 [7] [ERROR] [azure.iot.device.common.handle_exceptions] ['azure.iot.device.common.transport_exceptions.ConnectionDroppedError: transport disconnected\n']
2022-08-21 00:16:36 [7] [ERROR] [azure.iot.device.common.handle_exceptions] Exception caught in background thread.  Unable to handle.
2022-08-21 00:16:36 [7] [ERROR] [azure.iot.device.common.handle_exceptions] ['azure.iot.device.common.transport_exceptions.ConnectionDroppedError: transport disconnected\n']
2022-08-21 00:17:36 [7] [WARNING] [azure.iot.device.common.pipeline.pipeline_stages_base] ReconnectStage: DisconnectEvent received while in unexpected state - CONNECTING, Connected: False
Exception in thread Thread-3996 (_thread_main):
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/root/.local/lib/python3.10/site-packages/paho/mqtt/client.py", line 3591, in _thread_main
    self.loop_forever(retry_first_connection=True)
  File "/root/.local/lib/python3.10/site-packages/paho/mqtt/client.py", line 1756, in loop_forever
    rc = self._loop(timeout)
  File "/root/.local/lib/python3.10/site-packages/paho/mqtt/client.py", line 1181, in _loop
    rc = self.loop_write()
  File "/root/.local/lib/python3.10/site-packages/paho/mqtt/client.py", line 1577, in loop_write
    rc = self._packet_write()
  File "/root/.local/lib/python3.10/site-packages/paho/mqtt/client.py", line 2464, in _packet_write
    write_length = self._sock_send(
  File "/root/.local/lib/python3.10/site-packages/paho/mqtt/client.py", line 649, in _sock_send
    return self._sock.send(buf)
  File "/root/.local/lib/python3.10/site-packages/paho/mqtt/client.py", line 3968, in send
    return self._send_impl(data)
  File "/root/.local/lib/python3.10/site-packages/paho/mqtt/client.py", line 3950, in _send_impl
    length = self._socket.send(self._sendbuffer)
  File "/usr/local/lib/python3.10/ssl.py", line 1208, in send
    return super().send(data, flags)
OSError: [Errno 9] Bad file descriptor
Exception in thread Thread-3995:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.10/threading.py", line 1378, in run
    self.function(*self.args, **self.kwargs)
  File "/root/.local/lib/python3.10/site-packages/azure/iot/device/common/pipeline/pipeline_thread.py", line 132, in wrapper
    return future.result()
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 446, in result
    return self.__get_result()
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/root/.local/lib/python3.10/site-packages/azure/iot/device/common/pipeline/pipeline_thread.py", line 109, in thread_proc
    return func(*args, **kwargs)
  File "/root/.local/lib/python3.10/site-packages/azure/iot/device/common/pipeline/pipeline_stages_mqtt.py", line 80, in watchdog_function
    this.transport.disconnect()
  File "/root/.local/lib/python3.10/site-packages/azure/iot/device/common/mqtt_transport.py", line 449, in disconnect
    raise err
azure.iot.device.common.transport_exceptions.NoConnectionError: The client is not currently connected.

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

1 reaction
cartertinney commented, Aug 29, 2022

@simebg

I would say that there isn’t exactly a way to map errors to fatal or non-fatal; it’s more contextual. There are errors that are usually recoverable but that, in certain contexts, indicate some kind of more serious failure has occurred; generally that’s when you see the same error over and over again, as you do here. At this point in time, the background exception handler is mostly just for reporting.
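
A minimal sketch of that “same error over and over” heuristic, assuming the on_background_exception handler property exposed by recent 2.x clients (the window, threshold, and exit policy below are illustrative choices, not SDK behaviour):

import collections
import logging
import os
import time

_recent = collections.deque(maxlen=50)

def on_background_exception(e):
    # track background exception types over a 10-minute sliding window;
    # a storm of the same type suggests a wedged pipeline, not a blip
    now = time.monotonic()
    _recent.append((now, type(e).__name__))
    same = sum(1 for t, name in _recent
               if name == type(e).__name__ and now - t < 600)
    if same >= 10:
        logging.critical("repeated %s; exiting so the container restarts",
                         type(e).__name__)
        os._exit(1)

# iot_client is the sample's IoTHubDeviceClient
iot_client.on_background_exception = on_background_exception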

I am somewhat surprised that this new version completely addressed your issue, but I’m glad to hear it has been resolved (at least for now). If the problem re-emerges, please don’t hesitate to send those logs on over in a new issue.

0 reactions
simebg commented, Aug 29, 2022

@cartertinney

With the new version (2.12.0), the upload functionality appears to be much more stable. Apart from the occasional timeout exception, I haven’t seen any other issues in the monitoring tools. I disabled the restart functionality on the background thread exception handler.

I do have a question regarding this, though: is there a list of fatal exceptions in the documentation? In general, I see that the library recovers by itself; however, I would like to add some safety code if possible.

Regarding the debugging level, I’ll keep the sample code running and monitor it occasionally in case I see the issues pop up again.

In summary, I think it’s safe to close this issue. If I see these issues pop up again, I’ll open a new issue and send logs with debug level enabled.
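
As a follow-up to the safety-code question above, here is a hedged sketch of one option: a watchdog built on the boolean connected property of the 2.x clients (the watchdog itself and its 15-minute limit are illustrative, not part of the SDK).

import asyncio
import logging
import os
from time import time

async def connection_watchdog(client, max_down: float = 900.0):
    # exit if the client stays disconnected past max_down seconds, on the
    # theory that the reconnect pipeline is wedged and a fresh process is
    # cheaper than a leaked-thread crash later
    down_since = None
    while True:
        if client.connected:
            down_since = None
        elif down_since is None:
            down_since = time()
        elif time() - down_since > max_down:
            logging.critical("disconnected for %.0fs; exiting", max_down)
            os._exit(1)
        await asyncio.sleep(30)

It could be started alongside the other tasks, e.g. asyncio.create_task(connection_watchdog(iot_client)).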
