
Eventhubs extension crashes with segmentation fault, either SIGSEGV or SIGABORT from send_batch()

See original GitHub issue
  • Package Name: azure-eventhub
  • Package Version: 5.2.0 (latest at the time of writing). Originally observed on 5.1.0; still present after upgrading to 5.2.0.
  • Operating System: Linux, Debian, kernel 4.9.168-1-amd64
  • Python Version: 3.5
  • Possibly related to issue #9435: https://github.com/Azure/azure-sdk-for-python/issues/9435

Describe the bug
The Python program crashes with a segmentation fault (and nothing else) when uploading data to an Event Hub. After a fair amount of debugging using gdb and eliminating all other factors (it is not in our code, as we did a complete dry run with everything minus the actual Event Hub sending), we find the following errors in the Event Hubs extension.

Output of gdb:

*** Error in `/usr/bin/python3': double free or corruption (fasttop): 0x0000555556ca03a0 ***

Signals:

*** Program received signal SIGABRT, Aborted.
*** Program received signal SIGSEGV, Segmentation fault.

Stack traces

Stack trace from gdb (py-bt command, for the SIGABRT error):

(gdb) py-bt
Traceback (most recent call first):
  <built-in method send of uamqp.c_uamqp.cMessageSender object at remote 0x7fffef69dc08>
  File "/usr/local/lib/python3.5/dist-packages/uamqp/sender.py", line 246, in send
    return self._sender.send(c_message, timeout, message)
  File "/usr/local/lib/python3.5/dist-packages/uamqp/client.py", line 605, in _transfer_message
    sent = self.message_handler.send(message, self._on_message_sent, timeout=timeout)
  File "/usr/local/lib/python3.5/dist-packages/uamqp/client.py", line 626, in _filter_pending
    self._transfer_message(message, timeout)
  File "/usr/local/lib/python3.5/dist-packages/uamqp/client.py", line 645, in _client_run
    self._pending_messages = self._filter_pending()
  File "/usr/local/lib/python3.5/dist-packages/uamqp/client.py", line 397, in do_work
    return self._client_run()
  File "/usr/local/lib/python3.5/dist-packages/uamqp/client.py", line 756, in wait
    running = self.do_work()
  File "/usr/local/lib/python3.5/dist-packages/azure/eventhub/_producer.py", line 161, in _send_event_data
    self._handler.wait()  # type: ignore
  File "/usr/local/lib/python3.5/dist-packages/azure/eventhub/_client_base.py", line 454, in _do_retryable_operation
    **kwargs
  File "/usr/local/lib/python3.5/dist-packages/azure/eventhub/_producer.py", line 171, in _send_event_data_with_retry
    return self._do_retryable_operation(self._send_event_data, timeout=timeout)
  File "/usr/local/lib/python3.5/dist-packages/azure/eventhub/_producer.py", line 262, in send
    self._send_event_data_with_retry(timeout=timeout)
  File "/usr/local/lib/python3.5/dist-packages/azure/eventhub/_producer_client.py", line 245, in send_batch
    to_send_batch, timeout=send_timeout
  File "/home/user/program/program.py", line 173, in send_batch_of_data
    producer.send_batch(event_data_batch)
  File "/home/user/program/program.py", line 300, in main
    print("Sending all new data...")
  File "program_script.py", line 4, in <module>
    program.main()

Stack trace from gdb (py-bt command, for the SIGSEGV error):

(gdb) py-bt
Traceback (most recent call first):
  <built-in method send of uamqp.c_uamqp.cMessageSender object at remote 0x7fffef6c7c48>
  File "/usr/local/lib/python3.5/dist-packages/uamqp/sender.py", line 246, in send
    return self._sender.send(c_message, timeout, message)
  File "/usr/local/lib/python3.5/dist-packages/uamqp/client.py", line 601, in _transfer_message
    sent = self.message_handler.send(message, self._on_message_sent, timeout=timeout)
  File "/usr/local/lib/python3.5/dist-packages/uamqp/client.py", line 622, in _filter_pending
    self._transfer_message(message, timeout)
  File "/usr/local/lib/python3.5/dist-packages/uamqp/client.py", line 641, in _client_run
    self._pending_messages = self._filter_pending()
  File "/usr/local/lib/python3.5/dist-packages/uamqp/client.py", line 397, in do_work
    return self._client_run()
  File "/usr/local/lib/python3.5/dist-packages/uamqp/client.py", line 752, in wait
    running = self.do_work()
  File "/usr/local/lib/python3.5/dist-packages/azure/eventhub/_producer.py", line 161, in _send_event_data
    self._handler.wait()  # type: ignore
  File "/usr/local/lib/python3.5/dist-packages/azure/eventhub/_client_base.py", line 454, in _do_retryable_operation
    **kwargs
  File "/usr/local/lib/python3.5/dist-packages/azure/eventhub/_producer.py", line 171, in _send_event_data_with_retry
    return self._do_retryable_operation(self._send_event_data, timeout=timeout)
  File "/usr/local/lib/python3.5/dist-packages/azure/eventhub/_producer.py", line 262, in send
    self._send_event_data_with_retry(timeout=timeout)
  File "/usr/local/lib/python3.5/dist-packages/azure/eventhub/_producer_client.py", line 245, in send_batch
    to_send_batch, timeout=send_timeout
  File "/home/user/program/program.py", line 156, in send_batch_of_data
    producer.send_batch(event_data_batch)
  File "/home/user/program/program.py", line 263, in main
    latest_id = send_batch_of_data(producer,
  File "program_script.py", line 4, in <module>
    program.main()

To Reproduce
I cannot share our entire codebase for fetching the data, or the actual data itself, but in theory the following should be sufficient.

Steps to reproduce the behavior:

  1. Get a fair amount of data, around 10-15 million records/rows. In our case it comes from a database via sqlalchemy, and we use the query.yield_per(1000) method to avoid loading that many rows into memory all at once. (A combined sketch of steps 1-3 follows after this list.)
  2. Open an Event Hub connection:

from azure.eventhub import EventHubProducerClient

producer = EventHubProducerClient.from_connection_string(
	conn_str="<connection string here, e.g.: Endpoint=sb://........>",
	eventhub_name="<name here>")

  3. Convert and upload the data in batches, in JSON form, trimmed down to the essentials:
import json

from azure.eventhub import EventData

event_data_batch = producer.create_batch()
for row in data:
	json_object = {
		"id": row.id,
		# And more data fields here of course, in our case about 10 more basic values, nothing fancy
	}

	json_string = json.dumps(json_object, indent=4, sort_keys=True)
	event_data = EventData(json_string)

	try:
		event_data_batch.add(event_data)
	except ValueError:
		# Reached max batch size: send this batch and start a new one.
		# The segfault/SIGABRT occurs on the next line, but not consistently... :(
		producer.send_batch(event_data_batch)
		event_data_batch = producer.create_batch()
		event_data_batch.add(event_data)

# And a last send_batch() call here to upload the final batch of data,
# with code pretty much the same as above.
  4. Get a “segmentation fault” when running the program. It does not always happen at the exact same “time”, but it does happen at the exact same line of code, as noted in the comment in the previous step, so it is independent of the actual data being sent. Furthermore, even though the error and/or stack trace suggest a network issue, the program runs on a dedicated VPS with a gigabit fibre internet connection, so it is not, e.g., a flaky 4G connection, and the network should be sufficiently stable.
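
For completeness, here is a minimal end-to-end sketch of steps 1-3 above, including the final flush. This is not our production code: the engine DSN and the Record model are hypothetical placeholders, it assumes a SQLAlchemy 1.4-style Session, and it uses len() on the EventDataBatch to detect a non-empty final batch.

import json

from azure.eventhub import EventHubProducerClient, EventData
from sqlalchemy import create_engine
from sqlalchemy.orm import Session

engine = create_engine("postgresql://user:pass@localhost/db")  # hypothetical DSN
producer = EventHubProducerClient.from_connection_string(
	conn_str="<connection string here>",
	eventhub_name="<name here>")

with Session(engine) as session, producer:
	batch = producer.create_batch()
	# Stream rows 1000 at a time so 10-15 million rows never sit in memory at once.
	for row in session.query(Record).yield_per(1000):  # Record is a hypothetical mapped model
		event = EventData(json.dumps({"id": row.id}, indent=4, sort_keys=True))
		try:
			batch.add(event)
		except ValueError:  # batch is full
			producer.send_batch(batch)  # the intermittent crash happens here
			batch = producer.create_batch()
			batch.add(event)
	if len(batch) > 0:
		producer.send_batch(batch)  # final flush for the last partial batch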

Expected behavior
No segmentation fault and no hard crash; just upload the data. A Python exception would also be fine if something went wrong, but not a hard crash like this: even pdb is unable to handle it gracefully and crashes as well.
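
To illustrate what “a Python exception would also be fine” means here (not from the original report; it assumes the v5 SDK's documented base exception azure.eventhub.exceptions.EventHubError):

from azure.eventhub.exceptions import EventHubError

try:
	producer.send_batch(event_data_batch)
except EventHubError as exc:
	# This is the failure mode one would expect from a client library:
	# a catchable Python exception instead of a hard SIGSEGV/SIGABRT.
	print("send_batch failed:", exc)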

Screenshots
I can add screenshots, but I think the stack traces and the information above should be sufficient. If not, let me know; I can run gdb again and/or provide more info if needed. However, I cannot share our entire codebase or the actual data being sent. The code above is exactly what happens, minus some details irrelevant to the bug.

Additional context
It does not happen consistently after the same amount of data has been sent, but it does eventually happen at the exact same line, with either of the two signals being raised: SIGSEGV or SIGABRT. Given the stack trace it could be a network issue, but as said, the program runs on a dedicated VPS. The connection string should be correct (some data is sent and received before the crash), so I would have expected such an error at the first send_batch() call, or when connecting to the Event Hub, rather than a random number of calls later. Also, the Python program uses no multiprocessing or multithreading: it is completely single-threaded. Frankly, I'm at a loss, and I hope this is fixable…
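
Not in the original report, but worth noting: the standard library's faulthandler module (available since Python 3.3, so usable on 3.5) can dump a Python-level traceback when the process receives a fatal signal, which helps diagnose crashes like this without attaching gdb:

import faulthandler

# Dump the Python traceback of all threads to stderr if the process
# receives SIGSEGV, SIGFPE, SIGABRT, SIGBUS or SIGILL.
faulthandler.enable()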

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 14 (5 by maintainers)

Top GitHub Comments

2 reactions
yunhaoling commented on Apr 7, 2021

Hey @MR-KO, thanks for your patience! We have fixed the issue in azure-eventhub 5.4.0. Please update to the latest version via pip install azure-eventhub --upgrade. (If you're interested, the root cause lies in uamqp, and the analysis can be found here: https://github.com/Azure/azure-uamqp-python/pull/217#issue-595648009)
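
(A quick way to confirm the upgrade took effect; not part of the original comment, though the azure-eventhub package does expose a package-level __version__:)

import azure.eventhub

print(azure.eventhub.__version__)  # should print 5.4.0 or later after the upgrade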

I’m closing this now, feel free to reopen if you’re still encountering the issue, thanks!

1 reaction
MR-KO commented on Mar 15, 2021

Hi @yunhaoling, no problem, I know how the job goes as a fellow software engineer 😃. I'm already glad about the ongoing effort. Great to hear you can reproduce it. I took your code, added my config/login details, and also got a stack trace etc. as you'd expect. There are a few things of interest to note:

  1. Our application follows almost exactly the same pattern. At first, the Event Hubs connection/login details are tested to ensure they actually work before doing heavy DB access etc. Then that Event Hubs connection is fully closed, data is gathered, and only when there is data to send is the Event Hubs connection made again. Immediately after that, data is sent using azure.eventhub.EventHubProducerClient, with batches of EventData (see the code in the issue description). We do not (of course) call time.sleep(), and our DB access also does not seem to produce such huge delays between rows of data (batches of 1000 rows). I am currently verifying this explicitly to see if that is the case and will get back to you (it's running now). EDIT: I cannot verify this, as it crashes at the producer.send_batch(event_data_batch) call and hence I get no more output/logs. Up to that point, however, there is no huge time gap between rows of data (it's in the order of microseconds), suggesting that the DB access is not the issue…
  2. Having said that, the stack trace from py-bt-full is almost exactly the same as the one from my previous comment for the first 5 functions; after that, the last function of this stack trace differs (it is uamqp/client.py, line 725, in send_message). Of course the remaining azure-eventhub frames are missing now, since this test uses uamqp directly.

Stack trace:

(gdb) py-bt-full
#10 <built-in method send of uamqp.c_uamqp.cMessageSender object at remote 0x7ffff4303ac8>
#13 Frame 0x55555601f7c8, for file /usr/local/lib/python3.5/dist-packages/uamqp/sender.py, line 246, in send (self=<MessageSender(source=<uamqp.c_uamqp.CompositeValue at remote 0x7ffff430f090>, error_policy=<ErrorPolicy(max_retries=3, _on_error=None) at remote 0x7ffff42ff9b0>, _state=<MessageSenderState(_value_=1, __objclass__=<EnumMeta(Open=<MessageSenderState(_value_=3, __objclass__=<...>, _name_='Open') at remote 0x7ffff5797630>, Closing=<MessageSenderState(_value_=4, __objclass__=<...>, _name_='Closing') at remote 0x7ffff57976a0>, _member_map_={'Closing': <...>, 'Open': <...>, 'Idle': <...>, 'Opening': <MessageSenderState(_value_=2, __objclass__=<...>, _name_='Opening') at remote 0x7ffff5797668>, 'Error': <Mess_mapenderState(_value_=5, __objclass__=<...>, _name_='Error') at remote 0x7ffff57976d8>}, _member_names_=['Idle', 'Opening', 'Open', 'Closing', 'Error'], Opening=<...>, _value2member_map_={1: <...>, 2: <...>, 3: <...>, 4: <...>, 5: <...>}, Error=<...>, __doc__='An enumeration.', __module__='uamqp.consta...(truncated)
return self._sender.send(c_message, timeout, message)
#17 Frame 0x7ffff4317408, for file /usr/local/lib/python3.5/dist-packages/uamqp/client.py, line 605, in _transfer_message (self=<SendClient(_channel_max=None, _link_properties=None, _hostname='<hostname here>', _keep_alive_thread=None, message_handler=<MessageSender(source=<uamqp.c_uamqp.CompositeValue at remote 0x7ffff430f090>, error_policy=<ErrorPolicy(max_retries=3, _on_error=None) at remote 0x7ffff42ff9b0>, _state=<MessageSenderState(_value_=1, __objclass__=<EnumMeta(Open=<MessageSenderState(_value_=3, __objclass__=<...>, _name_='Open') at remote 0x7ffff5797630>, Closing=<MessageSenderState(_value_=4, __objclass__=<...>, _name_='Closing') at remote 0x7ffff57976a0>, _membenderp_={'Closing': <...>, 'Open': <...>, 'Idle': <...>, 'Opening': <MessageSenderState(_value_=2, __objclass__=<...>, _name_='Opening') at remote 0x7ffff5797668>, 'Error': <MessageSenderState(_value_=5, __objclass__=<...>, _name_='Error') at remote 0x7ffff57976d8>}, _member_names_=['Idle', 'Opening', 'O...(truncated)
sent = self.message_handler.send(message, self._on_message_sent, timeout=timeout)
#20 Frame 0x55555601ecf8, for file /usr/local/lib/python3.5/dist-packages/uamqp/client.py, line 626, in _filter_pending (self=<SendClient(_channel_max=None, _link_properties=None, _hostname='<hostname here>', _keep_alive_thread=None, message_handler=<MessageSender(source=<uamqp.c_uamqp.CompositeValue at remote 0x7ffff430f090>, error_policy=<ErrorPolicy(max_retries=3, _on_error=None) at remote 0x7ffff42ff9b0>, _state=<MessageSenderState(_value_=1, __objclass__=<EnumMeta(Open=<MessageSenderState(_value_=3, __objclass__=<...>, _name_='Open') at remote 0x7ffff5797630>, Closing=<MessageSenderState(_value_=4, __objclass__=<...>, _name_='Closing') at remote 0x7ffff57976a0>, _member_erSt={'Closing': <...>, 'Open': <...>, 'Idle': <...>, 'Opening': <MessageSenderState(_value_=2, __objclass__=<...>, _name_='Opening') at remote 0x7ffff5797668>, 'Error': <MessageSenderState(_value_=5, __objclass__=<...>, _name_='Error') at remote 0x7ffff57976d8>}, _member_names_=['Idle', 'Opening', 'Ope...(truncated)
self._transfer_message(message, timeout)
#23 Frame 0x7ffff4315c50, for file /usr/local/lib/python3.5/dist-packages/uamqp/client.py, line 645, in _client_run (self=<SendClient(_channel_max=None, _link_properties=None, _hostname='<hostname here>', _keep_alive_thread=None, message_handler=<MessageSender(source=<uamqp.c_uamqp.CompositeValue at remote 0x7ffff430f090>, error_policy=<ErrorPolicy(max_retries=3, _on_error=None) at remote 0x7ffff42ff9b0>, _state=<MessageSenderState(_value_=1, __objclass__=<EnumMeta(Open=<MessageSenderState(_value_=3, __objclass__=<...>, _name_='Open') at remote 0x7ffff5797630>, Closing=<MessageSenderState(_value_=4, __objclass__=<...>, _name_='Closing') at remote 0x7ffff57976a0>, _member_map_ate(losing': <...>, 'Open': <...>, 'Idle': <...>, 'Opening': <MessageSenderState(_value_=2, __objclass__=<...>, _name_='Opening') at remote 0x7ffff5797668>, 'Error': <MessageSenderState(_value_=5, __objclass__=<...>, _name_='Error') at remote 0x7ffff57976d8>}, _member_names_=['Idle', 'Opening', 'Open', ...(truncated)
self._pending_messages = self._filter_pending()
#26 Frame 0x7ffff43161f0, for file /usr/local/lib/python3.5/dist-packages/uamqp/client.py, line 397, in do_work (self=<SendClient(_channel_max=None, _link_properties=None, _hostname='<hostname here>', _keep_alive_thread=None, message_handler=<MessageSender(source=<uamqp.c_uamqp.CompositeValue at remote 0x7ffff430f090>, error_policy=<ErrorPolicy(max_retries=3, _on_error=None) at remote 0x7ffff42ff9b0>, _state=<MessageSenderState(_value_=1, __objclass__=<EnumMeta(Open=<MessageSenderState(_value_=3, __objclass__=<...>, _name_='Open') at remote 0x7ffff5797630>, Closing=<MessageSenderState(_value_=4, __objclass__=<...>, _name_='Closing') at remote 0x7ffff57976a0>, _member_map_={'C_valng': <...>, 'Open': <...>, 'Idle': <...>, 'Opening': <MessageSenderState(_value_=2, __objclass__=<...>, _name_='Opening') at remote 0x7ffff5797668>, 'Error': <MessageSenderState(_value_=5, __objclass__=<...>, _name_='Error') at remote 0x7ffff57976d8>}, _member_names_=['Idle', 'Opening', 'Open', 'Clo...(truncated)
return self._client_run()
#29 Frame 0x55555601f548, for file /usr/local/lib/python3.5/dist-packages/uamqp/client.py, line 725, in send_message (self=<SendClient(_channel_max=None, _link_properties=None, _hostname='<hostname here>', _keep_alive_thread=None, message_handler=<MessageSender(source=<uamqp.c_uamqp.CompositeValue at remote 0x7ffff430f090>, error_policy=<ErrorPolicy(max_retries=3, _on_error=None) at remote 0x7ffff42ff9b0>, _state=<MessageSenderState(_value_=1, __objclass__=<EnumMeta(Open=<MessageSenderState(_value_=3, __objclass__=<...>, _name_='Open') at remote 0x7ffff5797630>, Closing=<MessageSenderState(_value_=4, __objclass__=<...>, _name_='Closing') at remote 0x7ffff57976a0>, _member_maptateClosing': <...>, 'Open': <...>, 'Idle': <...>, 'Opening': <MessageSenderState(_value_=2, __objclass__=<...>, _name_='Opening') at remote 0x7ffff5797668>, 'Error': <MessageSenderState(_value_=5, __objclass__=<...>, _name_='Error') at remote 0x7ffff57976d8>}, _member_names_=['Idle', 'Opening', 'Open',...(truncated)
running = self.do_work()
#33 Frame 0x7ffff6cc6828, for file eh_debug_test.py, line 42, in <module> ()
send_client.send_message(message)

So it seems very likely that this is indeed the culprit!
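
For reference, a minimal sketch of what a raw-uamqp reproduction script along the lines of the eh_debug_test.py seen in the trace might look like. This is an assumption-heavy reconstruction, not the actual script: the SAS auth helper, URI formats, and all placeholder values are guesses based on the public uamqp API.

import uamqp
from uamqp import authentication

# Hypothetical placeholders; the real script's values were not shared.
hostname = "<namespace>.servicebus.windows.net"
eventhub = "<name here>"

sas_auth = authentication.SASTokenAuth.from_shared_access_key(
	uri="sb://{}/{}".format(hostname, eventhub),
	key_name="<policy name>",
	shared_access_key="<key>")

send_client = uamqp.SendClient(
	"amqps://{}/{}".format(hostname, eventhub), auth=sas_auth)

# Repeated sends eventually crash inside uamqp.c_uamqp.cMessageSender.send,
# matching frame #10 of the py-bt-full trace above.
message = uamqp.Message(b'{"id": 1}')
send_client.send_message(message)
send_client.close()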
