Possibly broken calling convention for producev on osx-aarch64


Description

Disclaimer up front: a lot of this could be incorrect, but it is what I have managed to find while debugging the issue so far.

While trying to get the library working on an M1 Mac with a manually linked osx-aarch64 dylib, I ran into an issue with calling producev. My primary suspicion is that either the compiled code is not entirely correct or the calling convention used on the dotnet side is incorrect for this OS and architecture combination. Both of these point to the variadic argument usage in rd_kafka_producev.

Decompiling the dylib shows that the function has a signature that looks something like:

int _rd_kafka_producev(long kafka_handle,undefined8 *topic_tag,ulong topic,byte *partition_tag,
                      byte *partition,byte *value_tag,byte *value_ptr,byte *value_len)

These occupy registers x0-x7. The rest of the parameters seem to be pushed onto the stack.

        x0 = 0x000000013000a200
        x1 = 0x0000000000000001
        x2 = 0x000000017085c330
        x3 = 0x0000000000000003
        x4 = 0x00000000ffffffff
        x5 = 0x0000000000000004
        x6 = 0x00000004800c3fa0
        x7 = 0x0000000000000004
        x8 = 0x000000013382153c  librdkafka.dylib`rd_kafka_producev

These are correct and match what is being passed, e.g. x2:

(lldb) x $x2
0x17085c330: 71 75 69 63 6b 73 74 61 72 74 00 00 01 00 00 00  quickstart......

The issue becomes apparent when execution reaches the while/switch loop at: https://github.com/edenhill/librdkafka/blob/v1.8.2/src/rdkafka_msg.c#L560

As I understand it, this loop should walk through all the variadic arguments: read a tag value, jump to the matching case branch, consume that tag's arguments, and repeat until the end tag is reached. On osx-aarch64 it is evaluated differently. The switch appears to read from the stack only, so it jumps to tag 5 on the very first iteration. Stepping through the rest of the loop confirms this: the branches are visited in the order 5, 6, 7, 8, 10, 0/default. Because the first branches are skipped, the topic (rkt) is never set, which causes the function to return INVALID_ARG at https://github.com/edenhill/librdkafka/blob/v1.8.2/src/rdkafka_msg.c#L639 .
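The tag-driven loop described above can be sketched in isolation. This is a minimal illustration, not librdkafka's code: every `demo_` and `DEMO_` name is hypothetical, and only the tag numbers 0, 1, 3 and 4 are taken from the register dump. One plausible explanation for the behaviour, consistent with that dump, is that Apple's arm64 ABI passes every variadic argument on the stack, whereas a caller that binds to the function with a fixed parameter list places the first eight arguments in x0-x7, so the callee's `va_arg` never sees them.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tags; the numbers mirror those seen in x1, x3 and x5 above. */
enum demo_vtype { DEMO_END = 0, DEMO_TOPIC = 1, DEMO_PARTITION = 3, DEMO_VALUE = 4 };

/* Hypothetical result record filled in by the parser. */
struct demo_msg {
    const char *topic;
    int32_t partition;
    const void *value;
    size_t value_len;
};

/* Reads (tag, args...) groups until DEMO_END.
 * Returns 0 on success, -1 on an unknown tag or a missing topic. */
int demo_producev(struct demo_msg *out, ...)
{
    va_list ap;
    int tag;
    int rc = 0;

    assert(out != NULL);
    out->topic = NULL;
    out->partition = -1;
    out->value = NULL;
    out->value_len = 0;

    va_start(ap, out);
    while (rc == 0 && (tag = va_arg(ap, int)) != DEMO_END) {
        switch (tag) {
        case DEMO_TOPIC:
            out->topic = va_arg(ap, const char *);
            break;
        case DEMO_PARTITION:
            out->partition = va_arg(ap, int32_t);
            break;
        case DEMO_VALUE:
            /* This tag consumes two arguments: pointer, then length. */
            out->value = va_arg(ap, const void *);
            out->value_len = va_arg(ap, size_t);
            break;
        default:
            rc = -1;
            break;
        }
    }
    va_end(ap);

    /* Mirrors the failure described above: no topic means invalid arguments. */
    if (rc == 0 && out->topic == NULL)
        rc = -1;
    return rc;
}
```

Called from C as `demo_producev(&msg, DEMO_TOPIC, "quickstart", DEMO_PARTITION, (int32_t)-1, DEMO_VALUE, (const void *)"test", (size_t)4, DEMO_END)`, the compiler knows the call is variadic and lays the arguments out where `va_arg` expects them. If the parser starts reading partway through the list, the topic branch is never taken and the call fails, which matches the symptom in the issue.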

There is also an issue in the librdkafka repository asking for this runtime to be added to the redistributable ( https://github.com/edenhill/librdkafka/issues/3546 ) and an issue here ( https://github.com/confluentinc/confluent-kafka-dotnet/issues/1707 ), but to me it seems that the fix is not as simple as packaging such a build.

I also looked at the Go package, as people seem to have got it working on M1 ( https://github.com/confluentinc/confluent-kafka-go/issues/591 ). The difference there is that it uses cgo and adds do_produce, a function with non-variadic arguments that calls producev. Only do_produce is called from the Go code.
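The shim pattern used by the Go package can be sketched as follows. This is an illustration only: `shim_producev` is a hypothetical stand-in for a variadic API, and `shim_do_produce` shows the general shape of a fixed-arity wrapper, not the actual do_produce from confluent-kafka-go. The point is that the variadic call is emitted by the C compiler, which knows the platform's varargs convention, while foreign callers only ever bind to the fixed signature.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tags for the stand-in variadic API. */
enum { SHIM_END = 0, SHIM_TOPIC = 1, SHIM_PARTITION = 3, SHIM_VALUE = 4 };

/* Hypothetical variadic callee: returns the value length if a topic
 * was supplied, otherwise -1. */
long shim_producev(void *handle, ...)
{
    va_list ap;
    const char *topic = NULL;
    size_t len = 0;
    int tag;

    (void)handle;
    va_start(ap, handle);
    while ((tag = va_arg(ap, int)) != SHIM_END) {
        if (tag == SHIM_TOPIC) {
            topic = va_arg(ap, const char *);
        } else if (tag == SHIM_PARTITION) {
            (void)va_arg(ap, int32_t);
        } else if (tag == SHIM_VALUE) {
            (void)va_arg(ap, const void *);
            len = va_arg(ap, size_t);
        } else {
            va_end(ap);
            return -1;
        }
    }
    va_end(ap);
    return topic != NULL ? (long)len : -1;
}

/* Fixed-arity shim: a foreign caller binds to this signature, and the
 * variadic call below is compiled by the C compiler for the target ABI. */
long shim_do_produce(void *handle, const char *topic, int32_t partition,
                     const void *value, size_t len)
{
    assert(value != NULL || len == 0);
    return shim_producev(handle,
                         SHIM_TOPIC, topic,
                         SHIM_PARTITION, partition,
                         SHIM_VALUE, value, len,
                         SHIM_END);
}
```

The trade-off is that a wrapper like this has to be compiled and shipped as native code, which cgo does automatically but a managed binding would have to arrange separately.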

Going back to why I believe varargs are at fault, there are a few articles talking about how calling convention was changed and how things are broken specifically on M1 in certain cases:

Considering that there is rd_kafka_produceva, which looks like it was introduced to avoid the varargs calling convention ( https://github.com/edenhill/librdkafka/pull/2902 ), maybe it would make sense to switch to that as a solution?
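The idea behind an array-based entry point can be sketched like this. The types and names below (`vu_arg`, `vu_produce`, `VU_*`) are hypothetical and only loosely modeled on the concept, not on librdkafka's actual definitions. Each argument becomes one tagged-union element, and the function takes a plain pointer and count, so no variadic calling convention is involved and the call looks the same from any language that can marshal a struct array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tags and tagged-union element. */
enum vu_type { VU_TOPIC = 1, VU_PARTITION = 3, VU_VALUE = 4 };

struct vu_arg {
    enum vu_type type;
    union {
        const char *cstr;                              /* VU_TOPIC */
        int32_t i32;                                   /* VU_PARTITION */
        struct { const void *ptr; size_t size; } mem;  /* VU_VALUE */
    } u;
};

/* Hypothetical result record filled in by the parser. */
struct vu_msg {
    const char *topic;
    int32_t partition;
    const void *value;
    size_t value_len;
};

/* Fixed signature: pointer plus count, no `...`.
 * Returns 0 on success, -1 on an unknown tag or a missing topic. */
int vu_produce(struct vu_msg *out, const struct vu_arg *args, size_t cnt)
{
    size_t i;

    assert(out != NULL && (args != NULL || cnt == 0));
    out->topic = NULL;
    out->partition = -1;
    out->value = NULL;
    out->value_len = 0;

    for (i = 0; i < cnt; i++) {
        switch (args[i].type) {
        case VU_TOPIC:
            out->topic = args[i].u.cstr;
            break;
        case VU_PARTITION:
            out->partition = args[i].u.i32;
            break;
        case VU_VALUE:
            out->value = args[i].u.mem.ptr;
            out->value_len = args[i].u.mem.size;
            break;
        default:
            return -1;
        }
    }
    return out->topic != NULL ? 0 : -1;
}
```

Compared with the variadic form, the caller pays for building an array, but the struct layout is the only contract that has to match across the native boundary.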

How to reproduce

  1. brew install librdkafka
  2. Create a new .NET 6 project with a basic Kafka producer example (I used the first example from https://github.com/confluentinc/confluent-kafka-dotnet#usage ).
  3. Link brew librdkafka in csproj:
    <ItemGroup Condition="'$([System.Runtime.InteropServices.RuntimeInformation]::OSArchitecture)' == 'Arm64' And '$([System.Runtime.InteropServices.RuntimeInformation]::IsOSPlatform($([System.Runtime.InteropServices.OSPlatform]::OSX)))' == 'True'">
        <Content Include="/opt/homebrew/Cellar/librdkafka/1.8.2/lib/librdkafka.dylib">
            <Link>librdkafka.dylib</Link>
            <CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
        </Content>
    </ItemGroup>
  4. Launch Kafka (I used the docker-compose file from https://developer.confluent.io/quickstart/kafka-docker/ ).
  5. Try to produce a message.

The application should run, indicating that the library is linked and loaded, but produce returns error code -186 (INVALID_ARG).

Checklist

Please provide the following information:

  • A complete (i.e. we can run it), minimal program demonstrating the problem. No need to supply a project file.
  • Confluent.Kafka nuget version. 1.8.2
  • Apache Kafka version. 7.0.1 docker images
  • Client configuration.
  • Operating system. OSX Aarch64
  • Provide logs (with “debug” : “…” as necessary in configuration).
  • Provide broker log excerpts.
  • Critical issue.

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Reactions: 1
  • Comments: 19 (12 by maintainers)

Top GitHub Comments

8 reactions
mhowlett commented, Mar 4, 2022

sounds good to me - thanks! i’ll review it for inclusion in v1.9.0

2 reactions
mhowlett commented, Jun 1, 2022

The error reported by @niemyjski is due to dotnet not being able to load the x64 librdkafka binary. It seems likely to me there is a way to make it work (given rosetta etc.), but I don’t know what that is, and it’ll be less optimal than running a native build. One way to make this work is to compile librdkafka from source (configure, make, make install) and overwrite ~/.nuget/packages/librdkafka.redist/1.8.2/runtimes/osx-x64/native/librdkafka.dylib with the one from /usr/local/lib/librdkafka.1.dylib.

Or use Library.Load before calling any other Confluent Kafka methods (that approach may be broken, I’m unsure).

@edenhill - i think we should start including an apple silicon build in librdkafka.redist. it’ll be well used.
