Possibly broken calling convention for producev on osx-aarch64
Description
Disclaimer upfront: a lot of this could be incorrect, but it is what I have managed to find while debugging the issue so far.
While trying to get the library working on an M1 Mac with a manually linked osx-aarch64 dylib, I ran into an issue when calling producev. My primary suspicion is that either the compiled code is not entirely correct or the calling convention used on the dotnet side is incorrect for this OS and architecture combination. Both point to the variadic argument usage in rd_kafka_producev.
Decompiling the dylib shows that the function has a signature that looks something like:
int _rd_kafka_producev(long kafka_handle,undefined8 *topic_tag,ulong topic,byte *partition_tag,
byte *partition,byte *value_tag,byte *value_ptr,byte *value_len)
These occupy registers x0-x7. The rest of the parameters seem to be pushed onto the stack.
x0 = 0x000000013000a200
x1 = 0x0000000000000001
x2 = 0x000000017085c330
x3 = 0x0000000000000003
x4 = 0x00000000ffffffff
x5 = 0x0000000000000004
x6 = 0x00000004800c3fa0
x7 = 0x0000000000000004
x8 = 0x000000013382153c librdkafka.dylib`rd_kafka_producev
These are correct and match what is being passed, e.g. x2:
(lldb) x $x2
0x17085c330: 71 75 69 63 6b 73 74 61 72 74 00 00 01 00 00 00 quickstart......
The issue becomes apparent when execution reaches the while-switch loop at https://github.com/edenhill/librdkafka/blob/v1.8.2/src/rdkafka_msg.c#L560
As I understand it, this loop should walk through all the variadic arguments: read a tag value, jump to the matching case branch, consume that tag's arguments, and repeat until the end tag is reached. On osx-aarch64 it is evaluated differently. It seems to read the varargs from the stack only, so it jumps to tag 5 on the very first iteration. Stepping through the rest of the loop confirms this: the branches are visited in the order 5, 6, 7, 8, 10, 0/default. Since the first branches are skipped, the topic (rkt) is never set. This causes it to return INVALID_ARG at https://github.com/edenhill/librdkafka/blob/v1.8.2/src/rdkafka_msg.c#L639 .
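To make the expected behaviour concrete, here is a minimal sketch of a tag-driven varargs loop of the kind described above. It is not the librdkafka source: `demo_producev`, `struct demo_msg`, and the `DEMO_*` tags are hypothetical names made up for illustration, and only the general shape (read a tag, consume that tag's arguments, stop at the end tag, fail if no topic was seen) follows the description.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>

/* Hypothetical tags for illustration only; not the real rd_kafka_vtype_t values. */
enum demo_tag { DEMO_END = 0, DEMO_TOPIC = 1, DEMO_PARTITION = 3, DEMO_VALUE = 4 };

struct demo_msg {
    const char *topic;
    int         partition;
    const void *value;
    size_t      value_len;
};

/* Walks (tag, args...) groups until DEMO_END.
 * Returns 0 on success, -1 on an unknown tag or when no topic was supplied. */
int demo_producev(struct demo_msg *out, ...) {
    va_list ap;
    int tag;
    int rc = 0;

    assert(out != NULL);
    out->topic = NULL;
    out->partition = -1;
    out->value = NULL;
    out->value_len = 0;

    va_start(ap, out);
    while (rc == 0 && (tag = va_arg(ap, int)) != DEMO_END) {
        switch (tag) {
        case DEMO_TOPIC:
            out->topic = va_arg(ap, const char *);
            break;
        case DEMO_PARTITION:
            out->partition = va_arg(ap, int);
            break;
        case DEMO_VALUE:
            /* This tag consumes two arguments: pointer, then length. */
            out->value = va_arg(ap, const void *);
            out->value_len = va_arg(ap, size_t);
            break;
        default:
            /* Unknown tag: its argument count is unknown, so stop here. */
            rc = -1;
            break;
        }
    }
    va_end(ap);

    if (rc == 0 && out->topic == NULL)
        rc = -1; /* mirrors the "topic not set" failure described above */
    return rc;
}
```

A call such as `demo_producev(&m, DEMO_TOPIC, "quickstart", DEMO_PARTITION, -1, DEMO_VALUE, (const void *)"hi", (size_t)2, DEMO_END)` only works if caller and callee agree on where each variadic argument lives. If the caller places the first tags in registers while the callee reads them from the stack, the loop starts in the middle of the argument list, which would be consistent with the first branches being skipped.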
There is already an issue in the librdkafka repository asking for this runtime to be added to the redistributable ( https://github.com/edenhill/librdkafka/issues/3546 ), as well as an issue here ( https://github.com/confluentinc/confluent-kafka-dotnet/issues/1707 ). Given the behaviour above, it seems to me that the fix is not as simple as packaging such a build.
I also looked at the Go package, as people seem to have got it working on M1 ( https://github.com/confluentinc/confluent-kafka-go/issues/591 ). The difference there is that it uses cgo and adds do_produce, a function with non-variadic arguments that calls producev. The Go code only ever calls do_produce.
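The wrapper approach can be sketched roughly as follows. This is an illustration under assumptions, not the actual cgo code: `shim_producev` is a hypothetical stand-in for the variadic function, and `shim_do_produce` is a hypothetical fixed-signature wrapper. The point is that the variadic call is made from C compiled for the same platform, so the caller outside C only ever sees a fixed argument list.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>

/* Hypothetical tags and types, for illustration only. */
enum shim_tag { SHIM_END = 0, SHIM_TOPIC = 1, SHIM_VALUE = 4 };

struct shim_msg {
    const char *topic;
    const void *value;
    size_t      value_len;
};

/* Hypothetical variadic callee standing in for producev. */
static int shim_producev(struct shim_msg *out, ...) {
    va_list ap;
    int tag;
    int rc = 0;

    assert(out != NULL);
    out->topic = NULL;
    out->value = NULL;
    out->value_len = 0;

    va_start(ap, out);
    while (rc == 0 && (tag = va_arg(ap, int)) != SHIM_END) {
        switch (tag) {
        case SHIM_TOPIC:
            out->topic = va_arg(ap, const char *);
            break;
        case SHIM_VALUE:
            out->value = va_arg(ap, const void *);
            out->value_len = va_arg(ap, size_t);
            break;
        default:
            rc = -1;
            break;
        }
    }
    va_end(ap);
    return (rc == 0 && out->topic != NULL) ? 0 : -1;
}

/* Fixed-signature wrapper: the C compiler emits the variadic call with the
 * platform's own convention, so foreign callers never build a vararg list. */
int shim_do_produce(struct shim_msg *out, const char *topic,
                    const void *value, size_t value_len) {
    return shim_producev(out,
                         SHIM_TOPIC, topic,
                         SHIM_VALUE, value, value_len,
                         SHIM_END);
}
```

A runtime that calls `shim_do_produce` only needs to get the ordinary, non-variadic calling convention right, which may explain why the Go binding is unaffected.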
Going back to why I believe varargs are at fault, there are a few articles talking about how calling convention was changed and how things are broken specifically on M1 in certain cases:
- https://cpufun.substack.com/p/what-about-?s=r
- https://developer.apple.com/documentation/xcode/writing-arm64-code-for-apple-platforms#Update-Code-that-Passes-Arguments-to-Variadic-Functions
- https://github.com/dotnet/runtime/issues/48752#issuecomment-786112023
Considering that there is rd_kafka_produceva, which looks like it was introduced in order to avoid the va-args calling convention ( https://github.com/edenhill/librdkafka/pull/2902 ), maybe it would make sense to switch to that as a solution?
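For comparison, an array-based interface avoids va_list entirely. The sketch below only illustrates that idea with hypothetical names (`arr_produce`, `struct arr_arg`, `ARR_*`); it does not reproduce the real rd_kafka_produceva signature or its argument type, which should be checked in rdkafka.h.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tagged-argument type, for illustration only. */
enum arr_tag { ARR_TOPIC = 1, ARR_PARTITION = 3, ARR_VALUE = 4 };

struct arr_arg {
    enum arr_tag tag;
    union {
        const char *topic;
        int         partition;
        struct {
            const void *ptr;
            size_t      len;
        } value;
    } u;
};

struct arr_msg {
    const char *topic;
    int         partition;
    const void *value;
    size_t      value_len;
};

/* Reads cnt tagged arguments from a plain array.
 * Returns 0 on success, -1 on an unknown tag or when no topic was supplied. */
int arr_produce(struct arr_msg *out, const struct arr_arg *args, size_t cnt) {
    size_t i;

    assert(out != NULL);
    assert(args != NULL || cnt == 0);

    out->topic = NULL;
    out->partition = -1;
    out->value = NULL;
    out->value_len = 0;

    for (i = 0; i < cnt; i++) {
        switch (args[i].tag) {
        case ARR_TOPIC:
            out->topic = args[i].u.topic;
            break;
        case ARR_PARTITION:
            out->partition = args[i].u.partition;
            break;
        case ARR_VALUE:
            out->value = args[i].u.value.ptr;
            out->value_len = args[i].u.value.len;
            break;
        default:
            return -1;
        }
    }
    return out->topic != NULL ? 0 : -1;
}
```

With this shape the function takes exactly three fixed arguments (a handle, a pointer, and a count), so the only thing the caller has to match is the struct layout, not a platform-specific varargs convention.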
How to reproduce
- brew install librdkafka
- Create new dotnet core 6 project with basic kafka producer example (I used first example from https://github.com/confluentinc/confluent-kafka-dotnet#usage ).
- Link brew librdkafka in csproj:
<ItemGroup Condition="'$([System.Runtime.InteropServices.RuntimeInformation]::OSArchitecture)' == 'Arm64' And '$([System.Runtime.InteropServices.RuntimeInformation]::IsOSPlatform($([System.Runtime.InteropServices.OSPlatform]::OSX)))' == 'True'">
  <Content Include="/opt/homebrew/Cellar/librdkafka/1.8.2/lib/librdkafka.dylib">
    <Link>librdkafka.dylib</Link>
    <CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
  </Content>
</ItemGroup>
- Launch kafka (I used docker-compose from https://developer.confluent.io/quickstart/kafka-docker/ )
- Try to produce a message.
The application should run, indicating that the library is loaded and linked, but produce will return error code -186, which corresponds to the INVALID_ARG error described above.
Checklist
Please provide the following information:
- A complete (i.e. we can run it), minimal program demonstrating the problem. No need to supply a project file.
- Confluent.Kafka nuget version. 1.8.2
- Apache Kafka version. 7.0.1 docker images
- Client configuration.
- Operating system. OSX Aarch64
- Provide logs (with “debug” : “…” as necessary in configuration).
- Provide broker log excerpts.
- Critical issue.
Top GitHub Comments
sounds good to me - thanks! i’ll review it for inclusion in v1.9.0
The error reported by @niemyjski is due to dotnet not being able to load the x64 librdkafka binary. It seems likely to me there is a way to make it work (given rosetta etc.), but I don’t know what that is, and it’ll be less optimal than running a native build. One way to make this work is to compile librdkafka from source (configure, make, make install) and overwrite ~/.nuget/packages/librdkafka.redist/1.8.2/runtimes/osx-x64/native/librdkafka.dylib with the one from /usr/local/lib/librdkafka.1.dylib
Or use Library.Load before calling any other Confluent Kafka methods (that approach may be broken, I’m unsure).
@edenhill - i think we should start including an apple silicon build in librdkafka.redist. it’ll be well used.