Consumer subscribers are not being called.

Description

We’ve been writing a Foundatio message bus implementation around Kafka and noticed that our tests are extremely flaky in some cases (see the test failures at https://github.com/FoundatioFx/Foundatio.Kafka/actions). The common factor so far is that when multiple consumers listen to the same topic, the consumers are never notified of a topic message. 1.9.0 helped a lot with reliability locally, but we still get failures at random. My local machine (a 5900X) has a lot more resources than the build server.

How to reproduce

  1. Clone https://github.com/FoundatioFx/Foundatio.Kafka
  2. Run docker compose up in the cloned folder.
  3. Open the solution and run dotnet test.

The test KafkaMessageBusTests.CanSendMessageToMultipleSubscribersAsync seems to reproduce this error most easily (after a few runs), and it is also the simplest.

[Fact]
public override async Task CanSendMessageToMultipleSubscribersAsync() {
    var messageBus = GetMessageBus();
    if (messageBus == null)
        return;

    try {
        var countdown = new AsyncCountdownEvent(3);
        await messageBus.SubscribeAsync<SimpleMessageA>(msg => {
            Assert.Equal("Hello", msg.Data);
            countdown.Signal();
        });
        await messageBus.SubscribeAsync<SimpleMessageA>(msg => {
            Assert.Equal("Hello", msg.Data);
            countdown.Signal();
        });
        await messageBus.SubscribeAsync<SimpleMessageA>(msg => {
            Assert.Equal("Hello", msg.Data);
            countdown.Signal();
        });
        await messageBus.PublishAsync(new SimpleMessageA {
            Data = "Hello"
        });

        await countdown.WaitAsync(TimeSpan.FromSeconds(2));
        Assert.Equal(0, countdown.CurrentCount);
    } finally {
        await CleanupMessageBusAsync(messageBus);
    }
}

Under the hood, each call to subscribe ensures the topic exists and then creates a consumer that listens in a loop, but only if a listener isn’t already running (there is at most one listener per bus instance). I’ve included logs of varying detail.
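
To make the "at most one listener per bus instance" behavior concrete, here is a minimal sketch of that pattern. It is not the Foundatio.Kafka implementation: the SingleListenerBus class, its members, and the delay standing in for a Kafka consume call are all hypothetical, and the topic-creation step is left out.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for a message bus; illustrates the pattern only.
public sealed class SingleListenerBus : IDisposable {
    private readonly object _gate = new object();
    private readonly CancellationTokenSource _cts = new CancellationTokenSource();
    private Task _listener;      // stays null until the first subscriber arrives
    private int _listenerStarts; // how many times a listen loop was started

    public int ListenerStarts => Volatile.Read(ref _listenerStarts);

    public Task SubscribeAsync() {
        // Step 1 (omitted in this sketch): ensure the topic exists.
        // Step 2: start the listen loop only if one is not already running.
        lock (_gate) {
            if (_listener == null) {
                _listenerStarts++;
                _listener = Task.Run(() => ListenLoopAsync(_cts.Token));
            }
        }
        return Task.CompletedTask;
    }

    private static async Task ListenLoopAsync(CancellationToken token) {
        while (!token.IsCancellationRequested) {
            try {
                // Assumption: a real bus would consume from Kafka here and
                // dispatch the message to every registered handler.
                await Task.Delay(TimeSpan.FromMilliseconds(50), token);
            } catch (OperationCanceledException) {
                break; // shutdown was requested
            }
        }
    }

    public void Dispose() {
        _cts.Cancel();
        Task listener;
        lock (_gate) { listener = _listener; }
        listener?.Wait(TimeSpan.FromSeconds(1)); // give the loop a moment to exit
        _cts.Dispose();
    }
}
```

With this shape, three SubscribeAsync calls on one instance start a single loop, so all three handlers in the test above depend on that one consumer receiving the message.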

Checklist

Please provide the following information:

  • A complete (i.e. we can run it), minimal program demonstrating the problem. No need to supply a project file.
  • Confluent.Kafka nuget version (latest 1.9.0 rc)
  • Apache Kafka version (latest)
  • Client configuration.
  • Operating system (all)
  • Provide logs (with “debug” : “…” as necessary in configuration).
  • Provide broker log excerpts.
  • Critical issue.

Test logs with handlers not commented out (https://github.com/FoundatioFx/Foundatio.Kafka/blob/main/src/Foundatio.Kafka/Messaging/KafkaMessageBus.cs#L167-L173)

I saw a similar issue and wondered whether I should not be using these handlers (https://github.com/ah-/rdkafka-dotnet/issues/61). While googling I also came across this issue, though I don’t know whether it is related (https://github.com/confluentinc/confluent-kafka-python/issues/970).

See the following gist for all the unit test logs (both passing and failing, with varying levels of debug logging). GitHub wouldn’t let me post them here because it said the comment was too long: https://gist.github.com/niemyjski/bac539002aa046738d6e029d0d1ba688

Broker logs from recent failure

kafka                   | [2022-06-08 20:11:53,310] INFO Creating topic test_1ab960c172c84b5caf31c969d87c5a4f with configuration {} and initial partition assignment Map(0 -> ArrayBuffer(1)) (kafka.zk.AdminZkClient)
kafka                   | [2022-06-08 20:11:53,319] INFO [Controller id=1] New topics: [Set(test_1ab960c172c84b5caf31c969d87c5a4f)], deleted topics: [Set()], new partition replica assignment [Set(TopicIdReplicaAssignment(test_1ab960c172c84b5caf31c969d87c5a4f,Some(CJssgWbAThWmPO9sa96mYg),Map(test_1ab960c172c84b5caf31c969d87c5a4f-0 -> ReplicaAssignment(replicas=1, addingReplicas=, removingReplicas=))))] (kafka.controller.KafkaController)
kafka                   | [2022-06-08 20:11:53,319] INFO [Controller id=1] New partition creation callback for test_1ab960c172c84b5caf31c969d87c5a4f-0 (kafka.controller.KafkaController)
kafka                   | [2022-06-08 20:11:53,319] INFO [Controller id=1 epoch=1] Changed partition test_1ab960c172c84b5caf31c969d87c5a4f-0 state from NonExistentPartition to NewPartition with assigned replicas 1 (state.change.logger)
kafka                   | [2022-06-08 20:11:53,319] INFO [Controller id=1 epoch=1] Sending UpdateMetadata request to brokers Set() for 0 partitions (state.change.logger)
kafka                   | [2022-06-08 20:11:53,320] INFO [Controller id=1 epoch=1] Sending UpdateMetadata request to brokers Set() for 0 partitions (state.change.logger)
kafka                   | [2022-06-08 20:11:53,328] INFO [Controller id=1 epoch=1] Changed partition test_1ab960c172c84b5caf31c969d87c5a4f-0 from NewPartition to OnlinePartition with state LeaderAndIsr(leader=1, leaderEpoch=0, isr=List(1), leaderRecoveryState=RECOVERED, zkVersion=0) (state.change.logger)
kafka                   | [2022-06-08 20:11:53,328] INFO [Controller id=1 epoch=1] Sending LeaderAndIsr request to broker 1 with 1 become-leader and 0 become-follower partitions (state.change.logger)
kafka                   | [2022-06-08 20:11:53,328] INFO [Controller id=1 epoch=1] Sending UpdateMetadata request to brokers Set(1) for 1 partitions (state.change.logger)
kafka                   | [2022-06-08 20:11:53,328] INFO [Controller id=1 epoch=1] Sending UpdateMetadata request to brokers Set() for 0 partitions (state.change.logger)
kafka                   | [2022-06-08 20:11:53,328] INFO [Broker id=1] Handling LeaderAndIsr request correlationId 1317 from controller 1 for 1 partitions (state.change.logger)
kafka                   | [2022-06-08 20:11:53,329] INFO [ReplicaFetcherManager on broker 1] Removed fetcher for partitions Set(test_1ab960c172c84b5caf31c969d87c5a4f-0) (kafka.server.ReplicaFetcherManager)
kafka                   | [2022-06-08 20:11:53,329] INFO [Broker id=1] Stopped fetchers as part of LeaderAndIsr request correlationId 1317 from controller 1 epoch 1 as part of the become-leader transition for 1 partitions (state.change.logger)
kafka                   | [2022-06-08 20:11:53,330] INFO [LogLoader partition=test_1ab960c172c84b5caf31c969d87c5a4f-0, dir=/bitnami/kafka/data] Loading producer state till offset 0 with message format version 2 (kafka.log.UnifiedLog$)
kafka                   | [2022-06-08 20:11:53,331] INFO Created log for partition test_1ab960c172c84b5caf31c969d87c5a4f-0 in /bitnami/kafka/data/test_1ab960c172c84b5caf31c969d87c5a4f-0 with properties {} (kafka.log.LogManager)
kafka                   | [2022-06-08 20:11:53,331] INFO [Partition test_1ab960c172c84b5caf31c969d87c5a4f-0 broker=1] No checkpointed highwatermark is found for partition test_1ab960c172c84b5caf31c969d87c5a4f-0 (kafka.cluster.Partition)
kafka                   | [2022-06-08 20:11:53,331] INFO [Partition test_1ab960c172c84b5caf31c969d87c5a4f-0 broker=1] Log loaded for partition test_1ab960c172c84b5caf31c969d87c5a4f-0 with initial high watermark 0 (kafka.cluster.Partition)
kafka                   | [2022-06-08 20:11:53,331] INFO [Broker id=1] Leader test_1ab960c172c84b5caf31c969d87c5a4f-0 starts at leader epoch 0 from offset 0 with high watermark 0 ISR [1] addingReplicas [] removingReplicas []. Previous leader epoch was -1. (state.change.logger)
kafka                   | [2022-06-08 20:11:53,348] INFO [Broker id=1] Finished LeaderAndIsr request in 20ms correlationId 1317 from controller 1 for 1 partitions (state.change.logger)
kafka                   | [2022-06-08 20:11:53,349] INFO [Broker id=1] Add 1 partitions and deleted 0 partitions from metadata cache in response to UpdateMetadata request sent by controller 1 epoch 1 with correlation id 1318 (state.change.logger)

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 18 (6 by maintainers)

Top GitHub Comments

mhowlett commented, Jun 17, 2022 (1 reaction)

Getting Broker: Unknown topic or partition after you’ve waited for topic creation on a single broker cluster seems very odd.

niemyjski commented, Feb 24, 2023 (0 reactions)

Latest 2.0.2 release seems better but still getting failures:

43:46.62225 E:KafkaMessageBus - Error consuming test_ef3b4c5799cd43a5b174f2ed3aee1d43 GroupId=5e9f8df1a6604840bc4ed9642045cb43 message: Failed to query logical offset END: Broker: Unknown topic or partition
43:46.62443 E:KafkaMessageBus - Error consuming test_ef3b4c5799cd43a5b174f2ed3aee1d43 GroupId=0843dbcf82254112bf558efef3a773f4 message: Failed to query logical offset END: Broker: Unknown topic or partition
43:46.62447 E:KafkaMessageBus - Error consuming test_ef3b4c5799cd43a5b174f2ed3aee1d43 GroupId=51da660955af4f0aabf1a3838550b0b1 message: Failed to query logical offset END: Broker: Unknown topic or partition
43:46.62448 E:KafkaMessageBus - Error consuming test_ef3b4c5799cd43a5b174f2ed3aee1d43 GroupId=9dee71745289471aadb14bf5ccc874bc message: Failed to query logical offset END: Broker: Unknown topic or partition
43:46.62449 E:KafkaMessageBus - Error consuming test_ef3b4c5799cd43a5b174f2ed3aee1d43 GroupId=468a61b9b37641b092f930e799057293 message: Failed to query logical offset END: Broker: Unknown topic or partition
43:46.62450 E:KafkaMessageBus - Error consuming test_ef3b4c5799cd43a5b174f2ed3aee1d43 GroupId=d65ac7ebe61841558e02e80385800493 message: Failed to query logical offset END: Broker: Unknown topic or partition
43:46.92477 E:KafkaMessageBus - Error consuming test_ef3b4c5799cd43a5b174f2ed3aee1d43 GroupId=d65ac7ebe61841558e02e80385800493 message: Failed to query logical offset END: Broker: Unknown topic or partition
