Flakiness issues with subscriptionKeySharedUseConsistentHashing=true / PIP-119 in CPP tests

Describe the bug

Quoting @BewareMyPower from #13963

I tried to use three Java consumers with Key_Shared subscription to consume the topic produced by C++ test KeySharedConsumerTest.testMultiTopics. Sometimes not all messages can be received as well. It looks like there is something wrong with the consistent hashing implementation of Key_Shared dispatcher.
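With subscriptionKeySharedUseConsistentHashing=true, the Key_Shared dispatcher routes each message key to a consumer through a hash ring. As a rough mental model only, here is a minimal sketch of a consistent-hashing ring. This is not Pulsar's implementation. The class name ConsistentHashRing, the FNV-1a hash, and the number of points per consumer are assumptions chosen for illustration. Each consumer owns several points on the ring, and a key goes to the first point at or after the key's hash, wrapping around.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a consistent-hashing ring; not Pulsar's selector.
class ConsistentHashRing {
    // Ring position -> consumer name, ordered by position.
    private final TreeMap<Integer, String> ring = new TreeMap<>();
    private final int pointsPerConsumer;

    ConsistentHashRing(int pointsPerConsumer) {
        this.pointsPerConsumer = pointsPerConsumer;
    }

    // 32-bit FNV-1a, used here only because it is short and deterministic.
    static int hash(String s) {
        int h = 0x811c9dc5;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x01000193;
        }
        return h;
    }

    // Places several points for the consumer on the ring.
    void addConsumer(String name) {
        for (int i = 0; i < pointsPerConsumer; i++) {
            ring.put(hash(name + "#" + i), name);
        }
    }

    // Removes only the points that still belong to this consumer.
    void removeConsumer(String name) {
        for (int i = 0; i < pointsPerConsumer; i++) {
            ring.remove(hash(name + "#" + i), name);
        }
    }

    // Returns the owner of the first point at or after hash(key), wrapping around.
    String select(String key) {
        if (ring.isEmpty()) {
            return null;
        }
        Map.Entry<Integer, String> e = ring.ceilingEntry(hash(key));
        return (e != null ? e : ring.firstEntry()).getValue();
    }
}
```

The property this design is meant to provide is that adding or removing a consumer reassigns only the keys whose nearest point changes. Keys owned by the remaining consumers stay where they were.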

I also made similar observations based on C++ test logs:

Example of failures:

FAILED TESTS (3/279):
    9941 ms: ./main KeySharedConsumerTest.testMultiTopics (try #1)
    6242 ms: ./main KeySharedConsumerTest.testKeyBasedBatching (try #1)
    9740 ms: ./main KeySharedConsumerTest.testMultiTopics (try #2)

Full logs: https://github.com/apache/pulsar/suites/5064608592/artifacts/150614790

2022-01-26 11:26:24.372 INFO  [140238723073792] MultiTopicsConsumerImpl:95 | Successfully Subscribed to Topics
2022-01-26 11:26:33.950 INFO  [140238845213440] KeySharedConsumerTest:124 | messagesPerConsumer: {0 => 1098, 1 => 811, 2 => 1027}
/pulsar/pulsar-client-cpp/tests/KeySharedConsumerTest.cc:129: Failure
Value of: expectedNumTotalMessages
  Actual: 3000
Expected: numTotalMessages
Which is: 2936
2022-01-26 11:26:33.951 INFO  [140238845213440] ClientImpl:496 | Closing Pulsar client with 3 producers and 3 consumers
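The failed assertion is plain arithmetic over the logged per-consumer counts: 1098 + 811 + 1027 = 2936. In other words, 64 of the 3000 expected messages were not received in this run. The following sketch reproduces that check in Java. The class and method names are hypothetical and are not taken from the C++ test.

```java
import java.util.Map;

// Hypothetical helper: recomputes the check behind the failed assertion.
class MessageCountCheck {
    // Sums the number of messages each consumer received.
    static int total(Map<Integer, Integer> messagesPerConsumer) {
        int sum = 0;
        for (int n : messagesPerConsumer.values()) {
            sum += n;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Counts copied from the log line: {0 => 1098, 1 => 811, 2 => 1027}
        Map<Integer, Integer> counts = Map.of(0, 1098, 1, 811, 2, 1027);
        int expected = 3000;
        int received = total(counts);
        // prints "received=2936, missing=64"
        System.out.println("received=" + received + ", missing=" + (expected - received));
    }
}
```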

To Reproduce

Steps to reproduce the behavior:

  1. Set subscriptionKeySharedUseConsistentHashing=true
  2. Produce keyed messages to multiple topics
  3. Consume the messages with a Key_Shared subscription

Expected behavior

There shouldn’t be any message loss.

Top GitHub Comments

BewareMyPower commented, Jan 26, 2022

Yeah, here is a screenshot of my Java consumer application. The topic was created by the C++ unit test and received 3000 messages from the C++ producer. The Java consumers should have received 3000 messages in total, and sometimes they do.

[screenshot]

The code is:

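    import java.util.concurrent.TimeUnit;

    import org.apache.pulsar.client.api.Consumer;
    import org.apache.pulsar.client.api.ConsumerBuilder;
    import org.apache.pulsar.client.api.Message;
    import org.apache.pulsar.client.api.PulsarClient;
    import org.apache.pulsar.client.api.PulsarClientException;
    import org.apache.pulsar.client.api.SubscriptionInitialPosition;
    import org.apache.pulsar.client.api.SubscriptionType;

    // The class name is hypothetical; the original comment showed only the two methods.
    public class KeySharedReceiveCheck {

        // Receives from one consumer until no message arrives within 2 seconds and
        // returns the number of messages received. Messages are not acknowledged.
        private static int receive(Consumer<byte[]> consumer) throws PulsarClientException {
            int n = 0;
            while (true) {
                final Message<byte[]> msg = consumer.receive(2, TimeUnit.SECONDS);
                if (msg == null) {
                    break; // timed out: treat this consumer as drained
                }
                n++;
                System.out.println("Received " + new String(msg.getValue())
                        + " from " + msg.getMessageId() + ", key: " + msg.getKey());
            }
            return n;
        }

        public static void main(String[] args) throws PulsarClientException {
            final PulsarClient client = PulsarClient.builder().serviceUrl("pulsar://localhost:6650").build();
            // Subscribe to every topic created by the C++ test, starting from the earliest message.
            final ConsumerBuilder<byte[]> builder = client.newConsumer()
                    .topicsPattern(".*KeySharedConsumerTest-multi-topics.*")
                    .subscriptionInitialPosition(SubscriptionInitialPosition.Earliest)
                    .subscriptionType(SubscriptionType.Key_Shared)
                    .subscriptionName("my-sub-1");
            // Three consumers on the same Key_Shared subscription.
            final Consumer<byte[]> consumer1 = builder.clone().subscribe();
            final Consumer<byte[]> consumer2 = builder.clone().subscribe();
            final Consumer<byte[]> consumer3 = builder.clone().subscribe();
            // The consumers are drained one after another, not concurrently.
            int n1 = receive(consumer1);
            int n2 = receive(consumer2);
            int n3 = receive(consumer3);
            System.out.println("n1: " + n1 + ", n2: " + n2 + ", n3: " + n3 + ", total: " + (n1 + n2 + n3));
            client.close();
        }
    }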

But I can’t easily reproduce it with a Java unit test at the moment.

lhotari commented, Jan 31, 2022

“I just checked: after adding consumer acks, the messages are there in the backlog, so this shouldn’t be characterized as ‘message loss’.”

I agree, and I’ve renamed the issue. I’ll close it since it seems to be addressed. @BewareMyPower, please reopen if there’s more to do.
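The point here is the difference between messages that were never delivered and messages that were lost. Once consumers acknowledge what they receive, any message they never received stays in the subscription backlog, where it can still be inspected. The following toy in-memory model illustrates that bookkeeping. The class ToyBacklog is hypothetical and does not model Pulsar's dispatcher or its real backlog accounting. In this model, a message leaves the backlog only when it is acknowledged, not when it is delivered.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Hypothetical toy model of a subscription backlog; NOT Pulsar's implementation.
class ToyBacklog {
    private final Deque<Integer> undelivered = new ArrayDeque<>();
    private final Set<Integer> unacked = new HashSet<>();

    ToyBacklog(int numMessages) {
        for (int id = 0; id < numMessages; id++) {
            undelivered.add(id);
            unacked.add(id);
        }
    }

    // Hands out the next undelivered message id, or null if none is left.
    Integer deliver() {
        return undelivered.poll();
    }

    // Acknowledging removes the message from the backlog.
    void acknowledge(int id) {
        unacked.remove(id);
    }

    // Messages not yet acknowledged, whether delivered or not.
    int backlogSize() {
        return unacked.size();
    }
}
```

For example, if 3000 messages are produced and only 2936 are delivered and acknowledged, 64 remain in the backlog. If all 3000 are delivered but none are acknowledged, the backlog stays at 3000.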