Large number of unstable GitHub tests for Pulsar


Description

Pulsar tests regularly succeed locally, but after a Pull Request is submitted to merge changes into master, some of the tests (usually fewer than 3 at a time) fail at random.

Examples include, but are not limited to:

In CI - CPP, Python Tests / cpp-tests:

  • BasicEndToEndTest.testPatternEmptyUnsubscribe
    
  • BasicEndToEndTest.testSinglePartitionRoutingPolicy
    

In CI - Unit - Brokers:

  • org.apache.pulsar.client.api.SimpleProducerConsumerTest
      -     org.apache.pulsar.client.api.SimpleProducerConsumerTest.setup
    
  • org.apache.pulsar.client.impl.BrokerClientIntegrationTest
      -     testUnsupportedBatchMessageConsumer
            -     which gave: BrokerClientIntegrationTest.testUnsupportedBatchMessageConsumer:388->ProducerConsumerBase.testMessageOrderAndDuplicates:57 Received message my-message-7 did not match the expected message my-message-0 expected [my-message-0] but found [my-message-7]
    
  • org.apache.pulsar.client.api.PartitionCreationTest
      -     testCreateConsumerForPartitionedTopicUpdateWhenDisableTopicAutoCreation
    
  • org.apache.pulsar.client.impl.ReaderTest
      -     testReadMessageWithBatchingWithMessageInclusive
            -     which gave: java.lang.AssertionError: expected [true] but found [false]
    

In CI - Unit - Flaky:

  • org.apache.pulsar.client.kafka.test.KafkaProducerSimpleConsumerTest
      -     testPulsarKafkaProducerWithSerializer
    
  • org.apache.pulsar.functions.worker.PulsarFunctionE2ESecurityTest
      -     testAuthorizationWithAnonymousUser
    
  • org.apache.pulsar.broker.service.PersistentFailoverE2ETest
      -     testSimpleConsumerEventsWithoutPartition
             -     which gave: java.lang.AssertionError: expected [null] but found [-1]
    

In CI - Unit - Proxy:

  • org.apache.pulsar.proxy.server.ProxyParserTest
      -     org.apache.pulsar.proxy.server.ProxyParserTest.testRegexSubscription
             -    which gave: org.apache.pulsar.client.api.PulsarClientException: java.util.concurrent.ExecutionException: org.apache.pulsar.client.api.PulsarClientException: Disconnected from server at fv-az98.onfd2ysmc4sedambc1t2u4afph.cx.internal.cloudapp.net/10.1.0.4:33515
    

In CI - Integration - Function State:

  • org.apache.pulsar.tests.integration.functions.PulsarStateTest
      -     org.apache.pulsar.tests.integration.functions.PulsarStateTest.pulsar-standalone-suite
      -     PulsarStateTest.testPythonWordCountFunction
             -     which gave: PulsarStateTest.testPythonWordCountFunction:78->publishAndConsumeMessages:410 » ThreadTimeout
      -     PulsarStateTest.testSinkState
             -     which gave: PulsarStateTest.testSinkState:183 expected [val1-9] but found [val1-8]
    

In CI - Unit - Adapters:

  • org.apache.pulsar.storm.PulsarBoltTest
      -     beforeMethod
             -     which gave: java.lang.IllegalStateException: Failed to initialize producer for persistent://my-property/my-ns/my-topic1 : HTTP get request failed: Internal Server Error
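
Some of the failures listed above (for example `expected [val1-9] but found [val1-8]` in PulsarStateTest.testSinkState) have the shape of an assertion that ran before asynchronously updated state had settled. That diagnosis is an assumption, not something the logs confirm. Under that assumption, a minimal sketch of polling for a condition with a timeout, instead of asserting immediately, could look like this. The class and method names (`AwaitSketch`, `awaitCondition`) are hypothetical and not part of Pulsar:

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

public class AwaitSketch {
    /**
     * Polls the condition until it holds or the timeout elapses.
     * Returns true if the condition held, false on timeout or interruption.
     */
    static boolean awaitCondition(BooleanSupplier condition, Duration timeout, Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Hypothetical stand-in for state that a background component updates over time.
        AtomicInteger state = new AtomicInteger(0);
        Thread writer = new Thread(() -> {
            for (int i = 0; i < 10; i++) {
                try {
                    Thread.sleep(5);
                } catch (InterruptedException e) {
                    return;
                }
                state.set(i);
            }
        });
        writer.start();

        // Wait for the final value rather than asserting on whatever is there right now.
        boolean reached = awaitCondition(() -> state.get() == 9,
                Duration.ofSeconds(5), Duration.ofMillis(10));
        writer.join();
        System.out.println("condition met: " + reached);
    }
}
```

The timeout bounds how long a genuinely broken test can hang, while the polling loop tolerates the scheduling delays that a loaded CI machine introduces.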
    

Regarding org.apache.pulsar.client.api.SimpleProducerConsumerTest.setup, I found an interesting exception message:

org.apache.pulsar.client.api.SimpleProducerConsumerTest.setup(org.apache.pulsar.client.api.SimpleProducerConsumerTest)
[ERROR] Run 1: SimpleProducerConsumerTest.setup:108->MockedPulsarServiceBaseTest.internalSetup:107->MockedPulsarServiceBaseTest.init:144->MockedPulsarServiceBaseTest.startBroker:195->MockedPulsarServiceBaseTest.startBroker:218 » WrongTypeOfReturnValue

WrongTypeOfReturnValue is a Mockito exception, so this message suggests that there’s a race condition in the mocking framework (or in our use of it). Perhaps there are known concurrency bugs in some of the versions of the test libraries we are using.

Each time the tests are run (e.g. for PR #6031), different tests fail. There seems to be no consistency to them at all.
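
Because a different set of tests fails on each run, one way to tell an intermittent test from a consistently broken one is to run the same check many times and count the failures. The following is a minimal sketch of that idea using only the JDK. `FlakinessProbe` and `countFailures` are hypothetical names, and the lambda in `main` is a deterministic stand-in for a real test body:

```java
import java.util.concurrent.Callable;

public class FlakinessProbe {
    /**
     * Runs the check `runs` times and returns how many runs failed.
     * A run counts as failed if it returns false or throws.
     */
    static int countFailures(Callable<Boolean> check, int runs) {
        int failures = 0;
        for (int i = 0; i < runs; i++) {
            try {
                if (!check.call()) {
                    failures++;
                }
            } catch (Exception | AssertionError e) {
                // AssertionError is an Error, not an Exception, so it is caught explicitly.
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for a flaky test: fails on every 4th invocation.
        int[] calls = {0};
        int failures = countFailures(() -> ++calls[0] % 4 != 0, 20);
        System.out.println(failures + " of 20 runs failed"); // prints "5 of 20 runs failed"
    }
}
```

A failure count strictly between 0 and the number of runs points to an intermittent test. For tests already written with TestNG, the `invocationCount` attribute of `@Test` provides a comparable repeat-run mechanism.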

There is also a risk that there are concurrency bugs in the actual framework that only appear in certain environments. If this is the case, then these bugs could result in instability for certain users in production environments.

Expected behavior

Tests should not randomly fail when run by Jenkins or the GitHub CI Action test runner after a Pull Request is submitted. These random failures significantly slow the rate at which PRs can be merged and introduce other risks.

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Reactions: 1
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

2 reactions
devinbost commented, Jan 30, 2020

After discovering quite a few closed issues about intermittent tests, I realized that almost all of them came down to race conditions or concurrency issues around shared state. So, I’ll start looking at these failing tests individually and in greater depth. I’m hoping that some common causes will emerge, but so far, it looks like each of the solved intermittent tests has been a unique case.
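
To illustrate the kind of shared-state race described here, the sketch below has several threads increment two counters: a plain `int` field, whose `value++` is a non-atomic read-modify-write, and an `AtomicInteger`. It is a generic illustration rather than code from any of the Pulsar tests, and all names in it are hypothetical:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class SharedStateSketch {
    /** Mutable holder with a plain int; concurrent `value++` calls can overwrite each other. */
    static final class PlainCounter {
        int value;
    }

    /** Returns {plainTotal, atomicTotal} after all worker threads have finished. */
    static int[] incrementConcurrently(int threads, int perThread) {
        PlainCounter plain = new PlainCounter();
        AtomicInteger atomic = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    plain.value++;            // unsynchronized: updates may be lost
                    atomic.incrementAndGet(); // atomic: no update is lost
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join(); // join makes the workers' writes visible to this thread
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return new int[] {plain.value, atomic.get()};
    }

    public static void main(String[] args) {
        int[] totals = incrementConcurrently(8, 100_000);
        System.out.println("plain counter:  " + totals[0] + " (may be less than 800000)");
        System.out.println("atomic counter: " + totals[1]);
    }
}
```

The plain counter's total can differ from run to run and from machine to machine, which matches the pattern in this issue: a test that depends on such state can pass locally and fail on a CI runner with different timing.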

1 reaction
devinbost commented, Feb 3, 2020

I found an interesting study that examined common causes of flaky tests: http://mir.cs.illinois.edu/~qluo2/fse14LuoHEM.pdf
