(Flaky-test) Intermittent failure of AntiAffinityNamespaceGroupTest.testBrokerSelectionForAntiAffinityGroup()
I came across this issue locally by running this command:
mvn test '-Dtest=!PersistentTransactionBufferTest,!PulsarFunctionE2ESecurityTest,!ServerCnxTest,!AdminApiOffloadTest,!AdminApiSchemaValidationEnforced,!V1_AdminApiTest2,!ProxyPublishConsumeTlsTest,!PulsarFunctionE2ETest,!MessageIdSerialization,!AdminApiTest2,!PulsarFunctionLocalRunTest,!PartitionedProducerConsumerTest,!KafkaProducerSimpleConsumerTest,!MessagePublishThrottlingTest,!ReaderTest,!RackAwareTest,!SimpleProducerConsumerTest,!V1_ProducerConsumerTest,!PersistentFailoverE2ETest,!BrokerClientIntegrationTest,!ReplicatorRateLimiterTest' -DfailIfNoTests=false -pl pulsar-broker
(This is the command used to run the GitHub CI Unit Broker test.)
[ERROR] Tests run: 7, Failures: 1, Errors: 0, Skipped: 1, Time elapsed: 38.574 s <<< FAILURE! - in org.apache.pulsar.broker.loadbalance.AntiAffinityNamespaceGroupTest
[ERROR] testBrokerSelectionForAntiAffinityGroup(org.apache.pulsar.broker.loadbalance.AntiAffinityNamespaceGroupTest) Time elapsed: 0.229 s <<< FAILURE!
java.lang.AssertionError: null
at org.testng.Assert.fail(Assert.java:96)
at org.testng.Assert.assertNotEquals(Assert.java:1157)
at org.testng.Assert.assertNotEquals(Assert.java:1162)
at org.apache.pulsar.broker.loadbalance.AntiAffinityNamespaceGroupTest.testBrokerSelectionForAntiAffinityGroup(AntiAffinityNamespaceGroupTest.java:425)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
However, I am having trouble reproducing this issue by running just the test that failed, so perhaps there is a concurrency issue with another test.
Top GitHub Comments
nice work @michaeljmarshall
The fundamental problem with this test is that the broker’s implementation of anti-affinity does not guarantee that two namespaces in the same anti-affinity group will be placed on different broker nodes. When all of the preferred brokers are overloaded, the selectBrokerForAssignment method will choose a broker that has a namespace in the same anti-affinity group. However, the test essentially asserts that two namespaces in the same anti-affinity group should never get placed on the same broker.
I propose that we set the overloaded threshold config high enough that it won’t allow for the preferred broker to be overridden. Additional detail on how I reached this conclusion follows.
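As a rough illustration of the fallback described above, here is a toy model in plain Java. It is not Pulsar's actual code: the class AntiAffinitySketch, the selectBroker signature, the broker names, and the usage numbers are all hypothetical, and picking the least-loaded candidate stands in for whatever placement strategy the broker really uses. It only shows how an overloaded threshold can override the anti-affinity preference, and how raising the threshold removes that override.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

/** Toy model of the overload fallback; names and logic are illustrative assumptions. */
final class AntiAffinitySketch {

    /**
     * usagePercent: broker name -> load in percent (hypothetical metric).
     * groupOwners: brokers that already own a namespace of the same anti-affinity group.
     */
    static String selectBroker(Map<String, Double> usagePercent,
                               Set<String> groupOwners,
                               double overloadedThresholdPercent) {
        // Step 1: anti-affinity filter. Prefer brokers that own no namespace of this group.
        List<String> preferred = usagePercent.keySet().stream()
                .filter(broker -> !groupOwners.contains(broker))
                .collect(Collectors.toList());

        // Step 2: drop preferred brokers whose load has reached the overloaded threshold.
        List<String> usable = preferred.stream()
                .filter(broker -> usagePercent.get(broker) < overloadedThresholdPercent)
                .collect(Collectors.toList());

        // Step 3: if every preferred broker is overloaded, fall back to all brokers,
        // which can put two namespaces of the same group on one broker.
        List<String> candidates = usable.isEmpty()
                ? new ArrayList<>(usagePercent.keySet())
                : usable;

        // Stand-in placement strategy: pick the least-loaded candidate.
        return candidates.stream()
                .min(Comparator.comparingDouble(usagePercent::get))
                .orElseThrow(IllegalStateException::new);
    }

    public static void main(String[] args) {
        Map<String, Double> usage = new LinkedHashMap<>();
        usage.put("broker-a", 10.0); // already owns a namespace of the group
        usage.put("broker-b", 90.0); // the only anti-affinity-preferred broker
        Set<String> owners = Collections.singleton("broker-a");

        // With an 85% threshold the preferred broker is overridden.
        System.out.println(selectBroker(usage, owners, 85.0));  // prints "broker-a"
        // With a threshold no broker can exceed, anti-affinity holds.
        System.out.println(selectBroker(usage, owners, 100.0)); // prints "broker-b"
    }
}
```

In this model the first call reproduces the shape of the flaky failure (both namespaces land on the same broker), while the second call shows why a sufficiently high threshold would make the test's assertion hold as long as a second broker is available.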
I ran
$ mvn clean test -Dtest=AntiAffinityNamespaceGroupTest -DfailIfNoTests=false -pl pulsar-broker
locally until I got a failure. This snippet from the failed test’s logs shows the essential information to expose the problem:
The test fails with the following error:
Given the second log line
Selected broker Optional[localhost:57033] from candidate brokers [localhost:57033]
it’s clear that the anti-affinity was applied, but the default overloaded threshold of 85% led the method to override the preferred broker, which led to the test’s failure.
First, I don’t think we want to remove this test. It adds value by testing selectBrokerForAssignment’s usage of the anti-affinity method (LoadManagerShared.filterAntiAffinityGroupOwnedBrokers). Here is a reference to the selectBrokerForAssignment method: https://github.com/apache/pulsar/blob/d40fb5837fe33fb9f7630f7674e6953bb63ca164/pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/impl/ModularLoadManagerImpl.java#L763-L861
Instead, I think it would make the most sense to set the overloadedThreshold high enough that it won’t allow the preferred broker to be overridden. In this way, we remove the main caveat on the anti-affinity logic because no broker will be too overloaded. Note that the other caveat is that a second broker has to be available, but I don’t believe that has been a problem.
For reference, here are the full logs from the test: