Axon hangs after period of slow DB responses
Basic information
In production, our DB cluster was impacted by a series of expensive queries, resulting in a period of slow and unreliable DB responses (not caused by Axon). During that time, when Axon received input to process, it emitted the first command, but no further processing happened. After the DB lag normalised, Axon did not resume processing.
- Axon Framework version: 4.5.3
- JDK version: 11.0.11
- PostgreSQL DB
- In prod we are running 2 instances with the following configuration:
fun deadlineManager(
    scheduler: Scheduler,
    configuration: AxonConfiguration,
    transactionManager: TransactionManager
): DeadlineManager = QuartzDeadlineManager.builder()
    .scheduler(scheduler)
    .transactionManager(transactionManager)
    .scopeAwareProvider(ConfigurationScopeAwareProvider(configuration))
    .refireImmediatelyPolicy { false }
    .build()
@Bean
fun axonWorkerThreadPool(axonProcessorsProperties: AxonProcessorsProperties): ScheduledExecutorService {
    return Executors.newScheduledThreadPool(
        axonProcessorsProperties.workerThreadCount,
        AxonThreadFactory("Worker")
    )
}
@Autowired
@Suppress("LongMethod")
fun registerPooledStreamingProcessorConfig(
    processingConfigurer: EventProcessingConfigurer,
    axonWorkerThreadPool: ScheduledExecutorService
) {
    processingConfigurer.usingPooledStreamingEventProcessors()
    processingConfigurer
        .registerPooledStreamingEventProcessorConfiguration(
            SAGA_EVENT_PROCESSOR_1,
            createProcessorConfiguration(
                "Coordinator 1",
                axonWorkerThreadPool,
                1
            )
        )
        .registerPooledStreamingEventProcessorConfiguration(
            SAGA_EVENT_PROCESSOR_2,
            createProcessorConfiguration(
                "Coordinator 2",
                axonWorkerThreadPool,
                1
            )
        )
        .registerPooledStreamingEventProcessorConfiguration(
            SAGA_EVENT_PROCESSOR_3,
            createProcessorConfiguration(
                "Coordinator 3",
                axonWorkerThreadPool,
                1
            )
        )
        .registerPooledStreamingEventProcessorConfiguration(
            DOMAIN_EVENT_PROCESSOR,
            createProcessorConfiguration(
                "Coordinator domain event",
                axonWorkerThreadPool,
                32
            )
        )
}
fun createProcessorConfiguration(
    coordinatorName: String,
    axonWorkerThreadPool: ScheduledExecutorService,
    initialSegmentCount: Int
): PooledStreamingProcessorConfiguration {
    return PooledStreamingProcessorConfiguration { config, builder ->
        builder.coordinatorExecutor(
            Executors.newScheduledThreadPool(
                coordinatorThreadCountCount, AxonThreadFactory(coordinatorName)
            )
        )
            .workerExecutor(axonWorkerThreadPool)
            .initialSegmentCount(initialSegmentCount)
    }
}
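The configuration above hands one worker pool to all four processors. As a minimal, framework-free sketch (plain JDK executors, not Axon code), the snippet below shows a general property of such a shared pool: once every worker thread is blocked, for example on a slow DB call, work queued for any other processor waits until a thread frees up. The value `workerThreadCount = 2` and the function name `sharedPoolStarvationDemo` are hypothetical, and this is an illustration only, not a diagnosis of the hang reported here.

```kotlin
import java.util.concurrent.CountDownLatch
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit

// Returns (ran while the pool was blocked, ran after the pool was released).
fun sharedPoolStarvationDemo(): Pair<Boolean, Boolean> {
    // Hypothetical value standing in for axonProcessorsProperties.workerThreadCount.
    val workerThreadCount = 2
    val sharedPool = Executors.newScheduledThreadPool(workerThreadCount)
    val release = CountDownLatch(1)        // simulates the slow DB finally answering
    val otherWorkDone = CountDownLatch(1)  // signals that the unrelated task has run

    try {
        // Occupy every worker thread with a task that blocks, as a slow DB call would.
        repeat(workerThreadCount) {
            sharedPool.execute { release.await() }
        }
        // Work belonging to a different processor is queued behind the blocked tasks.
        sharedPool.execute { otherWorkDone.countDown() }

        // While all workers are blocked, the queued task cannot start.
        val ranWhileBlocked = otherWorkDone.await(200, TimeUnit.MILLISECONDS)

        // Unblock the workers; the queued task is picked up by a freed thread.
        release.countDown()
        val ranAfterRelease = otherWorkDone.await(2, TimeUnit.SECONDS)

        return ranWhileBlocked to ranAfterRelease
    } finally {
        sharedPool.shutdownNow()
    }
}

fun main() {
    val (ranWhileBlocked, ranAfterRelease) = sharedPoolStarvationDemo()
    println("ran while blocked: $ranWhileBlocked, ran after release: $ranAfterRelease")
}
```

In this sketch the queued task does run once the blocked workers return, which matches the expected behaviour described in this report rather than the observed one.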
Steps to reproduce
- Use a PostgreSQL DB running in a Docker container
- Use Pumba (https://github.com/alexei-led/pumba) to slow down the connection to the DB:
  - Install iproute2 on the DB container:
    docker exec -it postgresdb sh -c "apt update && apt install -y iproute2"
  - Simulate DB lag:
    pumba netem --duration 5m --interface eth0 delay --time 3000 --jitter 30 --correlation 20 postgresdb
- Run some load on the application
Expected behaviour
After the DB lag normalizes, Axon proceeds with regular processing.
Actual behaviour
The request handler creates a command (e.g. StartProcessingCommand), the aggregate is created, and it emits ProcessingStartedEvent. The event is persisted in the DB and should be processed by SAGA_EVENT_PROCESSOR_1, but it is never picked up for processing.
Please note: it is acceptable for the application to reject some requests while the DB is under-performing, or to accept a request without starting the processing right away. The point is that accepted requests are not processed even after the DB disruption has ended.
Thanks for your insights, @pawel-lankocz. I’ll see what we can do to provide more guidance on the subject. Happy coding with Axon and hope to chat with you in the future!
Hello, indeed it seems the problem was caused at the Hikari level, and I don't think Axon is to blame.
Regarding the documentation, I found the following description very informative:
In my opinion, a similar description in the documentation could prove beneficial. Likewise, a recommendation on why and how to structure the database (or how to split it into an event (and snapshot) store DB and the "rest") could be useful.
Again thank you for your assistance @smcvb!
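For readers landing here later, the suggestion of keeping the event store separate from the "rest" can be sketched without any framework. The thread does not spell out what exactly went wrong at the Hikari level; the sketch below assumes connection pool exhaustion. `BoundedPool`, `restPool`, `eventStorePool` and the timeout values are hypothetical stand-ins and not the HikariCP API. It shows that a borrower with a timeout fails fast instead of waiting indefinitely, and that a separate pool stays usable while the other one is held up by a slow query.

```kotlin
import java.util.concurrent.CountDownLatch
import java.util.concurrent.Semaphore
import java.util.concurrent.TimeUnit
import kotlin.concurrent.thread

// Hypothetical stand-in for a connection pool; this is not the HikariCP API.
class BoundedPool(size: Int, private val borrowTimeoutMs: Long) {
    private val permits = Semaphore(size)

    // Runs the block with a "connection", or returns null if none frees up in time.
    fun <T : Any> withConnection(block: () -> T): T? {
        if (!permits.tryAcquire(borrowTimeoutMs, TimeUnit.MILLISECONDS)) return null
        try {
            return block()
        } finally {
            permits.release()
        }
    }
}

// Returns (attempt on the exhausted pool, attempt on the separate pool, attempt after recovery).
fun poolIsolationDemo(): Triple<String?, String?, String?> {
    val restPool = BoundedPool(size = 1, borrowTimeoutMs = 100)
    val eventStorePool = BoundedPool(size = 1, borrowTimeoutMs = 100)
    val slowQueryStarted = CountDownLatch(1)
    val slowQueryMayFinish = CountDownLatch(1)

    // A slow query holds the only connection of the "rest" pool.
    val slowQuery = thread {
        restPool.withConnection {
            slowQueryStarted.countDown()
            slowQueryMayFinish.await()
            "slow query done"
        }
    }
    slowQueryStarted.await()

    // Sharing the exhausted pool: the borrow times out and yields null.
    val sharedAttempt = restPool.withConnection { "events read" }
    // Using a dedicated pool: unaffected by the slow query.
    val isolatedAttempt = eventStorePool.withConnection { "events read" }

    // The slow query ends; the first pool becomes usable again.
    slowQueryMayFinish.countDown()
    slowQuery.join()
    val afterRecovery = restPool.withConnection { "events read" }

    return Triple(sharedAttempt, isolatedAttempt, afterRecovery)
}

fun main() {
    println(poolIsolationDemo())
}
```

With a real pool the same idea would presumably be expressed through its own size and timeout settings on two separate data sources, which is configuration this sketch does not attempt to reproduce.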