
Axon hangs after period of slow DB responses

See original GitHub issue

Basic information

On production, our DB cluster was impacted by a series of expensive queries, resulting in a period of slow and unreliable DB responses (not caused by Axon). During that time, when Axon received input to process, it emitted the first command but no further processing happened. After the DB lag normalised, Axon did not resume processing.

  • Axon Framework version: 4.5.3
  • JDK version: 11.0.11
  • PostgreSQL DB
  • In prod we are running 2 instances with the following configuration:
    fun deadlineManager(
        scheduler: Scheduler,
        configuration: AxonConfiguration,
        transactionManager: TransactionManager
    ): DeadlineManager = QuartzDeadlineManager.builder()
        .scheduler(scheduler)
        .transactionManager(transactionManager)
        .scopeAwareProvider(ConfigurationScopeAwareProvider(configuration))
        .refireImmediatelyPolicy { false }
        .build()

    @Bean
    fun axonWorkerThreadPool(axonProcessorsProperties: AxonProcessorsProperties): ScheduledExecutorService {
        return Executors.newScheduledThreadPool(
            axonProcessorsProperties.workerThreadCount,
            AxonThreadFactory("Worker")
        )
    }

    @Autowired
    @Suppress("LongMethod")
    fun registerPooledStreamingProcessorConfig(
        processingConfigurer: EventProcessingConfigurer,
        axonWorkerThreadPool: ScheduledExecutorService
    ) {
        processingConfigurer.usingPooledStreamingEventProcessors()
        processingConfigurer
            .registerPooledStreamingEventProcessorConfiguration(
                SAGA_EVENT_PROCESSOR_1,
                createProcessorConfiguration(
                    "Coordinator 1",
                    axonWorkerThreadPool,
                    1
                )
            )
            .registerPooledStreamingEventProcessorConfiguration(
                SAGA_EVENT_PROCESSOR_2,
                createProcessorConfiguration(
                    "Coordinator 2",
                    axonWorkerThreadPool,
                    1
                )
            )
            .registerPooledStreamingEventProcessorConfiguration(
                SAGA_EVENT_PROCESSOR_3,
                createProcessorConfiguration(
                    "Coordinator 3",
                    axonWorkerThreadPool,
                    1
                )
            )
            .registerPooledStreamingEventProcessorConfiguration(
                DOMAIN_EVENT_PROCESSOR,
                createProcessorConfiguration(
                    "Coordinator domain event",
                    axonWorkerThreadPool,
                    32
                )
            )
    }

    fun createProcessorConfiguration(
        coordinatorName: String,
        axonWorkerThreadPool: ScheduledExecutorService,
        initialSegmentCount: Int
    ): PooledStreamingProcessorConfiguration {
        return PooledStreamingProcessorConfiguration { config, builder ->
            builder.coordinatorExecutor(
                Executors.newScheduledThreadPool(
                    coordinatorThreadCount, AxonThreadFactory(coordinatorName)
                )
            )
                .workerExecutor(axonWorkerThreadPool)
                .initialSegmentCount(initialSegmentCount)
        }
    }
    

Steps to reproduce

  • use a PostgreSQL DB running in a Docker container
  • use Pumba (https://github.com/alexei-led/pumba) to slow down the connection to the DB
    • Install iproute2 on the DB container: docker exec -it postgresdb sh -c "apt update && apt install -y iproute2"
    • Simulate DB lag: pumba netem --duration 5m --interface eth0 delay --time 3000 --jitter 30 --correlation 20 postgresdb
  • Run some load on the application (a minimal load sketch follows this list)
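
A minimal load sketch, assuming commands are dispatched through Axon's CommandGateway; StartProcessingCommand is the command named in this report, but its processingId field, the request count, and the function name are placeholders:

    import org.axonframework.commandhandling.gateway.CommandGateway
    import java.util.UUID

    // Placeholder command; the real project's StartProcessingCommand will differ.
    data class StartProcessingCommand(val processingId: String)

    fun runLoad(commandGateway: CommandGateway, requests: Int = 500) {
        repeat(requests) {
            // Fire-and-forget dispatch; some rejections are expected while the DB lag is active.
            commandGateway.send<Any>(StartProcessingCommand(UUID.randomUUID().toString()))
        }
    }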

Expected behaviour

After the DB lag normalizes, Axon proceeds with regular processing.

Actual behaviour

The request handler creates a command (e.g. StartProcessingCommand), the aggregate is created, and it publishes a ProcessingStartedEvent. The event is persisted in the DB and should be processed by SAGA_EVENT_PROCESSOR_1, but it is never picked up for processing.
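
For reference, a minimal sketch of that flow, assuming a standard event-sourced aggregate and saga; only the command, event, and processor names come from this report, everything else is illustrative:

    import org.axonframework.commandhandling.CommandHandler
    import org.axonframework.eventsourcing.EventSourcingHandler
    import org.axonframework.modelling.command.AggregateIdentifier
    import org.axonframework.modelling.command.AggregateLifecycle
    import org.axonframework.modelling.saga.SagaEventHandler
    import org.axonframework.modelling.saga.StartSaga
    import org.axonframework.spring.stereotype.Aggregate
    import org.axonframework.spring.stereotype.Saga

    // Placeholder types, redeclared here so the sketch stands alone.
    data class StartProcessingCommand(val processingId: String)
    data class ProcessingStartedEvent(val processingId: String)

    @Aggregate
    class ProcessingAggregate() {
        @AggregateIdentifier
        private lateinit var processingId: String

        @CommandHandler
        constructor(command: StartProcessingCommand) : this() {
            // The event is appended to the event store here; SAGA_EVENT_PROCESSOR_1
            // is expected to pick it up afterwards.
            AggregateLifecycle.apply(ProcessingStartedEvent(command.processingId))
        }

        @EventSourcingHandler
        fun on(event: ProcessingStartedEvent) {
            processingId = event.processingId
        }
    }

    @Saga
    class ProcessingSaga {
        @StartSaga
        @SagaEventHandler(associationProperty = "processingId")
        fun on(event: ProcessingStartedEvent) {
            // In the reported scenario this handler never runs after the DB lag.
        }
    }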

Please note: it is acceptable for the application to reject some requests while the DB is under-performing, or to accept a request without starting processing right away. The point is that accepted requests are not processed even after the DB disruption has ended.

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 9 (5 by maintainers)

Top GitHub Comments

1 reaction
smcvb commented, Mar 25, 2022

Thanks for your insights, @pawel-lankocz. I’ll see what we can do to provide more guidance on the subject. Happy coding with Axon and hope to chat with you in the future!

1 reaction
pawel-lankocz commented, Mar 24, 2022

Hello, indeed it seems like the problem was caused at the Hikari level, and I don’t think Axon is to blame.

Regarding the documentation, I found the following description very informative:

When using Axon Framework, every thread pool per Event Processor would have its own threads.(…) Furthermore, the Coordinator connects to your Event Store. (…) the Coordinator threads will maintain an open connection as long as the processor runs. (…) Adjust the connection pool size of Hikari to match the number of threads actively connecting to query model databases when event handling

In my opinion, a similar description in the documentation could prove beneficial. Similarly, a recommendation on why and how to structure the database (or how to split the database into an event (and snapshot) store DB and the “rest”) could be useful.
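
As a rough illustration of that advice, a sketch of sizing the Hikari pool against the Axon thread counts shown earlier; the URL, credentials, headroom, and timeout are assumptions, not values from this project:

    import com.zaxxer.hikari.HikariConfig
    import com.zaxxer.hikari.HikariDataSource
    import javax.sql.DataSource

    // Sizes the pool so every coordinator and worker thread can hold a connection,
    // plus headroom for regular request handling.
    fun dataSource(workerThreadCount: Int): DataSource {
        val coordinatorThreads = 4          // one coordinator pool per processor, assuming 1 thread each
        val requestHandlingHeadroom = 10    // assumed headroom for API/command handling

        val config = HikariConfig().apply {
            jdbcUrl = "jdbc:postgresql://localhost:5432/app"
            username = "app"
            password = "secret"
            maximumPoolSize = coordinatorThreads + workerThreadCount + requestHandlingHeadroom
            connectionTimeout = 5_000L      // fail fast instead of queueing forever on a slow DB
        }
        return HikariDataSource(config)
    }

Splitting the event store and the query-model databases, as suggested above, would also let each one get its own appropriately sized pool.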

Again thank you for your assistance @smcvb!

Read more comments on GitHub >

Top Results From Across the Web

Hanging system: No more queries processed - Axon Framework
I have a problem with a “hanging” application node which blocks further query processing. We are using Axon Framework v4.5.2 and Axon Server ......
Read more >
