ThreadLocalPool (reactive) leaks connections when the worker thread pool scales up and down
Describe the bug

The ThreadLocalPool and its derived variant ThreadLocalPgPool leak connections when Quarkus scales down worker threads (after idling). The root cause is that the per-thread pools are additionally kept in the threadLocalPools list in io.quarkus.reactive.datasource.runtime.ThreadLocalPool. When Quarkus scales up worker threads under load, additional thread-local pool instances are spawned (understandable). But as soon as the thread pool shrinks, the previous thread-local pool instances remain in the threadLocalPools list, and all of their connections are kept open towards the DB. When the engine comes under load again and spawns new threads, yet more thread-local pool instances are created. This breaks as soon as the underlying DB runs out of connections.
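To make the mechanism concrete, here is a minimal sketch of the pattern described above (hypothetical class and field names that mirror the report; this is not the actual Quarkus source):

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Simplified sketch of the leaking pattern: each worker thread lazily
// creates its own pool, and every pool is also registered in a shared
// list so that a global close() can reach all of them.
class LeakyThreadLocalPool {

    static class Pool {
        void close() { /* would release the underlying DB connections */ }
    }

    // Strong references accumulate here: when the executor retires an idle
    // worker thread, its ThreadLocal entry becomes unreachable, but the pool
    // (and its open connections) stays alive through this list.
    private final List<Pool> threadLocalPools = new CopyOnWriteArrayList<>();

    private final ThreadLocal<Pool> pools = ThreadLocal.withInitial(() -> {
        Pool pool = new Pool();
        threadLocalPools.add(pool);
        return pool;
    });

    Pool pool() {
        return pools.get();
    }
}
```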
Expected behavior

ThreadLocal pools (not the pool itself, but the thread-local "partitions" of the pool) should be closed properly, including their connections, when the threads are scaled down.
Actual behavior

As mentioned above, the pool creates a new thread-local "partition", plus the corresponding DB connections, for each newly spawned worker thread.
To Reproduce
Steps to reproduce the behavior:
- deploy some bean with DB interaction (a simple query) via Hibernate Reactive (a minimal example is sketched after this list)
- put the system under load; enough load that the DB pool is saturated
- cross-check the open connections on the DB (e.g. on Postgres via SELECT * FROM pg_stat_activity;); this should show pool size * worker thread count open connections
- leave the system idle and wait until Quarkus reduces the worker threads
- repeat the load test
- cross-check the open connections on the DB again (e.g. on Postgres via SELECT * FROM pg_stat_activity;); instead of pool size * worker thread count, the number of open connections has doubled
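For the first step, a minimal bean could look like the following sketch (a hypothetical endpoint; it assumes Hibernate Reactive's Mutiny.SessionFactory is injectable, which the Quarkus Hibernate Reactive extension provides):

```java
import javax.inject.Inject;
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;

import org.hibernate.reactive.mutiny.Mutiny;

import io.smallrye.mutiny.Uni;

// Hypothetical endpoint: any bean that runs a simple query through
// Hibernate Reactive on every request is enough to exercise the pool.
@Path("/ping-db")
public class PingDbResource {

    @Inject
    Mutiny.SessionFactory sessionFactory;

    @GET
    @Produces(MediaType.TEXT_PLAIN)
    public Uni<String> ping() {
        // Trivial native query; under load this saturates the reactive pool.
        return sessionFactory
                .withSession(session -> session.createNativeQuery("select 1").getSingleResult())
                .map(Object::toString);
    }
}
```

Driving this endpoint with any HTTP load generator then saturates the pool for the subsequent steps.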
Configuration
application.properties
quarkus.hibernate-orm.database.generation=create
quarkus.datasource.db-kind=postgresql
quarkus.datasource.username=***
quarkus.datasource.password=***
quarkus.datasource.reactive=true
quarkus.datasource.reactive.url=postgresql://localhost:5434/****
quarkus.hibernate-orm.log.sql=false
quarkus.datasource.reactive.max-size=10
quarkus.thread-pool.max-threads=32
quarkus.datasource.reactive.cache-prepared-statements=true
quarkus.datasource.reactive.postgresql.pipelining-limit=256
Environment (please complete the following information):
- Output of uname -a or ver: Darwin localhost 20.2.0 Darwin Kernel Version 20.2.0: Wed Dec 2 20:39:59 PST 2020; root:xnu-7195.60.75~1/RELEASE_X86_64 x86_64
- Output of java -version: java version "11.0.2" 2019-01-15 LTS
- GraalVM version (if different from Java):
- Quarkus version or git rev: 1.11.0.Beta2
- Build tool (i.e. output of mvnw --version or gradlew --version):
  Gradle 6.5.1
  Build time: 2020-06-30 06:32:47 UTC
  Revision: 66bc713f7169626a7f0134bf452abde51550ea0a
  Kotlin: 1.3.72
  Groovy: 2.5.11
  Ant: Apache Ant™ version 1.10.7 compiled on September 1 2019
  JVM: 11.0.2 (Oracle Corporation 11.0.2+9-LTS)
  OS: Mac OS X 10.16 x86_64
A workaround for me at the moment is to set quarkus.thread-pool.core-threads and quarkus.thread-pool.max-threads to the same value in order to prevent any up- and down-scaling of the thread pool, as in the snippet below.
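In application.properties:

```
# Pin the worker thread pool size so idle threads are never retired
quarkus.thread-pool.core-threads=32
quarkus.thread-pool.max-threads=32
```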
@voigtste might want to play with https://github.com/quarkusio/quarkus/pull/14102
It does NOT address:
but if you use Hibernate Reactive over its own API and without opening a Session directly, it will delegate work to the right context. You can set
-Dorg.hibernate.reactive.common.InternalStateAssertions.ENFORCE=true
to have it throw exceptions when one of the other APIs is used from the wrong thread (and it would fail with Panache Reactive if you use it at all from the wrong thread). Assuming you don't access the SQL Pool directly from a worker pool thread, and your code doesn't fail when run with that flag enabled, it should avoid the connection leak as well.
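For reference, such a system property is passed on the JVM command line when launching the application; the runner jar path below is an assumption and depends on your build:

```
java -Dorg.hibernate.reactive.common.InternalStateAssertions.ENFORCE=true -jar build/my-app-runner.jar
```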
I didn't want to spend too much time on this (as I don't have much time and should really study Vert.x 4 for a long-term optimal solution), but since a leak isn't acceptable to me, I'm sending a PR with an alternative solution.
Essentially, while it would be more complicated to immediately clean up references on scale-down, it's quite straightforward to check for zombies when we scale in the other direction (sketched below). This implies that while resources won't be released immediately, the maximum cost will, in the worst case, only match the cost you'd have at the maximum scale-up of the pool.
So it's not optimal yet, but at least it's not a leak that risks taking down a system, as a periodic scale up/down won't result in additional memory costs.
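A minimal sketch of that strategy (hypothetical names, not the actual PR): each time a thread creates a new pool, first close and evict the pools whose owning thread has already died.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Sketch of the mitigation described above: instead of cleaning up eagerly
// when a thread dies, scan for "zombie" pools (pools whose owning thread is
// no longer alive) whenever a new thread-local pool is created on scale-up.
class ZombieCheckingThreadLocalPool {

    static class Pool {
        final Thread owner = Thread.currentThread();
        void close() { /* would release the underlying DB connections */ }
    }

    private final List<Pool> threadLocalPools = new CopyOnWriteArrayList<>();

    private final ThreadLocal<Pool> pools = ThreadLocal.withInitial(this::createPool);

    private Pool createPool() {
        // Scaling up again is the moment we pay the cleanup cost: close and
        // drop every pool whose owner thread has terminated, so the worst
        // case matches the cost at maximum scale-up rather than growing
        // without bound.
        for (Pool pool : threadLocalPools) {
            if (!pool.owner.isAlive()) {
                pool.close();
                threadLocalPools.remove(pool);
            }
        }
        Pool fresh = new Pool();
        threadLocalPools.add(fresh);
        return fresh;
    }

    Pool pool() {
        return pools.get();
    }
}
```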