[SIP-26] Proposal for Implementing Connection Pooling for Analytics Database Connections
Motivation
Currently, Superset’s connections to analytics databases do not have long-lived connection pools. In most instances, a database connection is spawned immediately before a query is executed and discarded after a single use. This introduces a small amount of latency into every query. While most queries run against data warehouses are expected to be longer-running than a typical web application query, this latency will be noticeable when performing operations such as loading schema and table lists for display in the UI, or loading table definitions and previews.

A more serious concern is that the number of open database connections to analytics databases is only bounded by the number of threads available to the application across all processes. Under peak load, this can lead to hammering databases with a large number of connection requests and queries simultaneously. This does not allow us to provide meaningful upper bounds for the number of available database connections. Implementing connection pooling at the process level will allow us to provide a configurable maximum number of connections that Superset is able to leverage.
Proposed Change
I recommend we add a singleton object to hold a SQLAlchemy Engine instance for each configured database in the application. I believe that engines should not be instantiated on startup, but instead instantiated on first use to avoid unnecessary connection negotiation.
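A lazily-initialized registry along these lines could satisfy both requirements. This is only a sketch of the idea, not Superset's actual API; the class and method names are hypothetical.

```python
# Sketch of a lazy, process-level engine registry. All names here are
# hypothetical illustrations, not Superset's actual code. Engines are
# created on first use rather than at startup, avoiding unnecessary
# connection negotiation.
from threading import Lock

from sqlalchemy import create_engine
from sqlalchemy.engine import Engine


class EngineRegistry:
    """Holds one SQLAlchemy Engine per configured database, keyed by id."""

    _instance = None
    _lock = Lock()

    def __init__(self) -> None:
        self._engines: dict[int, Engine] = {}

    @classmethod
    def instance(cls) -> "EngineRegistry":
        # Process-wide singleton; double-checked creation under a lock.
        with cls._lock:
            if cls._instance is None:
                cls._instance = cls()
            return cls._instance

    def get_engine(self, database_id: int, uri: str) -> Engine:
        # Instantiate lazily so startup pays no connection cost.
        if database_id not in self._engines:
            self._engines[database_id] = create_engine(uri)
        return self._engines[database_id]

    def invalidate(self, database_id: int) -> None:
        # Would be called after a Database record is created or updated.
        engine = self._engines.pop(database_id, None)
        if engine is not None:
            engine.dispose()
```

Repeated `get_engine` calls for the same database would return the same `Engine` (and therefore the same pool), while `invalidate` disposes of the pool so the next access rebuilds it from the updated record.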
I further recommend that we use the SQLAlchemy `QueuePool` as the default pool implementation, while retaining the ability to configure Superset to use a `NullPool` via the Database setup system. I would like to make the `pool_size` and `max_overflow` properties configurable, as well as whether to treat the queue as FIFO or LIFO, the `pool_pre_ping` option, and the `connect_args` passed on engine instantiation (which control things like connection timeouts). I believe that LIFO queues will be preferable for infrequently accessed database connections, as they generally maintain fewer connections in the pool, and thus should be the default. I would also recommend that for LIFO queues we enable `pool_pre_ping` by default to trigger pool member invalidation when necessary, as stale connections are more likely under the LIFO configuration.
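For concreteness, the options above map directly onto `create_engine()` keyword arguments in SQLAlchemy. This is a hedged sketch with a SQLite URL standing in for a real analytics database; the specific values are illustrative, not proposed defaults.

```python
from sqlalchemy import create_engine, text
from sqlalchemy.pool import QueuePool

# Illustrative mapping of the proposed settings onto SQLAlchemy's
# create_engine() arguments. The sqlite URL and the values shown are
# placeholders, not recommended production defaults.
engine = create_engine(
    "sqlite://",
    poolclass=QueuePool,       # proposed default; NullPool stays configurable
    pool_size=5,               # steady-state connections held by the pool
    max_overflow=10,           # extra connections allowed under burst load
    pool_use_lifo=True,        # LIFO keeps the pool small for idle databases
    pool_pre_ping=True,        # revalidate connections; important under LIFO
    connect_args={"timeout": 30},  # driver-level options, e.g. timeouts
)

with engine.connect() as conn:
    result = conn.execute(text("SELECT 1")).scalar()
```

With `pool_use_lifo=True`, the most recently returned connection is handed out first, so rarely-used connections age out; `pool_pre_ping=True` then catches any that have gone stale before they reach a query.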
As part of this work, I recommend moving engine instantiation code out of the Database model and into its own module, probably as part of the singleton that maintains the in-memory set of database pools. We will also need to update the code that alters database records so that each process’s engine is reinitialized after a Database record is created or updated.
One further change concerns Celery’s connection pooling. Right now, we use the NullPool in Celery and instantiate database connections as needed. For Celery, I would recommend moving to the StaticPool, which maintains one database connection per worker process. Because Celery reuses worker processes, this will reduce the overhead of background queries. An alternative would be to move to threaded workers (gevent or eventlet) and maintain the same pool configuration as the UI. I’d love suggestions from the community on what to recommend here.
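The key property of `StaticPool` is that every checkout reuses the same underlying DBAPI connection. An in-memory SQLite database makes this observable, since state created through one checkout is visible from the next; this is only a demonstration of the pool's behavior, not Superset's Celery code.

```python
from sqlalchemy import create_engine, text
from sqlalchemy.pool import StaticPool

# StaticPool hands back the same underlying DBAPI connection on every
# checkout, which is what makes it suitable for one-connection-per-worker
# Celery processes. An in-memory sqlite database makes the reuse visible:
# the table created in the first block still exists in the second, which
# would not be true with NullPool (each checkout would get a fresh db).
static_engine = create_engine("sqlite://", poolclass=StaticPool)

with static_engine.begin() as conn:
    conn.execute(text("CREATE TABLE t (x INTEGER)"))
    conn.execute(text("INSERT INTO t VALUES (1)"))

with static_engine.connect() as conn:
    count = conn.execute(text("SELECT COUNT(*) FROM t")).scalar()
```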
New or Changed Public Interfaces
This change should have minimal impact on the UI, the primary change being the addition of more configuration options in the Databases section. I would recommend having sensible defaults and hiding the pool setup under an Advanced configuration section. I plan to provide guidance on the meaning of the `pool_size`, `max_overflow`, and FIFO vs. LIFO configuration parameters, both in the UI and in new documentation. The configuration approach will be hybrid, allowing global configuration of defaults in `config.py`, with overrides available on a per-database basis in the UI.
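The hybrid scheme could be as simple as a dictionary merge: global defaults in `config.py`, shallowly overridden by whatever the user set for a given database in the UI. The names below are hypothetical, not actual Superset settings.

```python
# Hedged sketch of the hybrid configuration. GLOBAL_POOL_DEFAULTS would
# live in config.py; the overrides come from the per-database record
# edited in the UI. All names and values are hypothetical.
GLOBAL_POOL_DEFAULTS = {
    "pool_size": 5,
    "max_overflow": 10,
    "pool_use_lifo": True,
    "pool_pre_ping": True,
}


def effective_pool_config(per_database_overrides: dict) -> dict:
    """Merge per-database UI overrides over the global defaults."""
    return {**GLOBAL_POOL_DEFAULTS, **per_database_overrides}
```

For example, a database configured with only `{"pool_size": 20}` would keep the global `max_overflow`, LIFO, and pre-ping defaults.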
New dependencies
No additional dependencies will be necessary.
Migration Plan and Compatibility
A database migration will be necessary to add an additional field to the DBs table to hold connection pooling arguments.
No URLs will change as part of this work. I would like feedback from the community, particularly engineers at Airbnb, Lyft, and other organizations with large Superset installs, on what sensible defaults for connection pools would look like.
Rejected Alternatives
The primary alternative rejected is the current, pool-less state. While it ensures that only the connections needed at any given moment are in use, it falls short on performance and on the predictability of the number of open connections at any given time.
I also considered the other connection pool implementations in SQLAlchemy, but it appears that our use-case is best served by the QueuePool implementation.
One additional option I considered was letting the user configure an overall, rather than per-process, maximum number of connections. In that case, processes would need to “check out” the ability to make a connection from a distributed lock built on Redis, or the maximum would need to be large enough to provide at least one connection per live process. While I think this would be a better experience for most users, I’m concerned about the additional application complexity such a change requires. Would processes need to register themselves in Redis on boot so we could get a correct count of live processes? What happens when we need to scale up beyond the global maximum number of database connections? Solving those problems is not easy, and most use cases will be served well enough by a per-process maximum number of connections.
Issue Analytics
- Created 4 years ago
- Reactions: 10
- Comments: 22 (20 by maintainers)
Top GitHub Comments
@villebro completely agreed RE: connection pooling everywhere being desirable. Given the structure of the system though, pooling for the metadata database is a special case. Connection pooling should already be in use for foreground connections, but Celery presents special concerns, which is why I’d like to treat it separately. We definitely share the same end goal though!
Another thing to cover as part of this SIP is the configurability of the pools per database. With potentially heterogeneous params depending on pool type, it’s hard to strike a balance between something comprehensive and static vs. something flexible. We could have both: perhaps different presets in a dropdown list, plus the possibility to override with actual pool objects in a dict in superset_config.py …
Planning on making another comment addressing your points above.