
[SIP-26] Proposal for Implementing Connection Pooling for Analytics Database Connections

Motivation

Currently, Superset’s connections to analytics databases do not have long-lived connection pools. In most instances, a database connection is spawned immediately before a query is executed and discarded after a single use. This introduces a small amount of latency into every query. While most queries run against data warehouses are expected to be longer-running than a typical web application query, this latency is noticeable for operations such as loading schema and table lists for display in the UI, or loading table definitions and previews.

A more serious concern is that the number of open connections to analytics databases is bounded only by the number of threads available to the application across all processes. Under peak load, this can mean hammering databases with a large number of simultaneous connection requests and queries, and it prevents us from providing meaningful upper bounds on the number of open database connections. Implementing connection pooling at the process level will allow us to provide a configurable maximum number of connections that Superset is able to leverage.

Proposed Change

I recommend we add a singleton object to hold a SQLAlchemy Engine instance for each configured database in the application. I believe that engines should not be instantiated on startup, but instead instantiated on first use to avoid unnecessary connection negotiation.
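
To make this concrete, here is a minimal sketch of such a singleton. The EngineRegistry name and its methods are hypothetical, not an existing Superset API; the eventual module layout would follow from the refactor described below.

```python
import threading

from sqlalchemy import create_engine


class EngineRegistry:
    """Process-level singleton mapping database IDs to pooled engines (hypothetical)."""

    _lock = threading.Lock()
    _engines = {}

    @classmethod
    def get_engine(cls, database_id, url, **pool_kwargs):
        # Engines are created lazily on first use rather than at startup,
        # so we never negotiate connections for databases nobody queries.
        with cls._lock:
            engine = cls._engines.get(database_id)
            if engine is None:
                engine = create_engine(url, **pool_kwargs)
                cls._engines[database_id] = engine
            return engine

    @classmethod
    def reset(cls, database_id):
        # Called after a Database record is created or updated so this
        # process rebuilds its engine with the new settings on next use.
        engine = cls._engines.pop(database_id, None)
        if engine is not None:
            engine.dispose()
```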

I further recommend that we use the SQLAlchemy QueuePool as the default pool implementation, while retaining the ability to configure Superset to use a NullPool via the Database setup system. I would like to make the pool_size and max_overflow properties configurable, along with FIFO vs. LIFO queue behavior, the pool_pre_ping option, and the connect_args passed at engine instantiation (which control things like connection timeouts). I believe that LIFO queues will be preferable for infrequently-accessed database connections, as they generally maintain a smaller number of connections in the pool, and thus should be the default. For LIFO queues I would also recommend defaulting to the pool_pre_ping option, which invalidates stale pool members when necessary, as stale connections are more likely under the LIFO configuration.
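
For illustration, the options above map onto SQLAlchemy's create_engine roughly as follows. The URL and parameter values here are placeholders, not proposed defaults:

```python
from sqlalchemy import create_engine
from sqlalchemy.pool import QueuePool

engine = create_engine(
    "postgresql://user:pass@warehouse:5432/analytics",  # placeholder URL
    poolclass=QueuePool,       # proposed default pool implementation
    pool_size=5,               # steady-state connections held in the pool
    max_overflow=10,           # extra connections allowed under burst load
    pool_use_lifo=True,        # LIFO keeps the pool small for idle databases
    pool_pre_ping=True,        # validate connections, invalidating stale ones
    connect_args={"connect_timeout": 10},  # driver-level options, e.g. timeouts
)
```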

As part of this work, I recommend moving engine instantiation code out of the Database model and into its own module, probably as part of the singleton that will maintain an in-memory list of database pools. We will need to update the code that alters database records to reinitialize each process’s engine after a Database record is created or updated.

One further change concerns Celery’s connection pooling. Right now, we use the NullPool in Celery and instantiate database connections as needed. For Celery, I would recommend moving to the StaticPool, which maintains a single database connection per worker process. Because Celery reuses worker processes, this will reduce the overhead of background queries. An alternative would be to move to threaded workers (gevent or eventlet) and keep the same pool configuration as the UI. I’d love suggestions from the community on what to recommend here.
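
A minimal sketch of the Celery-side engine under the StaticPool recommendation (the URL is again a placeholder):

```python
from sqlalchemy import create_engine
from sqlalchemy.pool import StaticPool

# StaticPool holds exactly one connection, reused across tasks within
# the same (reused) Celery worker process.
worker_engine = create_engine(
    "postgresql://user:pass@warehouse:5432/analytics",  # placeholder URL
    poolclass=StaticPool,
)
```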

New or Changed Public Interfaces

This change should have minimal impact on the UI, the primary change being the addition of more configuration options in the Databases section. I would recommend having sensible defaults and hiding the pool setup under an Advanced configuration section. I plan to provide guidance on the meaning of the pool_size, max_overflow, and FIFO vs LIFO configuration parameters, both in the UI and in new documentation. The configuration approach will be hybrid, allowing global configuration of defaults in config.py, with overrides available on a per-database basis in the UI.
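
As a rough sketch of the hybrid approach, assuming a hypothetical SUPERSET_POOL_DEFAULTS setting in config.py and a pool_config field on the Database record:

```python
SUPERSET_POOL_DEFAULTS = {  # hypothetical config.py setting with global defaults
    "pool_size": 5,
    "max_overflow": 10,
    "pool_use_lifo": True,
    "pool_pre_ping": True,
}


def effective_pool_config(database):
    # Per-database overrides stored on the Database record (e.g. as JSON
    # edited in the UI) win over the global defaults from config.py.
    overrides = database.pool_config or {}
    return {**SUPERSET_POOL_DEFAULTS, **overrides}
```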

New dependencies

No additional dependencies will be necessary.

Migration Plan and Compatibility

A database migration will be necessary to add an additional field to the DBs table to hold connection pooling arguments.
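
A hypothetical Alembic migration along those lines might look like the following; the pool_config column name is an assumption, not something the SIP specifies (revision identifiers omitted):

```python
from alembic import op
import sqlalchemy as sa


def upgrade():
    # Add a JSON-serialized pool-configuration column to the dbs table.
    op.add_column("dbs", sa.Column("pool_config", sa.Text(), nullable=True))


def downgrade():
    op.drop_column("dbs", "pool_config")
```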

No URLs will change as part of this work. I would like feedback from the community, particularly engineers at Airbnb, Lyft, and other organizations with large Superset installs, on what sensible defaults for connection pools would look like.

Rejected Alternatives

The primary alternative rejected is the current, connection-pool-less state. While this state ensures that only the connections needed at any given time are in use, it falls short in terms of performance and the predictability of the number of open connections at any given time.

I also considered the other connection pool implementations in SQLAlchemy, but it appears that our use-case is best served by the QueuePool implementation.

I also considered giving users the option to configure an overall, rather than per-process, maximum number of connections. In that case, processes would need to “check out” the ability to make a connection from a distributed lock built on Redis, or the maximum would need to be large enough to provide at least one connection per live process. While I think this would be a better experience for most users, I’m concerned about the additional application complexity such a change requires. Would processes need to register themselves in Redis on boot so we could get a correct count of the number of live processes? What happens when we need to scale up beyond the global maximum number of database connections? I don’t think those problems are easy to solve, and most use-cases will be well enough served by a per-process maximum number of connections.
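
For reference, a sketch of the rejected mechanism: a counting semaphore in Redis that processes would use to check out a connection slot. The key scheme and cap are hypothetical, and this omits the TTL and crash-recovery handling whose complexity motivated the rejection:

```python
import redis

r = redis.Redis()
GLOBAL_MAX_CONNECTIONS = 40  # illustrative global cap


def try_acquire_slot(database_id):
    key = f"conn_slots:{database_id}"
    # INCR is atomic across processes; back out if we overshot the cap.
    if r.incr(key) > GLOBAL_MAX_CONNECTIONS:
        r.decr(key)
        return False
    return True


def release_slot(database_id):
    r.decr(f"conn_slots:{database_id}")
```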


Top GitHub Comments

willbarrett commented, Feb 24, 2020 (1 reaction)

@villebro completely agreed RE: connection pooling everywhere being desirable. Given the structure of the system though, pooling for the metadata database is a special case. Connection pooling should already be in use for foreground connections, but Celery presents special concerns, which is why I’d like to treat it separately. We definitely share the same end goal though!

mistercrunch commented, Nov 20, 2019 (1 reaction)

Another thing to cover as part of this SIP is the configurability of the pools per database. With potentially heterogeneous params based on pool type, it’s hard to strike a balance between something comprehensive and static vs. flexible. We could have both: maybe different presets in a dropdown list, plus the possibility to override with actual pool objects in a dict in superset_config.py …

Planning on making another comment addressing your points above.
