[SIP-39] Global Async Query Support
Motivation
- Loading time on dashboards with many charts is excessive due to the number of concurrent connections to the server.
- Synchronous connections left open waiting for result data to be returned block other requests to the server.
- Browsers limit the number of concurrent connections to the same host to six, causing bottlenecks.
- Some of this bottleneck is currently mitigated via the dashboard tabs feature, which defers loading of charts in background tabs. This lazy-loading could also be applied to charts outside of the viewport.
- There is a proposal for SQL Lab to move away from tabs in the UI, altering the current async query polling mechanisms.
- Provide a standardized query interface for dashboards, charts and SQL Lab queries.
Proposed Change
Provide a configuration setting to enable async data loading for charts (dashboards, Explore) and SQL Lab. Instead of multiple synchronous requests to the server to load dashboard charts, we issue async requests to the server which enqueue a background job and return a job ID. A single persistent websocket connection to the server is opened to listen for results in a realtime fashion.
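The submission contract above can be sketched as follows. This is a minimal in-memory illustration, not Superset's actual API: the function name, payload shape, and queue are hypothetical stand-ins for the Flask endpoint and Celery queue.

```python
import uuid
from queue import Queue

# Hypothetical sketch of the async submission contract: instead of blocking
# until result data is ready, the server enqueues a background job and
# immediately returns a job ID the client can correlate with websocket events.
job_queue: Queue = Queue()

def submit_chart_query(channel_id: str, query_payload: dict) -> dict:
    """Enqueue a chart data request and return a job handle (no result data)."""
    job_id = str(uuid.uuid4())
    job_queue.put({"job_id": job_id, "channel_id": channel_id, "payload": query_payload})
    # The HTTP response carries only identifiers; results arrive later as a
    # websocket notification on `channel_id`, then are fetched over HTTP.
    return {"job_id": job_id, "channel_id": channel_id, "status": "pending"}

response = submit_chart_query("channel-123", {"chart_id": 42})
print(response["status"])  # pending
```

The key property is that the HTTP request completes immediately; the browser never holds a connection open waiting for query results.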
Websockets via a sidecar application
Pros
- Excellent websocket browser support
- Realtime synchronous communication channel
- Browser connection limits are not an issue (255 WebSocket connections in Chrome)
- Supports bi-directional communication (possible future use cases)
- Authentication/authorization via initial HTTP request that is then upgraded to a WSS connection.
Cons
- Minor modifications to proxies and load balancers to support websocket connections (sticky sessions are not required at the load balancer if the reconnection story is sound)
- Requires persistent connection to the server for each tab
- A SharedWorker (to share one connection across tabs) was also evaluated, but browser support is lacking
- AWS “Classic” ELB does not support Websockets. Requires use of NLB (TCP) or ALB in AWS environments.
Approach
Each open tab of Superset would create a unique “channel” ID to subscribe to. A websocket connection is established with the standalone websocket server as an HTTP request that is then upgraded to wss:// if authentication with the main Flask app is successful. Requests for charts or queries in SQL Lab are sent via HTTP to the Flask web app, including the tab’s channel ID. The server enqueues a Celery job and returns a job ID to the caller. When the job completes, a notification is sent over the WSS connection, which triggers another HTTP request to fetch the cached data and display the results.
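The completion half of this flow can be sketched with an in-memory stand-in for the channel notification step (in the actual proposal this would be a Redis Stream consumed by the Node.js sidecar; all names here are illustrative):

```python
from collections import defaultdict
from typing import Callable

# Map of channel ID -> listeners; a stand-in for the sidecar's per-tab
# websocket connections.
subscribers: dict = defaultdict(list)

def subscribe(channel_id: str, on_event: Callable[[dict], None]) -> None:
    """A tab registers interest in events for its channel."""
    subscribers[channel_id].append(on_event)

def notify_job_done(channel_id: str, job_id: str, result_key: str) -> None:
    """Called when a query job finishes: push an event (not the data itself)
    to every listener on the tab's channel."""
    event = {"job_id": job_id, "status": "done", "result_key": result_key}
    for on_event in subscribers[channel_id]:
        on_event(event)

received = []
subscribe("channel-123", received.append)
notify_job_done("channel-123", "job-1", "cache:abc")
# The client would now issue an HTTP request to fetch `result_key` via Flask.
```

Note that only the notification travels over the channel; the data fetch itself goes back through the authenticated Flask app.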
Why a separate application?
The current Flask application is not well suited for persistent websocket connections. We have evaluated several Python and Flask-based solutions, including flask-SocketIO and others, and have found the architectural changes to the Flask web app to be overly invasive. For that reason, we propose that the websocket service be a standalone application, decoupled from the main Flask application, with minimal integration points. Superset’s extensive use of Javascript and the mature Node.js websocket libraries make Node.js and TypeScript (SIP-36) an obvious choice for implementing the sidecar application.
Reconnection
The nature of persistent connections is that they will, at some point, disconnect. The system should be able to reconnect and “catch up” on any missed events. We evaluated several PubSub solutions (mainly in Redis) that could enable this durable reconnection story, and have determined that Redis Streams (Redis ≥ 5.0) fits this use case well. By storing a last received event ID, the client can pass that ID when reconnecting to fetch all messages in the channel from that point forward. For security reasons, we should periodically force the client to reconnect to revalidate authentication status.
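The catch-up semantics can be illustrated with a tiny in-memory model of a stream. This is not Redis itself, just the contract the reconnection story relies on: monotonically increasing event IDs, and reads that resume after a last seen ID (roughly XADD / XREAD in Redis ≥ 5.0).

```python
# In-memory illustration of the Redis Streams catch-up semantics: each event
# gets a monotonically increasing ID, and a reconnecting client passes its
# last seen ID to receive everything published after it.
class ChannelStream:
    def __init__(self):
        self._events = []   # list of (event_id, payload)
        self._next_id = 1

    def add(self, payload: dict) -> int:
        """~ XADD channel * field value ..."""
        event_id = self._next_id
        self._next_id += 1
        self._events.append((event_id, payload))
        return event_id

    def read_after(self, last_id: int) -> list:
        """~ XREAD STREAMS channel <last_id>"""
        return [(i, p) for i, p in self._events if i > last_id]

stream = ChannelStream()
stream.add({"job_id": "a", "status": "done"})
stream.add({"job_id": "b", "status": "done"})
# The client saw event 1, disconnected, and reconnects with last_id=1:
missed = stream.read_after(1)
```

Real Redis Stream IDs are millisecond-timestamp/sequence pairs rather than plain integers, but the resume-after-ID property is the same.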
Why not send result data over the socket connection?
While it is possible to send result data over the websocket connection, keeping the scope of the standalone service to event notifications will reduce the security footprint of the sidecar application. Fetching (potentially sensitive) data will still require necessary authentication and authorization checks at load time by routing through the Flask web app. Sending large datasets over the websocket protocol introduces potential unknown performance and consistency issues of its own. Websockets are not streams, and “the client will only be notified about a new message once all of the frames have been received and the original message payload has been reconstructed.”
Query Cancellation
Queries may be “cancelled” by calling the /superset/stop_query endpoint in SQL Lab, which simply sets query.status = QueryStatus.STOPPED for the running query. This cancellation logic is currently implemented only for queries running against hive and presto databases. Queries that have been enqueued could be cancelled prior to executing the query by adding a check in the celery worker logic. It is also possible to revoke a Celery task, which will skip execution of the task, but it won’t terminate an already executing task. Due to the limited query cancellation support in DB-API drivers, some of which is discussed here, comprehensive query cancellation functionality should be explored in a separate SIP. That said, query cancellation requests may still be issued to the existing (or similar) endpoint when users intentionally navigate away from a dashboard or Explore view with charts in a loading state.
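The pre-execution check described above can be sketched as follows. This is a hedged illustration of the proposed worker-side guard, not Superset's actual worker code; terminating a query already running on the database is out of scope here.

```python
from enum import Enum

class QueryStatus(str, Enum):
    PENDING = "pending"
    RUNNING = "running"
    STOPPED = "stopped"
    SUCCESS = "success"

def run_query_job(query: dict) -> str:
    """Hypothetical worker entry point: a query 'stopped' via
    /superset/stop_query while still waiting in the queue is never executed."""
    if query["status"] == QueryStatus.STOPPED:
        return "skipped"            # cancelled before the worker picked it up
    query["status"] = QueryStatus.RUNNING
    # ... execute against the database, write results to the cache ...
    query["status"] = QueryStatus.SUCCESS
    return "executed"

q = {"id": 1, "status": QueryStatus.STOPPED}
print(run_query_job(q))  # skipped
```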
Query deduplication
The query table in the Superset metadata database currently includes only queries run via SQL Lab. Adapting this for use with dashboards and charts may have a larger impact than we’re willing to accept at this time. Instead, a separate key-value store (e.g. Redis) may be used for tracking and preventing duplicate queries. Using a fast KV store also allows us to check for duplicate queries more efficiently in the web request cycle.
Each query issued to the backend can be fingerprinted using a hashing algorithm (SHA-256 or similar) to generate a unique key based on the following:
- Normalized SQL query text
- Database ID
- Database schema
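The three inputs above can be combined into a fingerprint as in the sketch below. The normalization shown (whitespace collapsing and lowercasing) is an assumption for illustration; a real implementation might also strip comments or parse the SQL.

```python
import hashlib
import re

def query_fingerprint(sql: str, database_id: int, schema: str) -> str:
    """Illustrative SHA-256 fingerprint over normalized SQL text,
    database ID, and schema, per the list above."""
    normalized_sql = re.sub(r"\s+", " ", sql.strip()).lower()
    material = f"{normalized_sql}|{database_id}|{schema}"
    return hashlib.sha256(material.encode("utf-8")).hexdigest()

# Formatting differences yield the same key, so the queries deduplicate:
a = query_fingerprint("SELECT *\n  FROM logs", 1, "public")
b = query_fingerprint("select * from logs", 1, "public")
assert a == b
```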
Prior to executing a query, a hash (key) is generated and checked against a key-value store. If the key does not exist, it is stored with a configured TTL, containing an object value with the following properties:
- Query state (running, success, error)
- Query type (chart, sql_lab)
- Result key (cache key)
Another key is created to track the channels and job IDs that should be notified when this query completes (e.g. <hash>_jobs → List[ChannelId:JobId]). If a duplicate query is issued while the original is still running, the job ID is pushed onto the list, and all relevant channels are notified via websocket when the query completes. If a query is issued and a cache key exists with state == "success", a notification is triggered immediately via websocket to the client. If queries are “force” refreshed, query deduplication is performed only for currently running queries.
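The whole deduplication flow can be sketched with an in-memory dictionary standing in for Redis (a real deployment would use SETNX/EXPIRE for the state key and RPUSH on the "<hash>_jobs" list; names and shapes here are illustrative):

```python
# In-memory sketch of the deduplication flow described above.
kv: dict = {}

def submit(query_hash: str, channel_id: str, job_id: str, notify) -> str:
    entry = kv.get(query_hash)
    if entry is None:
        # First submission: record state and run the query.
        kv[query_hash] = {"state": "running", "result_key": None}
        kv[query_hash + "_jobs"] = [f"{channel_id}:{job_id}"]
        return "enqueued"
    if entry["state"] == "success":
        # Results already cached: notify this client immediately.
        notify(channel_id, job_id, entry["result_key"])
        return "cached"
    # Identical query already running: piggyback on it.
    kv[query_hash + "_jobs"].append(f"{channel_id}:{job_id}")
    return "deduplicated"

def complete(query_hash: str, result_key: str, notify) -> None:
    """When the one underlying query finishes, fan out to every waiter."""
    kv[query_hash].update(state="success", result_key=result_key)
    for entry in kv.pop(query_hash + "_jobs", []):
        channel_id, job_id = entry.split(":")
        notify(channel_id, job_id, result_key)

sent = []
notify = lambda ch, job, key: sent.append((ch, job, key))
submit("h1", "chanA", "job1", notify)   # "enqueued"
submit("h1", "chanB", "job2", notify)   # "deduplicated"
complete("h1", "cache:h1", notify)      # both channels notified once
```

Note that only one query executes no matter how many dashboards request it concurrently, which is the point of the dedup layer. A TTL on the state key (omitted here) bounds how long a stale "running" entry can block resubmission.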
New or Changed Public Interfaces
- New configuration options
- Additional API endpoints (see Migration Plan, below)
New dependencies (optional)
- Redis ≥ 5.0 (pubsub, Redis Streams) or alternative
- Node.js runtime for websocket application
Migration Plan and Compatibility
Asynchronous query operations are currently an optional enhancement to Superset, and should remain that way. Configuring and running Celery workers should not be required for basic Superset operation. As such, this proposed websocket approach to async query operations should be an optional enhancement to Superset, available via a configuration flag.
For users who opt to run Superset in full async mode, the following requirements will apply under the current proposal:
- Chart cache backend
- SQL Lab results backend
- Pubsub backend (current recommendation: Redis Streams)
- Running the sidecar Node.js websocket application
Browsers that do not support websockets (very few) should fall back to synchronous operation or short polling.
Migration plan:
- Introduce new configuration options to enable asynchronous operation in Superset. Pass these values to the frontend as feature flags (as appropriate)
- Introduce new API endpoints in Superset’s Flask web application for async query/data loading for charts and queries
- Introduce new API endpoints in Superset’s Flask web application required for interaction with the sidecar application
- Build and deploy the Node.js websocket server application
- Build and deploy async frontend code to consume websocket events, under feature flag
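As an illustration of the first migration step, a `superset_config.py` fragment might look like the following. All flag and option names here are hypothetical placeholders for this proposal, not finalized settings.

```python
# superset_config.py — illustrative only; names below are hypothetical.
FEATURE_FLAGS = {
    # Gates the entire async query path (backend and frontend).
    "GLOBAL_ASYNC_QUERIES": True,
}

# Hypothetical connection details for the sidecar websocket application
# and the Redis Streams backend it reads from:
GLOBAL_ASYNC_QUERIES_WEBSOCKET_URL = "wss://superset.example.com/ws"
GLOBAL_ASYNC_QUERIES_REDIS_STREAM_PREFIX = "async-events-"
```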
Rejected Alternatives
SSE (EventSource) over HTTP/2
Server-Sent Events (SSE) streams data over a multiplexed HTTP/2 connection. SSE is a native HTML5 feature that allows the server to keep the HTTP connection open and push data to the client. Thanks to the multiplexing feature of the HTTP/2 protocol, the number of concurrent requests per domain is not limited to 6-8 but is virtually unlimited.
Pros
- Single, multiplexed connection to the server (via HTTP/2)
- Connections can be limited via SETTINGS_MAX_CONCURRENT_STREAMS, but this should be no smaller than 100
- Events pushed to the client from the server in realtime
- Supports streaming data and traditional HTTP authentication methods
- Minor adjustments to proxy, load balancers
- Can use async generators with Flask & gevent/eventlet (untested)
Cons
- HTTP/2 not supported in some browsers (IE)
- SSE requires polyfill for some browsers
- Requires persistent connection to the server for each tab
- HTTP/2 requires TLS (not really a con, per se)
- If SSE is used without HTTP/2, persistent SSE connections may saturate browser connection limit with multiple tabs
- AWS “Classic” ELB does not support HTTP/2. ALB supports HTTP/2, but apparently converts it to HTTP/1.1 upstream (AWS-specific)
- Does not support upstream messaging (from the client to the server)
NOTE: HTTP/2 multiplexing could still potentially be valuable alongside the websocket features, and should be investigated further.
Long polling (aka Comet)
Pros
- Excellent browser support (a standard HTTP request)
Cons
- Requires long-lived connections
- Browser limit on concurrent connections to the same domain (6-8) results in connection blocking (HTTP/1.1)
Short polling
Pros
- Excellent browser support (a standard HTTP request)
- It’s the current solution for async queries in SQL Lab
Cons
- Browser limit on concurrent connections to the same domain (6-8) results in connection blocking (HTTP/1.1)
- Header overhead: every poll request and response contains a full set of HTTP headers
- Auth overhead: each polling request must be authenticated and authorized on the server
Thanks to @etr2460 @suddjian @nytai @willbarrett @craig-rueda for feedback and review.
Update 2020-03-10
Per the below discussion, an abstracted interface will be used on the client in order to support transport mechanisms other than Websockets and the proposed sidecar application. The final form of this abstraction will take shape during implementation, but the goal will be to have UI elements interact with the generic interface, while configuration will determine which transport is used underneath.
Top GitHub Comments
@DiggidyDave the abstraction approach sounds good, as it appears that Socket.io is not really appropriate for this use case. I’ve updated the body of the SIP with notes on the client-side abstraction.
@DiggidyDave The use cases for long-running queries in SQL Lab and loading dashboards in a performant manner feel like very different use cases. Polling could (and currently does) satisfy the SQL Lab use case, though there is significant room for improvement there. The latency introduced by polling in dashboards would be a much larger impact to user experience.
Agreed that this discussion is healthy! Architecture proposals should be actively debated and scrutinized so we all benefit from better informed decision-making.
@williaster @superset-ui will need to be upgraded to also support WebSocket connections. This upgrade should be backwards compatible and configured via feature flags. CORS does not apply to WebSockets, but securing the WS connections by validating origin domain is recommended, so we should keep the cross-domain use case in mind during implementation of these security restrictions.