[SIP-16] Proposal for loading additional polygon encodings
Motivation
When analyzing geospatial data, a common workflow consists of grouping the data across different spatial dimensions, e.g., ZIP code, city, state or country. In order to visualize the results, a user can select either the “Country Map” or the “deck.gl Polygon” visualization. Neither of them is appropriate for exploring the data across different hierarchical regions: the former is limited to countries only, while the latter requires all shapes to be pre-joined (except for geohash), adding duplicate data to the data source.
Proposed Change
Current workflow
In this SIP I propose a way of adding new encodings to the “deck.gl Polygon” visualization. Currently, it supports:
- Polyline
- JSON
- geohash (square)
The first two require the shape to be present in the datasource, while the third computes the shape on the fly from a column value. In order to explore a dataset across a spatial hierarchy, the data would have to be pre-joined with all polygons in the datasource, e.g.:
timestamp | country | country_polygon | state | state_polygon | city | city_polygon
---|---|---|---|---|---|---
1549410628 | US | {…} | CA | {…} | San Francisco | {…}
1549410675 | US | {…} | CA | {…} | San Francisco | {…}
This is clearly inefficient.
Proposed workflow
I propose an alternative workflow where the polygon shape is joined in the Python backend (`viz.py`). This is similar to how the geohash encoding currently works: it’s computed on the fly by the Python backend based on the value of a column, and sent to the frontend in the payload. The approach described here has been used at Lyft for US and Canada postal codes for more than 6 months; see https://github.com/apache/incubator-superset/commit/9c10547f19b628e81cbcd6e1fbac86a70ea510be for the US ZIP code implementation.
Note that the current approach for geohash is still inefficient, since it sends the joined data to the browser. When a granularity is selected, enabling the play slider, this results in duplicate data being sent. It would be better to send the polygon shapes in a separate attribute of the payload and perform the join in the browser instead.
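As a rough sketch, a payload in the proposed shape might look like this (the `features` and `polygons` keys are illustrative, not an actual Superset schema):

```python
# Hypothetical payload shape: each row carries only the region code, and
# every polygon is serialized once in a separate attribute, keyed by code.
payload = {
    'features': [
        {'__timestamp': 1549410628, 'zipcode': '94103', 'metric': 10},
        {'__timestamp': 1549414228, 'zipcode': '94103', 'metric': 12},
    ],
    'polygons': {
        '94103': [
            [-122.426, 37.778], [-122.398, 37.778], [-122.398, 37.764],
            [-122.426, 37.764], [-122.426, 37.778],
        ],
    },
}
```

The browser would then join each row to its shape by code, so a shape is transferred once no matter how many time slices reference it.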
In the new workflow, users will be able to specify new encodings in `config.py` (or `superset_config.py`). Each encoding is defined by an adapter class responsible for serializing to JSON the shape associated with the region column. E.g., if the geohash encoding didn’t exist, we would implement it in the proposed system as follows:
```python
import geohash

from superset.polygon import PolygonEncoding


class GeohashEncoding(PolygonEncoding):
    name = 'geohash (square)'

    @staticmethod
    def to_location(codes):
        # Decode each geohash into its center point, yielded as (lon, lat).
        for code in codes:
            lat, lon = geohash.decode(code)
            yield lon, lat

    @staticmethod
    def to_polygon(codes):
        # Decode each geohash into its bounding box, yielded as a closed
        # polygon of [lon, lat] vertices.
        for code in codes:
            p = geohash.bbox(code)
            yield [
                [p.get('w'), p.get('n')],
                [p.get('e'), p.get('n')],
                [p.get('e'), p.get('s')],
                [p.get('w'), p.get('s')],
                [p.get('w'), p.get('n')],
            ]
```
This would be registered in `config.py`:
```python
FEATURE_FLAGS = {
    'EXTRA_POLYGON_ENCODINGS': [GeohashEncoding],
}
```
Other adapters might perform database queries in order to fetch the polygon associated with each value, which is why the methods take a list of codes rather than a single one: batching keeps the queries efficient. At Lyft we cache the shapes, fetching from the database only the values that are missing.
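As an illustration of that pattern, here’s a minimal sketch of a caching adapter; `fetch_polygons_from_db` is a hypothetical helper, and the in-process dict stands in for whatever cache a real deployment would use:

```python
from superset.polygon import PolygonEncoding  # interface proposed in this SIP


def fetch_polygons_from_db(codes):
    """Hypothetical helper: query a GIS database for the shapes of `codes`,
    returning a dict mapping each code to its polygon."""
    ...


class USZipEncoding(PolygonEncoding):
    name = 'US ZIP code'

    _cache = {}  # process-level cache of code -> polygon

    @classmethod
    def to_polygon(cls, codes):
        # Hit the database only for codes we haven't cached yet.
        missing = [code for code in set(codes) if code not in cls._cache]
        if missing:
            cls._cache.update(fetch_polygons_from_db(missing))
        for code in codes:
            yield cls._cache.get(code)
```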
New or Changed Public Interfaces
This SIP affects only the “deck.gl Polygon” visualization type. The Python backend will use the adapter classes, and the frontend will display the new encodings. Here’s a screenshot showing US ZIP codes and Canada FSAs:

[screenshot: US ZIP codes and Canada FSAs rendered by the “deck.gl Polygon” visualization]
Even though this is a small feature, one of the reasons I’m proposing it as a SIP is that it introduces new logic to `viz.py`, and I’m unsure how that will affect embeddable components. (👀 @xtinec @kristw @williaster)
New dependencies
No new dependencies are needed.
Migration Plan and Compatibility
Not necessary.
Rejected Alternatives
My initial implementation of a visualization for ZIP codes was a separate custom visualization. It was hard to maintain (in part because of merge conflicts) and redundant, requiring a lot of duplicate work as features were added to the “deck.gl Polygon” visualization. At some point last year I merged the functionality into the deck.gl visualization.
Future work
There was a discussion between @mistercrunch and me where we considered creating “spatial columns” in the datasource configuration, similar to how metrics or derived columns can be created. E.g., a datasource with the following columns:
- pickup_lat
- pickup_lon
- dropoff_geohash
would be configured to have 2 spatial columns: “pickup”, composed from `pickup_lat` and `pickup_lon`, and “dropoff”, derived from `dropoff_geohash`.
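Purely as illustration, such a configuration might be expressed like this (the format is hypothetical; no such interface exists today):

```python
# Hypothetical spatial-column definitions for the datasource above.
SPATIAL_COLUMNS = [
    {'name': 'pickup', 'type': 'latlong', 'lat': 'pickup_lat', 'lon': 'pickup_lon'},
    {'name': 'dropoff', 'type': 'geohash', 'column': 'dropoff_geohash'},
]
```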
If we had that mechanism for spatial columns in place, it would be useful to be able to load a series of polygons using the CLI, e.g.:

```bash
# this downloads 2 GB of data and stores it in the main database
$ superset load_polygon US_zip
```
Then, in the spatial configuration, the user would be able to select a column and mark it as being of type “US_zip”, and the “deck.gl Polygon” visualization would just work. We could provide a list of common polygons (ZIP, city, state, country), and users would be able to load their own. This way, the “Country Map” visualization could be deprecated in favor of the deck.gl one.
The downside of this approach is that the shapes would be stored in the main database, which might be inefficient: at Lyft we use Postgres with GIS extensions for the US ZIP codes, but MySQL for the main database.
Top GitHub Comments
@kristw:
Imagine you want to look at a metric per ZIP code per hour. Even though `zipcode` and `zipcode_geojson` are time-independent, they are repeated in the dataset for each hour. The polygons might be complex shapes, so this is a lot of duplicate data. Instead, we just pass the ZIP code itself. There’s still some duplicate data, but now it’s only 5 bytes per row.
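For illustration (values are invented), pre-joined rows would repeat the shape every hour:

hour | zipcode | zipcode_geojson | metric
---|---|---|---
0 | 94103 | {…} | 10
1 | 94103 | {…} | 12

whereas rows carrying only the code repeat just the 5-character ZIP:

hour | zipcode | metric
---|---|---
0 | 94103 | 10
1 | 94103 | 12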
These are functions from the `geohash` module. The first one decodes a geohash into a lat/lon pair, the second one into a bounding box, IIRC. `code` is the geohash code, something like `9q8yyu`. Let me know if you have suggestions for a better name; it should represent a geohash, a ZIP code, or an FSA code.

+1 on this.
+1 on this, but I’m curious if we have a plan to handle cases like this, where some of the logic lives in the backend.
Closing for now… @betodealmeida please feel free to reopen this if you want to rekindle it.