[SIP-16] Proposal for loading additional polygon encodings
Motivation
When analyzing geospatial data, a common workflow consists of grouping the data across different spatial dimensions, e.g., ZIP code, city, state or country. In order to visualize the results, a user can select either the “Country Map” or the “deck.gl Polygon” visualization. Neither of them is appropriate for exploring the data across different hierarchical regions: the former is limited to countries only, while the latter requires all shapes to be pre-joined (except for geohash), adding duplicate data to the data source.
Proposed Change
Current workflow
In this SIP I propose a way of adding new encodings to the “deck.gl Polygon” visualization. Currently, it supports:
- Polyline
- JSON
- geohash (square)
The first two require the shape to be present in the datasource, while the third computes the shape on the fly from a column value. In order to explore a dataset across a spatial hierarchy, the data would have to be pre-joined with all polygons in the datasource, e.g.:
timestamp | country | country_polygon | state | state_polygon | city | city_polygon
---|---|---|---|---|---|---
1549410628 | US | {…} | CA | {…} | San Francisco | {…}
1549410675 | US | {…} | CA | {…} | San Francisco | {…}
This is clearly inefficient.
Proposed workflow
I propose an alternative workflow where the polygon shape is joined in the Python backend (`viz.py`). This is similar to how the geohash encoding currently works: it’s computed on the fly by the Python backend based on the value of a column, and sent to the frontend in the payload. The approach described here has been used at Lyft for US and Canada postal codes for more than 6 months; see https://github.com/apache/incubator-superset/commit/9c10547f19b628e81cbcd6e1fbac86a70ea510be for the US ZIP code implementation.
Note that the current approach for geohash is still inefficient, since it sends the joined data to the browser. When a granularity is selected, enabling the play slider, this results in duplicate data being sent. It would be better to send the polygon shapes in a separate attribute of the payload and perform the join in the browser instead.
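As a rough sketch, a payload in the proposed shape might look like this (the `features` and `polygons` keys are illustrative, not an actual Superset schema):

```python
# Hypothetical payload shape: each row carries only the region code, and
# every polygon is serialized once in a separate attribute, keyed by code.
payload = {
    'features': [
        {'__timestamp': 1549410628, 'zipcode': '94103', 'metric': 10},
        {'__timestamp': 1549414228, 'zipcode': '94103', 'metric': 12},
    ],
    'polygons': {
        '94103': [
            [-122.426, 37.778], [-122.398, 37.778], [-122.398, 37.764],
            [-122.426, 37.764], [-122.426, 37.778],
        ],
    },
}
```

The browser would then join each row to its shape by code, so a shape is transferred once no matter how many time slices reference it.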
In the new workflow, users will be able to specify new encodings in `config.py` (or `superset_config.py`). Each encoding is defined by an adapter class responsible for serializing to JSON the shape associated with the region column. E.g., if the geohash encoding didn’t exist, we would implement it in the proposed system as follows:
```python
import geohash

from superset.polygon import PolygonEncoding


class GeohashEncoding(PolygonEncoding):
    name = 'geohash (square)'

    @staticmethod
    def to_location(codes):
        # Decode each geohash into its center point, yielded as (lon, lat).
        for code in codes:
            lat, lon = geohash.decode(code)
            yield lon, lat

    @staticmethod
    def to_polygon(codes):
        # Decode each geohash into its bounding box, yielded as a closed
        # polygon of [lon, lat] vertices.
        for code in codes:
            p = geohash.bbox(code)
            yield [
                [p.get('w'), p.get('n')],
                [p.get('e'), p.get('n')],
                [p.get('e'), p.get('s')],
                [p.get('w'), p.get('s')],
                [p.get('w'), p.get('n')],
            ]
```
This would be registered in `config.py`:
```python
FEATURE_FLAGS = {
    'EXTRA_POLYGON_ENCODINGS': [GeohashEncoding],
}
```
Other adapters might perform database queries in order to fetch the polygon associated with each value, which is why the methods take a list of codes rather than a single one: batching keeps the queries efficient. At Lyft we cache the shapes, fetching from the database only the values that are missing.
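As an illustration of that pattern, here’s a minimal sketch of a caching adapter; `fetch_polygons_from_db` is a hypothetical helper, and the in-process dict stands in for whatever cache a real deployment would use:

```python
from superset.polygon import PolygonEncoding  # interface proposed in this SIP


def fetch_polygons_from_db(codes):
    """Hypothetical helper: query a GIS database for the shapes of `codes`,
    returning a dict mapping each code to its polygon."""
    ...


class USZipEncoding(PolygonEncoding):
    name = 'US ZIP code'

    _cache = {}  # process-level cache of code -> polygon

    @classmethod
    def to_polygon(cls, codes):
        # Hit the database only for codes we haven't cached yet.
        missing = [code for code in set(codes) if code not in cls._cache]
        if missing:
            cls._cache.update(fetch_polygons_from_db(missing))
        for code in codes:
            yield cls._cache.get(code)
```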
New or Changed Public Interfaces
This SIP affects only the “deck.gl Polygon” visualization type. The Python backend will use the adapter classes, and the frontend will display the new encodings. Here’s a screenshot showing US ZIP codes and Canada FSAs:

[screenshot: US ZIP codes and Canada FSAs rendered by the “deck.gl Polygon” visualization]
Even though this is a small feature, one of the reasons I’m proposing it as a SIP is that it introduces new logic to `viz.py`, and I’m unsure how that will affect embeddable components. (👀 @xtinec @kristw @williaster)
New dependencies
No new dependencies are needed.
Migration Plan and Compatibility
Not necessary.
Rejected Alternatives
My initial implementation of a visualization for ZIP codes was a separate custom visualization. It was hard to maintain (in part because of merge conflicts) and redundant, requiring a lot of duplicate work as features were added to the “deck.gl Polygon” visualization. At some point last year I merged the functionality into the deck.gl visualization.
Future work
There was a discussion between @mistercrunch and me where we considered creating “spatial columns” in the datasource configuration, similar to how metrics or derived columns can be created. E.g., a datasource with the following columns:
- pickup_lat
- pickup_lon
- dropoff_geohash
would be configured to have 2 spatial columns: “pickup”, composed from `pickup_lat` and `pickup_lon`, and “dropoff”, derived from `dropoff_geohash`.
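Purely as illustration, such a configuration might be expressed like this (the format is hypothetical; no such interface exists today):

```python
# Hypothetical spatial-column definitions for the datasource above.
SPATIAL_COLUMNS = [
    {'name': 'pickup', 'type': 'latlong', 'lat': 'pickup_lat', 'lon': 'pickup_lon'},
    {'name': 'dropoff', 'type': 'geohash', 'column': 'dropoff_geohash'},
]
```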
If we had that mechanism for spatial columns in place, it would be useful to be able to load a series of polygons using the CLI, e.g.:

```bash
# this downloads 2 GB of data and stores it in the main database
$ superset load_polygon US_zip
```
Then, in the spatial configuration, the user would be able to select a column and mark it as being of type “US_zip”, and the “deck.gl Polygon” visualization would just work. We could provide a list of common polygons (ZIP, city, state, country), and users would be able to load their own. This way, the “Country Map” visualization could be deprecated in favor of the deck.gl one.
The downside of this approach is that the shapes would be stored in the main database, which might be inefficient: at Lyft we use Postgres with GIS extensions for the US ZIP codes, but MySQL for the main database.
Top GitHub Comments
@kristw:
Imagine you want to look at a metric per ZIP code per hour. Even though `zipcode` and `zipcode_geojson` are time-independent, they are repeated in the dataset for each hour. The polygons might be complex shapes, so this is a lot of duplicate data. Instead, we just pass the ZIP code itself. There’s still some duplicate data, but now it’s only 5 bytes per row.
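For illustration (values are invented), pre-joined rows would repeat the shape every hour:

hour | zipcode | zipcode_geojson | metric
---|---|---|---
0 | 94103 | {…} | 10
1 | 94103 | {…} | 12

whereas rows carrying only the code repeat just the 5-character ZIP:

hour | zipcode | metric
---|---|---
0 | 94103 | 10
1 | 94103 | 12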
These are functions from the `geohash` module. The first one decodes a geohash into a lat/lon pair, the second one into a bounding box, IIRC. `code` is the geohash code, something like `9q8yyu`. Let me know if you have suggestions for a better name; it should represent a geohash, a ZIP code, or an FSA code.

+1 on this.
+1 on this, but I’m curious if we have a plan to handle cases like this, where some of the logic lives in the backend.
Closing for now… @betodealmeida please feel free to reopen this if you want to rekindle it.