Support GeoPandas GeoDataFrames
Context:
- Datashader requires plain arrays of coordinates (“ragged arrays” in case of lines/polygons, with additional offset/indices arrays) to efficiently visualize geometries (and this is also the representation that SpatialPandas uses under the hood).
- GeoPandas stores geometries as “opaque” Shapely objects (wrapping GEOS C++ objects), and converting these to an array of coordinates is currently not always very efficient (although it’s certainly possible; it’s also what spatialpandas does in its from_geopandas conversion code).
With the latest release of PyGEOS, the conversion from geometries to (ragged) coordinate arrays can be done much more efficiently, though.
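To make the target layout concrete, here is a minimal hand-built sketch of the nested ragged representation for MultiPolygon data: one flat coordinate list plus three offset lists (one per nesting level, playing the roles of offsets1, offsets2 and offsets3 in the functions in this post). The geometry values and the helper name `rings_of` are made up for illustration; this only mirrors the layout described above and is not any library's internal code.

```python
# Two multipolygons, hand-built for illustration:
#   geometry 0: one polygon with a single (exterior) ring
#   geometry 1: two polygons; the first has an exterior ring and a hole
coords = [
    (0, 0), (1, 0), (1, 1), (0, 0),          # geom 0, polygon 0, ring 0
    (5, 5), (9, 5), (9, 9), (5, 5),          # geom 1, polygon 0, ring 0
    (6, 6), (7, 6), (7, 7), (6, 6),          # geom 1, polygon 0, ring 1 (hole)
    (20, 20), (21, 20), (21, 21), (20, 20),  # geom 1, polygon 1, ring 0
]
# ring i covers coords[ring_offsets[i]:ring_offsets[i + 1]]
ring_offsets = [0, 4, 8, 12, 16]
# polygon j covers rings polygon_offsets[j]:polygon_offsets[j + 1]
polygon_offsets = [0, 1, 3, 4]
# geometry k covers polygons geom_offsets[k]:geom_offsets[k + 1]
geom_offsets = [0, 1, 3]


def rings_of(k):
    """Return geometry k as a list of polygons, each a list of rings."""
    polygons = []
    for j in range(geom_offsets[k], geom_offsets[k + 1]):
        rings = []
        for i in range(polygon_offsets[j], polygon_offsets[j + 1]):
            rings.append(coords[ring_offsets[i]:ring_offsets[i + 1]])
        polygons.append(rings)
    return polygons
```

Walking the offsets like this never touches a per-geometry Python object, which is the property that makes the layout attractive for array-based rendering code.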
Function using PyGEOS to convert array of GEOS geometries to arrays of coordinates / offsets (+ putting those in a spatialpandas array)
import numpy as np
import pyarrow as pa
import pygeos
import spatialpandas.geometry


def get_flat_coords_offset_arrays(arr):
    """
    Version for MultiPolygon data
    """
    # explode/flatten the MultiPolygons
    arr_flat, part_indices = pygeos.get_parts(arr, return_index=True)
    # the offsets into the multipolygon parts
    offsets1 = np.insert(np.bincount(part_indices).cumsum(), 0, 0)

    # explode/flatten the Polygons into Rings
    arr_flat2, ring_indices = pygeos.geometry.get_rings(arr_flat, return_index=True)
    # the offsets into the exterior/interior rings of the multipolygon parts
    offsets2 = np.insert(np.bincount(ring_indices).cumsum(), 0, 0)

    # the coords and offsets into the coordinates of the rings
    coords, indices = pygeos.get_coordinates(arr_flat2, return_index=True)
    offsets3 = np.insert(np.bincount(indices).cumsum(), 0, 0)

    return coords, offsets1, offsets2, offsets3


def spatialpandas_from_pygeos(arr):
    """
    Create the actual spatialpandas MultiPolygonArray by putting the individual arrays
    into a pyarrow ListArray
    """
    coords, offsets1, offsets2, offsets3 = get_flat_coords_offset_arrays(arr)
    # flatten the (n, 2) coordinates to interleaved x, y values
    coords_flat = coords.ravel()
    # each coordinate pair now occupies two slots in the flat array
    offsets3 *= 2

    # create a pyarrow array from this
    _parr3 = pa.ListArray.from_arrays(pa.array(offsets3), pa.array(coords_flat))
    _parr2 = pa.ListArray.from_arrays(pa.array(offsets2), _parr3)
    parr = pa.ListArray.from_arrays(pa.array(offsets1), _parr2)

    return spatialpandas.geometry.MultiPolygonArray(parr)
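The offsets in the functions above all come from the same NumPy idiom, so a small worked example may help. The index values below are made up; they have the form of the parent indices returned with `return_index=True`. The `minlength` variant is my own suggestion rather than part of the proof of concept: `np.bincount` only returns counts up to the largest index it sees, so if the input ended with geometries that contribute no parts, the plain version would yield too few offsets. I have not checked how the functions above behave on such input.

```python
import numpy as np

# Parent index of each exploded part: parts 0-1 belong to geometry 0,
# part 2 to geometry 1, parts 3-5 to geometry 2 (made-up values).
part_indices = np.array([0, 0, 1, 2, 2, 2])

# Number of parts per geometry, then a running total with a leading 0.
counts = np.bincount(part_indices)
offsets = np.insert(counts.cumsum(), 0, 0)

# Hypothetical case: a fourth geometry at the end that has no parts.
# minlength keeps one count per input geometry, so the empty geometry
# gets a zero-length range instead of being dropped.
n_geoms = 4
offsets_padded = np.insert(
    np.bincount(part_indices, minlength=n_geoms).cumsum(), 0, 0
)
```

Geometry k then owns the parts in `offsets[k]:offsets[k + 1]`, and the same construction is repeated for rings and for coordinates.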
With such a faster conversion available, it becomes more interesting for Datashader to directly support geopandas.GeoDataFrame, instead of requiring an up-front conversion to spatialpandas.GeoDataFrame.
Currently, the spatialpandas requirement is hardcoded in polygons().
Adding support for GeoPandas can be done, using the function I defined above, with something like (leaving aside imports of geopandas/pygeos):
         from spatialpandas import GeoDataFrame
         from spatialpandas.dask import DaskGeoDataFrame
         if isinstance(source, DaskGeoDataFrame):
             # Downselect partitions to those that may contain polygons in viewport
             x_range = self.x_range if self.x_range is not None else (None, None)
             y_range = self.y_range if self.y_range is not None else (None, None)
             source = source.cx_partitions[slice(*x_range), slice(*y_range)]
+        elif isinstance(source, geopandas.GeoDataFrame):
+            # Downselect actual rows to those for which the polygon is in viewport
+            x_range = self.x_range if self.x_range is not None else (None, None)
+            y_range = self.y_range if self.y_range is not None else (None, None)
+            source = source.cx[slice(*x_range), slice(*y_range)]
+            # Convert the subset to ragged array format of spatialpandas
+            geometries = spatialpandas_from_pygeos(source.geometry.array.data)
+            source = pd.DataFrame(source)
+            source["geometry"] = geometries
         elif not isinstance(source, GeoDataFrame):
             raise ValueError(
                 "source must be an instance of spatialpandas.GeoDataFrame or \n"
This patch is what I tried in the following notebook, first using a smaller countries/provinces dataset from NaturalEarth, and then with a larger NYC building footprints dataset (similar to https://examples.pyviz.org/nyc_buildings/nyc_buildings.html).
Notebook: https://nbviewer.jupyter.org/gist/jorisvandenbossche/3e7ce14cb5118daa0f6097d686981c9f
Some observations:
- This actually works nicely!
- Initial rendering with datashader is a bit slower when directly using geopandas.GeoDataFrame because of the extra conversion step. But, the conversion takes less time than the actual rendering, so it’s only a relatively small slowdown.
- Zooming into small areas is really fast with the large dataset, and actually faster than using spatialpandas.GeoDataFrame (because I added a .cx spatial subsetting step in my patch above, filtering the data before rendering). For spatialpandas, such subsetting is only added for the dask version.
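The subsetting step in the patch depends on `slice(*x_range)` turning the `(None, None)` fallback into an unbounded slice, so an unset viewport filters nothing. A minimal sketch of just that piece, with a hypothetical helper name (`viewport_slices`) and made-up coordinates, and without GeoPandas itself:

```python
def viewport_slices(x_range=None, y_range=None):
    """Build the (x, y) slices for .cx-style indexing (hypothetical helper)."""
    x_range = x_range if x_range is not None else (None, None)
    y_range = y_range if y_range is not None else (None, None)
    return slice(*x_range), slice(*y_range)


# No viewport set: both slices are unbounded.
xs, ys = viewport_slices()

# Viewport set on x only: x is bounded, y stays unbounded.
xs2, ys2 = viewport_slices(x_range=(-74.05, -73.90))
```

In the patch, the resulting pair is what gets passed to `source.cx[...]`.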
Gif of the notebook in action (the buildings dataset is fully loaded in memory, and not parallelized with dask, unlike the PyViz gallery example), interactively zooming into a GeoPandas dataframe with Datashader and Holoviews:
(note this was done a bit manually with Holoviews DynamicMap and a callback with Datashader code, because the integrated datashade functionality of Holoviews/HvPlot wouldn’t preserve the geopandas.GeoDataFrame with the current versions)
So, what’s the way forward here? I think I showed that it can be useful for Datashader to directly support GeoPandas, and that it can also be done with a relatively small change to Datashader. The big question, though, is about the “GEOS -> ragged coordinate arrays -> spatialpandas array” conversion. Where should this live, and how should Datashader and GeoPandas interact?
Some initial thoughts about this:
- The quickest way to get this working is to do this conversion in Datashader. The above functions only rely on pygeos (which the user will already have when using GeoPandas) and on pyarrow/spatialpandas (which are already requirements for this part of Datashader). But, long term, is this code that Datashader wants to maintain? Or is there a more logical place for this code?
- It could also live in SpatialPandas, since they already have code for such conversion of GeoPandas <-> SpatialPandas (and it would optimize its current implementation of that). But, should a user need to have SpatialPandas to plot GeoPandas with Datashader? (see also last bullet point)
- Alternatively, GeoPandas could add a function or method that converts its geometries into this required format, and then Datashader can call that method to get the data it needs. Long term, this might be the better solution (since other projects interacting with geopandas might also want to get the geometries in this format).
- How to communicate this data? In the current POC version included above, I first get the raw coordinate and offset arrays, and then convert them into a pyarrow.ListArray to then convert it to a spatialpandas MultiPolygonArray. But in the end, what Datashader needs is only the raw coordinates and offsets arrays.
For example, for rendering polygons, you access .buffer_values and .buffer_offsets of the MultiPolygonArray, which gives back the raw coordinate and offset arrays. So in theory, this roundtrip through pyarrow and spatialpandas is not needed, and some method could convert GeoPandas geometries into coordinate/offset arrays, which could be directly handled by datashader as is. This would however require a few more changes in datashader in the way that data gets passed down from Canvas.polygons() into the glyph rendering (as currently that uses the spatialpandas array as container for the coordinates/offsets).
One possible idea (relating to the third bullet point) is to standardize on some kind of __geo_arrow_arrays__ interface (returning the coordinate + offset arrays), similar to the existing __geo_interface__ that returns the geometries as a GeoJSON-like dictionary (and which can be used now for accepting any “geometry-like” object even from libraries you don’t know).
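To illustrate what such an interface could look like from both sides, here is a rough sketch. Everything in it is hypothetical: `__geo_arrow_arrays__` does not exist in any library, and the return shape (flat coordinates plus a tuple of offset arrays), the toy class and the consumer function are assumptions made only for this example.

```python
class TinyPolygonColumn:
    """Toy geometry container exposing the hypothetical hook."""

    def __init__(self, coords, offsets):
        self._coords = coords
        self._offsets = offsets

    def __geo_arrow_arrays__(self):
        # Assumed return shape: (flat coordinates, tuple of offset arrays).
        return self._coords, self._offsets


def extract_ragged_arrays(obj):
    """Consumer side: accept any object that implements the hook."""
    hook = getattr(obj, "__geo_arrow_arrays__", None)
    if hook is None:
        raise TypeError(
            f"{type(obj).__name__} does not expose __geo_arrow_arrays__"
        )
    return hook()


# One square ring stored as interleaved x, y values with a single offset level.
column = TinyPolygonColumn(
    [0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0], ([0, 8],)
)
coords, offsets = extract_ragged_arrays(column)
```

The consumer only checks for the method, in the same duck-typed spirit as `__geo_interface__`, so it would not need to import or know about the producing library.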
Top GitHub Comments
Perfect, thanks! If I can tell people to use GeoPandas for all their 2D planar shapes regardless of what they are, then I am very happy for Datashader to work directly with whatever the rawest form of coordinate access GeoPandas can provide as the way to work with ragged shapes using Numba and Dask. (Non-ragged shapes like dense n-D arrays of same-length lines can already be supported by xarray and numpy.) Excellent!
@ablythed thanks for the reminder. I started a draft at the time, but now finished it up. I updated the top post.