Support GeoPandas GeoDataFrames

Context:

- Datashader requires plain arrays of coordinates (“ragged arrays” in the case of lines/polygons, with additional offset/index arrays) to visualize geometries efficiently. This is also the representation that SpatialPandas uses under the hood.
- GeoPandas stores geometries as “opaque” Shapely objects (each wrapping a GEOS C++ object), and converting these to an array of coordinates is currently not always very efficient. It is certainly possible, though: it is what spatialpandas does in its from_geopandas conversion code.

With the latest release of PyGEOS, the conversion from geometries to (ragged) coordinate arrays can be done much more efficiently, though.

Function using PyGEOS to convert an array of GEOS geometries to arrays of coordinates/offsets (and to put those into a spatialpandas array):

import numpy as np
import pyarrow as pa
import pygeos
import spatialpandas.geometry


def get_flat_coords_offset_arrays(arr):
    """
    Version for MultiPolygon data
    """
    # explode/flatten the MultiPolygons
    arr_flat, part_indices = pygeos.get_parts(arr, return_index=True)
    # the offsets into the multipolygon parts
    # (minlength keeps one count per input geometry, even if trailing ones are empty)
    offsets1 = np.insert(np.bincount(part_indices, minlength=len(arr)).cumsum(), 0, 0)

    # explode/flatten the Polygons into Rings
    arr_flat2, ring_indices = pygeos.geometry.get_rings(arr_flat, return_index=True)
    # the offsets into the exterior/interior rings of the multipolygon parts
    offsets2 = np.insert(np.bincount(ring_indices, minlength=len(arr_flat)).cumsum(), 0, 0)

    # the coords and offsets into the coordinates of the rings
    coords, indices = pygeos.get_coordinates(arr_flat2, return_index=True)
    offsets3 = np.insert(np.bincount(indices, minlength=len(arr_flat2)).cumsum(), 0, 0)

    return coords, offsets1, offsets2, offsets3

def spatialpandas_from_pygeos(arr):
    """
    Create the actual spatialpandas MultiPolygonArray by putting the individual arrays
    into a pyarrow ListArray
    """
    coords, offsets1, offsets2, offsets3 = get_flat_coords_offset_arrays(arr)
    coords_flat = coords.ravel()
    offsets3 *= 2
    
    # create a pyarrow array from this
    _parr3 = pa.ListArray.from_arrays(pa.array(offsets3), pa.array(coords_flat))
    _parr2 = pa.ListArray.from_arrays(pa.array(offsets2), _parr3)
    parr = pa.ListArray.from_arrays(pa.array(offsets1), _parr2)
    
    return spatialpandas.geometry.MultiPolygonArray(parr)
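
To make the layout of those three offset arrays concrete, here is a minimal pure-Python sketch of the same flattening on nested lists. It is only an illustration of the ragged-array idea, not the PyGEOS implementation: the helper names (`to_offsets`, `flatten_multipolygons`) and the toy data are made up for this example, and the coordinates are interleaved up front, so the innermost offsets already count flat x/y values (the equivalent of `offsets3 *= 2` above).

```python
from itertools import accumulate


def to_offsets(counts):
    """Turn per-item counts into offsets: cumulative sums with a leading 0."""
    return [0, *accumulate(counts)]


def flatten_multipolygons(multipolygons):
    """Flatten nested [multipolygon][polygon][ring][(x, y)] lists (hypothetical helper)."""
    coords = []        # interleaved x0, y0, x1, y1, ...
    part_counts = []   # polygons per multipolygon
    ring_counts = []   # rings per polygon
    value_counts = []  # flat values (2 per vertex) per ring
    for multipolygon in multipolygons:
        part_counts.append(len(multipolygon))
        for polygon in multipolygon:
            ring_counts.append(len(polygon))
            for ring in polygon:
                value_counts.append(2 * len(ring))
                for x, y in ring:
                    coords.extend((x, y))
    return (
        coords,
        to_offsets(part_counts),
        to_offsets(ring_counts),
        to_offsets(value_counts),
    )


square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (0.0, 0.0)]
triangle = [(2.0, 2.0), (3.0, 2.0), (2.0, 3.0), (2.0, 2.0)]
data = [
    [[square], [triangle]],  # multipolygon made of two single-ring polygons
    [[triangle]],            # multipolygon made of one polygon
]
coords, offsets1, offsets2, offsets3 = flatten_multipolygons(data)
print(offsets1, offsets2, offsets3)
```

Each offsets array has one more entry than the level it describes, so entry `i` and entry `i + 1` delimit the children of item `i` in the next level down.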

With such a faster conversion available, it becomes more interesting for Datashader to directly support geopandas.GeoDataFrame, instead of requiring an up-front conversion to spatialpandas.GeoDataFrame. Currently, the spatialpandas requirement is hardcoded here (for polygons()):

https://github.com/holoviz/datashader/blob/1ae52b65ec8a79920e5db9c6c04487f254428553/datashader/core.py#L694-L701

Adding support for GeoPandas can be done, using the function I defined above, with something like (leaving aside imports of geopandas/pygeos):

    from spatialpandas import GeoDataFrame
    from spatialpandas.dask import DaskGeoDataFrame
    if isinstance(source, DaskGeoDataFrame):
        # Downselect partitions to those that may contain polygons in viewport
        x_range = self.x_range if self.x_range is not None else (None, None)
        y_range = self.y_range if self.y_range is not None else (None, None)
        source = source.cx_partitions[slice(*x_range), slice(*y_range)]
+   elif isinstance(source, geopandas.GeoDataFrame):
+       # Downselect actual rows to those for which the polygon is in viewport
+       x_range = self.x_range if self.x_range is not None else (None, None)
+       y_range = self.y_range if self.y_range is not None else (None, None)
+       source = source.cx[slice(*x_range), slice(*y_range)]
+       # Convert the subset to ragged array format of spatialpandas
+       geometries = spatialpandas_from_pygeos(source.geometry.array.data)
+       source = pd.DataFrame(source)
+       source["geometry"] = geometries
    elif not isinstance(source, GeoDataFrame):
        raise ValueError(
            "source must be an instance of spatialpandas.GeoDataFrame or \n"

This patch is what I tried in the following notebook, first using a smaller countries/provinces dataset from NaturalEarth, and then with a larger NYC building footprints dataset (similar to https://examples.pyviz.org/nyc_buildings/nyc_buildings.html).

Notebook: https://nbviewer.jupyter.org/gist/jorisvandenbossche/3e7ce14cb5118daa0f6097d686981c9f

Some observations:

- This actually works nicely!
- Initial rendering with Datashader is a bit slower when using geopandas.GeoDataFrame directly, because of the extra conversion step. But the conversion takes less time than the actual rendering, so it’s only a relatively small slowdown.
- Zooming into small areas is really fast with the large dataset, and actually faster than with spatialpandas.GeoDataFrame, because I added a .cx spatial subsetting step in my patch above that filters the data before rendering. For spatialpandas, such subsetting is only done in the dask version.
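
The effect of that viewport subsetting can be sketched without any geospatial library. The snippet below is a simplified stand-in, assuming each row is reduced to its bounding box; as far as I understand, GeoPandas’ `.cx` tests the actual geometries against the box rather than only their bounds. The function name `in_viewport` and the sample rows are hypothetical. As in the patch, a range of `None` means that axis is unbounded.

```python
def in_viewport(bounds, x_range=None, y_range=None):
    """Return True if a (minx, miny, maxx, maxy) box overlaps the viewport.

    A range of None (or a None endpoint) leaves that side unbounded,
    mirroring slice(None, None) in the patch.
    """
    xmin, xmax = x_range if x_range is not None else (None, None)
    ymin, ymax = y_range if y_range is not None else (None, None)
    minx, miny, maxx, maxy = bounds
    if xmin is not None and maxx < xmin:
        return False
    if xmax is not None and minx > xmax:
        return False
    if ymin is not None and maxy < ymin:
        return False
    if ymax is not None and miny > ymax:
        return False
    return True


# Hypothetical rows: name -> bounding box of the row's geometry
rows = {
    "a": (0.0, 0.0, 1.0, 1.0),
    "b": (5.0, 5.0, 6.0, 6.0),
    "c": (0.5, 0.5, 5.5, 5.5),
}
visible = [name for name, box in rows.items() if in_viewport(box, (0, 2), (0, 2))]
everything = [name for name, box in rows.items() if in_viewport(box)]
print(visible, everything)
```

Only the rows that survive this filter need to be converted to ragged arrays, which is why zooming into a small area shrinks both the conversion and the rendering work.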

GIF of the notebook in action (the buildings dataset is fully loaded in memory, and not parallelized with dask, unlike the PyViz gallery example), interactively zooming into a GeoPandas dataframe with Datashader and Holoviews:

[GIF: Peek 2021-06-08 13-45]

(note this was done a bit manually with Holoviews DynamicMap and a callback with Datashader code, because the integrated datashade functionality of Holoviews/HvPlot wouldn’t preserve the geopandas.GeoDataFrame with the current versions)

So, what’s the way forward here? I think I showed that it can be useful for Datashader to directly support GeoPandas, and that it can be done with a relatively small change to Datashader. The big question, though, is about the “GEOS -> ragged coordinate arrays -> spatialpandas array” conversion. Where should this live, and how should Datashader and GeoPandas interact?

Some initial thoughts about this:

- The quickest way to get this working is to do the conversion in Datashader: the above functions only rely on pygeos (which the user will already have when using GeoPandas) and on pyarrow/spatialpandas (which are already requirements for this part of Datashader). But, long term, is this code that Datashader wants to maintain? Or is there a more logical place for it?
- It could also live in SpatialPandas, since it already has code for the GeoPandas <-> SpatialPandas conversion (and this would optimize its current implementation). But should a user need to have SpatialPandas installed to plot GeoPandas data with Datashader? (See also the last bullet point.)
- Alternatively, GeoPandas could add a function or method that converts its geometries into the required format, and Datashader could then call it to get the data it needs. Long term, this might be the better solution, since other projects interacting with GeoPandas might also want the geometries in this format.
- How to communicate this data? In the current POC version included above, I first get the raw coordinate and offset arrays, then put them into a pyarrow.ListArray, and then convert that to a spatialpandas MultiPolygonArray. But in the end, what Datashader needs is only the raw coordinate and offset arrays.
  For example, for rendering polygons, you access .buffer_values and .buffer_offsets of the MultiPolygonArray, which give back the raw coordinate and offset arrays. So in theory, this round trip through pyarrow and spatialpandas is not needed: some method could convert GeoPandas geometries into coordinate/offset arrays that Datashader handles directly. This would, however, require somewhat more changes in Datashader in how data gets passed down from Canvas.polygons() into the glyph rendering (as that currently uses the spatialpandas array as the container for the coordinates/offsets).
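
To illustrate why the raw arrays are enough on their own, here is a small pure-Python sketch of a consumer that works directly on a flat coordinate list plus three offset lists, with no pyarrow or spatialpandas container in between. It computes one bounding box per multipolygon; this is not Datashader’s rendering code, and the function name and the hard-coded toy arrays are assumptions made for the example.

```python
def multipolygon_bounds(coords, offsets1, offsets2, offsets3):
    """Bounding box of each multipolygon, read straight from the ragged arrays.

    coords   : interleaved x0, y0, x1, y1, ... values
    offsets1 : multipolygon -> polygon offsets
    offsets2 : polygon -> ring offsets
    offsets3 : ring -> flat coordinate value offsets
    """
    bounds = []
    for i in range(len(offsets1) - 1):
        # Chase the offsets down the three levels to find the coordinate slice
        first_ring = offsets2[offsets1[i]]
        last_ring = offsets2[offsets1[i + 1]]
        start, stop = offsets3[first_ring], offsets3[last_ring]
        if start == stop:
            bounds.append(None)  # empty geometry
            continue
        xs = coords[start:stop:2]
        ys = coords[start + 1:stop:2]
        bounds.append((min(xs), min(ys), max(xs), max(ys)))
    return bounds


# Toy data: a square and a triangle in the first multipolygon, a triangle in the second
coords = [
    0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0,  # square ring
    2.0, 2.0, 3.0, 2.0, 2.0, 3.0, 2.0, 2.0,            # triangle ring
    2.0, 2.0, 3.0, 2.0, 2.0, 3.0, 2.0, 2.0,            # triangle ring
]
offsets1 = [0, 2, 3]
offsets2 = [0, 1, 2, 3]
offsets3 = [0, 10, 18, 26]
result = multipolygon_bounds(coords, offsets1, offsets2, offsets3)
print(result)
```

Because the coordinates of one multipolygon are contiguous, a single slice covers all of its rings, which is the property that makes this layout convenient for array-oriented code.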

One possible idea (relating to the third bullet point) is to standardize on some kind of __geo_arrow_arrays__ interface (returning the coordinate and offset arrays), similar to the existing __geo_interface__, which returns the geometries as a GeoJSON-like dictionary (and which can already be used to accept any “geometry-like” object, even from libraries you don’t know).
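
As a rough sketch of what such a protocol could look like from the consumer’s side: `__geo_arrow_arrays__` does not exist in any library today, and both the return shape used here (a `(coords, offsets)` tuple for line-like data) and the class and function names are assumptions made purely for illustration.

```python
class LineColumn:
    """Hypothetical geometry container that opts in to the proposed protocol."""

    def __init__(self, lines):
        self._lines = lines  # list of lines, each a list of (x, y) tuples

    def __geo_arrow_arrays__(self):
        # Assumed return shape: interleaved coordinates plus one offsets list
        coords, offsets = [], [0]
        for line in self._lines:
            for x, y in line:
                coords.extend((x, y))
            offsets.append(len(coords))
        return coords, offsets


def as_ragged_arrays(obj):
    """Duck-typed consumer: accept any object that implements the protocol."""
    method = getattr(obj, "__geo_arrow_arrays__", None)
    if method is None:
        raise TypeError(
            f"{type(obj).__name__} does not implement __geo_arrow_arrays__"
        )
    return method()


column = LineColumn([[(0, 0), (1, 1)], [(2, 2), (3, 3), (4, 4)]])
coords, offsets = as_ragged_arrays(column)
print(coords, offsets)
```

The point of the design is that the consumer never imports the producer’s library: it only checks for the method, in the same way that `__geo_interface__` consumers only check for that attribute.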

Issue Analytics

Created: 2 years ago
Reactions: 9
Comments: 7 (2 by maintainers)

Top GitHub Comments

jbednar commented, Jun 15, 2021

Perfect, thanks! If I can tell people to use GeoPandas for all their 2D planar shapes regardless of what they are, then I am very happy for Datashader to work directly with whatever the rawest form of coordinate access GeoPandas can provide as the way to work with ragged shapes using Numba and Dask. (Non-ragged shapes like dense n-D arrays of same-length lines can already be supported by xarray and numpy.) Excellent!

jorisvandenbossche commented, Jun 8, 2021

@ablythed thanks for the reminder. I started a draft at the time, but now finished it up. I updated the top post.
