API: How to deal with different spatial index implementations?
Now that we are starting to use PyGEOS in GeoPandas (with https://github.com/geopandas/geopandas/pull/1154 merged), one of the obvious follow-up items is to use the STRtree implementation of PyGEOS, especially the bulk query for the sjoin implementation.
However, creating the spatial index is costly, which is why we cache it in the sindex property on a GeoSeries/GeoDataFrame. This also means that, in the current API, caching the PyGEOS STRtree in this property would require it to replace the rtree spatial index.
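To make the caching point concrete, here is a minimal sketch of a lazily built, cached index property. `BoxSeries` and `BoxIndex` are hypothetical stand-ins (not geopandas classes), and the use of `functools.cached_property` is an assumption for illustration, not how geopandas implements `sindex`.

```python
from functools import cached_property


class BoxIndex:
    """Stand-in for a spatial index; building it is the expensive step."""

    builds = 0  # counts constructions, only to make the caching visible

    def __init__(self, boxes):
        BoxIndex.builds += 1
        self.boxes = list(boxes)


class BoxSeries:
    """Hypothetical container of (minx, miny, maxx, maxy) tuples."""

    def __init__(self, boxes):
        self.boxes = list(boxes)

    @cached_property
    def sindex(self):
        # Built on first access, then stored on the instance and reused.
        return BoxIndex(self.boxes)


series = BoxSeries([(0, 0, 1, 1), (2, 2, 3, 3)])
first = series.sindex
second = series.sindex  # no second build; the cached object is returned
```

Because the property holds a single cached object, swapping in a different index type means replacing whatever that property returns, which is the constraint described above.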
Since multiple spatial index / tree implementations exist, and they typically have different trade-offs (e.g. creation speed vs. query speed, optimized for certain types of geometries, …), this brings up some questions:
- Do we simply want to replace the rtree with pygeos for the sindex property?
- Or, do we want to rethink the API to allow for multiple (pluggable?) spatial index implementations?
For the first option, I think we need some more extensive testing to ensure the GEOS STRtree is actually faster than libspatialindex / rtree (we could still merge something optional to make testing easier, though).
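One way such testing could be set up is to time index construction and querying separately, since the two implementations may trade one off against the other. The sketch below is only a generic harness: `time_index` is a hypothetical helper, and the "index" here is a plain sorted list standing in for a real tree.

```python
import time


def time_index(build, query, data, probes):
    """Return (build_seconds, query_seconds) for one index implementation."""
    start = time.perf_counter()
    index = build(data)
    built = time.perf_counter()
    for probe in probes:
        query(index, probe)
    done = time.perf_counter()
    return built - start, done - built


# Stand-in "index": a sorted list of numbers queried by linear scan.
data = list(range(1000, 0, -1))
build_s, query_s = time_index(
    build=sorted,
    query=lambda idx, lo: [v for v in idx if v >= lo],
    data=data,
    probes=[10, 500, 990],
)
```

Passing the same `data` and `probes` to the harness with different `build` and `query` callables would give comparable numbers for each candidate implementation.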
But this might also be a good opportunity to rethink the API more generally, because besides GEOS and rtree there might be other spatial indexes that people want to use with geopandas.
Issue Analytics
- Created 3 years ago
- Comments: 20 (20 by maintainers)
Top GitHub Comments
I really like the idea of creating a stand-alone spatial index provider package. It will bring some complexity, but I don’t think it would be that hard to define a clear API linking geopandas to a new sindex package. Each spatial index class (STRtree, rtree, KDtree, maybe h3?) can specify whether we should give it bounding boxes of geometries (and which kinds of geometries) and which methods are available. It would work a bit like fiona, which provides an API to different drivers, each of which allows different things (e.g. layers, data types). The package would take care of wrapping them and providing a consistent API linked to geopandas.
I like the flexibility that could come with it. We could, in theory, switch between STRtree and KDtree for nearest-neighbour search based on geometry type, etc., using one API, without needing to figure out how to build and query each tree, because the new package would take care of tree building under the hood and unify the query API for users.
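A rough sketch of what such a wrapper could look like, under heavy assumptions: `SpatialIndexProvider`, `BruteForceIndex`, `PROVIDERS`, `build_index` and the `takes_bounds` attribute are all hypothetical names invented for illustration, and a linear scan stands in for a real tree.

```python
from abc import ABC, abstractmethod


def boxes_intersect(a, b):
    """True if two (minx, miny, maxx, maxy) boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


class SpatialIndexProvider(ABC):
    """Hypothetical common interface; not an existing geopandas API."""

    # Whether the provider is built from bounding boxes or full geometries.
    takes_bounds = True

    @abstractmethod
    def query(self, box):
        """Return positions of indexed items whose box intersects `box`."""


class BruteForceIndex(SpatialIndexProvider):
    """Linear scan standing in for a real tree such as STRtree or rtree."""

    def __init__(self, boxes):
        self.boxes = list(boxes)

    def query(self, box):
        return [i for i, b in enumerate(self.boxes) if boxes_intersect(b, box)]


# Registry mapping a provider name to its class (names are illustrative).
PROVIDERS = {"bruteforce": BruteForceIndex}


def build_index(boxes, provider="bruteforce"):
    """Build an index with the named provider behind one entry point."""
    return PROVIDERS[provider](boxes)


index = build_index([(0, 0, 1, 1), (2, 2, 3, 3), (0.5, 0.5, 2.5, 2.5)])
hits = index.query((0, 0, 0.6, 0.6))
```

The point of the registry is that callers only ever see `build_index` and `query`; which tree sits behind them could then be chosen per dataset or per operation.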
I like the idea of pluggable spatial index implementations, though it brings with it API and implementation complexity. Agreed on different index implementations being theoretically more optimal for different datasets, though until they are similarly vectorized, those vectorized in C / Cython have a distinct advantage over those where the looping is in Python.
In thinking this through, could this lead to creating a stand-alone rtree spatial index provider (i.e., new package outside geopandas), which provides the implementation of constructing and querying the index? That seems like it might open up the opportunity there to better optimize the performance of certain index operations, such as perhaps vectorizing index queries in Cython instead of looping in Python. The appeal would be that it moves the implementation of that outside geopandas.
At minimum, it seems like an index provider would need to implement:
Haven’t fully thought through how that approach would need to handle compatibility of the underlying geometry objects; that is certain to bring with it some challenges.
In all my (limited) testing so far, now that we have STRtree::query_bulk that uses prepared geometries and predicates in pygeos, it has been much faster than comparable sjoin operations using rtree in geopandas.
One thing for us to test: so far I have not noticed a significant performance hit from construction of the STRtree, even for larger sets of geometries. As described, the tree isn’t constructed until the first query, but so far it looks like subsequent queries take about as long as the first.
This is in contrast to the existing sindex implementation, whose construction does take a very noticeable amount of time.
Also worth considering in this API: how the tree is used for adjacency queries like nearest neighbor. Not all spatial index implementations provide or are optimal for those operations, but perhaps the index provider could indicate that it supports them, so that we can use them if available, rather than constructing a tree again from scratch for a nearest operation separately from an sjoin operation. Of course, if trees are fast to create, then that reuse is less important.
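As a sketch of how a provider might advertise optional operations, the example below uses a `capabilities` set that the caller checks before deciding whether to reuse the index. The class names, the `capabilities` attribute and `nearest_position` are all hypothetical; both code paths use a linear scan purely to keep the example self-contained.

```python
import math


class BoxOnlyIndex:
    """Hypothetical provider that supports intersection queries only."""

    capabilities = frozenset({"query"})

    def __init__(self, points):
        self.points = list(points)


class NearestCapableIndex(BoxOnlyIndex):
    """Hypothetical provider that also advertises a nearest operation."""

    capabilities = frozenset({"query", "nearest"})

    def nearest(self, point):
        # Linear scan; a real provider would use its tree here.
        return min(range(len(self.points)),
                   key=lambda i: math.dist(self.points[i], point))


def nearest_position(index, point):
    """Return (position, used_index): reuse the index if it supports nearest."""
    if "nearest" in index.capabilities:
        return index.nearest(point), True
    # Fallback that does not rely on the index at all.
    position = min(range(len(index.points)),
                   key=lambda i: math.dist(index.points[i], point))
    return position, False


points = [(0, 0), (5, 5), (2, 1)]
with_support = nearest_position(NearestCapableIndex(points), (2, 2))
without_support = nearest_position(BoxOnlyIndex(points), (2, 2))
```

Both calls find the same nearest point; the second element of each result only records whether the cached index could be reused or a separate path was needed.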