API: How to deal with different spatial index implementations?


Now that we are starting to use PyGEOS in GeoPandas (with https://github.com/geopandas/geopandas/pull/1154 merged), one of the obvious follow-up items is to use the STRtree implementation of PyGEOS, especially the bulk query for the sjoin implementation.

However, creating the spatial index is costly, which is why we cache it in the sindex property on a GeoSeries/GeoDataFrame. This also means that, in the current API, caching the PyGEOS STRtree in this property would require it to replace the rtree spatial index.
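To make the caching concern concrete, here is a minimal sketch of a lazily built, cached index property. `BruteForceIndex` and `Geometries` are hypothetical stand-ins invented for illustration; they are not the GeoPandas, rtree, or PyGEOS classes, and the real `sindex` property may be implemented differently.

```python
class BruteForceIndex:
    """Hypothetical stand-in for an index that is expensive to build."""

    builds = 0  # counts constructions so the effect of caching is visible

    def __init__(self, boxes):
        BruteForceIndex.builds += 1
        self.boxes = list(boxes)


class Geometries:
    """Toy container that builds its index once and then reuses it."""

    def __init__(self, boxes):
        self._boxes = list(boxes)
        self._sindex = None  # nothing is built until first access

    @property
    def sindex(self):
        # Build on first access only; later accesses return the cached object.
        if self._sindex is None:
            self._sindex = BruteForceIndex(self._boxes)
        return self._sindex


geoms = Geometries([(0, 0, 1, 1), (2, 2, 3, 3)])
first = geoms.sindex
second = geoms.sindex
```

Because the property holds exactly one cached object, a single slot like this can only hold one index implementation at a time, which is the constraint the questions below are about.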

Since multiple spatial index / tree implementations exist, and they typically have different trade-offs (e.g. creation speed vs. query speed, optimization for certain types of geometries, …), this brings up some questions:

  • Do we simply want to replace the rtree with pygeos for the sindex property?
  • Or, do we want to rethink the API to allow for multiple (pluggable?) spatial index implementations?

For the first option, I think we need more extensive testing to ensure the GEOS STRtree is actually faster than libspatialindex / rtree (we could still merge something optional to make testing easier, though). But this might also be a good opportunity to rethink the API more generally, because besides GEOS and rtree there might be other spatial indexes that people want to use with geopandas.

cc @brendan-ward @adriangb

Issue Analytics

  • State: open
  • Created 3 years ago
  • Comments: 20 (20 by maintainers)

Top GitHub Comments

1 reaction
martinfleis commented, Mar 24, 2020

I really like the idea of creating a stand-alone spatial index provider package. It will bring some complexity, but I don’t think it would be that hard to define a clear API linking geopandas to the new sindex package. Each spatial index class (STRtree, rtree, KDtree, maybe h3?) can specify whether we should give it bounding boxes of geometries (and which kinds of geometries) and which methods are available. It would work a bit like fiona, which provides an API to different drivers, each of which allows different things (e.g. layers, data types). The package would take care of wrapping them and providing a consistent API linked to geopandas.

I like the flexibility this could bring. In theory, we could switch between STRtree and KDtree in a nearest-neighbour search based on geometry type, etc., all through one API and without having to figure out how to build and query each tree, because the new package would handle tree building under the hood and unify the query API for users.
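One way to picture this idea is a small table of declared capabilities plus a chooser function. The sketch below is only an illustration: `IndexCapabilities`, `CAPABILITIES`, and `pick_index` are hypothetical names, and the capability values are made up for the example rather than taken from any real library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexCapabilities:
    """What a (hypothetical) index backend declares about itself."""

    name: str
    accepts_bounding_boxes: bool
    geometry_types: frozenset
    methods: frozenset


# Illustrative values only; not measured properties of real backends.
CAPABILITIES = [
    IndexCapabilities("kdtree", False, frozenset({"Point"}), frozenset({"nearest"})),
    IndexCapabilities(
        "strtree",
        True,
        frozenset({"Point", "LineString", "Polygon"}),
        frozenset({"query", "query_bulk", "nearest"}),
    ),
]


def pick_index(geom_types, method):
    """Return the first backend that offers `method` for all given geometry types."""
    for caps in CAPABILITIES:
        if method in caps.methods and set(geom_types) <= caps.geometry_types:
            return caps.name
    raise LookupError(f"no index backend supports {method!r} for {sorted(geom_types)}")


points_only = pick_index({"Point"}, "nearest")
mixed = pick_index({"Point", "Polygon"}, "nearest")
```

With these made-up declarations, a point-only nearest search selects the KD-tree entry, while mixed geometries fall through to the STRtree entry.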

1 reaction
brendan-ward commented, Mar 24, 2020

I like the idea of pluggable spatial index implementations, though it brings with it API and implementation complexity. Agreed on different index implementations being theoretically more optimal for different datasets, though until they are similarly vectorized, those vectorized in C / Cython have a distinct advantage over those where the looping is in Python.

In thinking this through, could this lead to creating a stand-alone rtree spatial index provider (i.e., new package outside geopandas), which provides the implementation of constructing and querying the index? That seems like it might open up the opportunity there to better optimize the performance of certain index operations, such as perhaps vectorizing index queries in Cython instead of looping in Python. The appeal would be that it moves the implementation of that outside geopandas.

At minimum, it seems like an index provider would need to implement:

  • tree construction from geometries or bounding boxes
  • singular tree query given a source geometry or bounding box, plus optional spatial predicate
  • bulk tree query given a 1D array of source geometries or bounding boxes, plus optional spatial predicate
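The three requirements above could be sketched as an abstract base class with a brute-force reference implementation. Everything here is an assumption made for illustration: the names `SpatialIndexProvider` and `BruteForceProvider`, the use of `(minx, miny, maxx, maxy)` tuples instead of real geometry objects, the two supported predicates, and the choice to return bulk results as `(input_index, tree_index)` pairs.

```python
from abc import ABC, abstractmethod


def boxes_intersect(a, b):
    """True if boxes (minx, miny, maxx, maxy) overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def box_within(a, b):
    """True if box a lies entirely inside box b."""
    return b[0] <= a[0] and b[1] <= a[1] and a[2] <= b[2] and a[3] <= b[3]


# With no predicate, fall back to a plain bounding-box intersection test.
PREDICATES = {None: boxes_intersect, "intersects": boxes_intersect, "within": box_within}


class SpatialIndexProvider(ABC):
    """Hypothetical minimal interface: construct, query, query_bulk."""

    @abstractmethod
    def __init__(self, boxes):
        """Build the tree from bounding boxes."""

    @abstractmethod
    def query(self, box, predicate=None):
        """Return indices of tree items matching a single source box."""

    def query_bulk(self, boxes, predicate=None):
        """Return (input_index, tree_index) pairs for a sequence of source boxes."""
        return [(i, j) for i, box in enumerate(boxes) for j in self.query(box, predicate)]


class BruteForceProvider(SpatialIndexProvider):
    """Reference implementation that scans every item; no real tree is built."""

    def __init__(self, boxes):
        self._boxes = list(boxes)

    def query(self, box, predicate=None):
        test = PREDICATES[predicate]
        return [j for j, candidate in enumerate(self._boxes) if test(box, candidate)]


tree = BruteForceProvider([(0, 0, 2, 2), (5, 5, 6, 6), (1, 1, 10, 10)])
hits = tree.query((1, 1, 1.5, 1.5))
pairs = tree.query_bulk([(1, 1, 1.5, 1.5), (5.5, 5.5, 5.6, 5.6)], predicate="within")
```

The default `query_bulk` loops in Python, which is exactly the pattern a vectorized backend would want to override.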

Haven’t fully thought through how that approach would need to handle compatibility of the underlying geometry objects; that is certain to bring with it some challenges.

In all my (limited) testing so far, now that we have STRtree::query_bulk that uses prepared geometries and predicates in pygeos, it has been much faster than comparable sjoin operations using rtree in geopandas.

One thing for us to test: so far I have not noticed a significant performance hit from construction of the STRtree, even for larger sets of geometries. As described, the tree isn’t constructed until the first query, but so far it looks like subsequent queries take about as long as the first.

This is in contrast to the existing sindex implementation, whose construction does take a very noticeable amount of time.

Also worth considering in this API: how the tree is used for adjacency queries like nearest neighbor. Not all spatial index implementations provide or are optimal for those operations, but perhaps the index provider could indicate that it supports them, so that we can use them if available, rather than constructing a tree again from scratch for a nearest operation separately from an sjoin operation. Of course, if trees are fast to create, then that reuse is less important.
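A rough sketch of that reuse-or-rebuild decision follows. The `supports_nearest` flag, the `BoxOnlyIndex` and `PointIndex` classes, and the `nearest` helper are all hypothetical, and the linear scan stands in for a real tree traversal.

```python
import math


class BoxOnlyIndex:
    """Hypothetical index that offers no nearest-neighbor query."""

    supports_nearest = False

    def __init__(self, points):
        self.points = list(points)


class PointIndex(BoxOnlyIndex):
    """Hypothetical index that does support nearest-neighbor queries."""

    supports_nearest = True

    def nearest(self, point):
        # Linear scan for illustration; a real tree would prune the search.
        return min(range(len(self.points)), key=lambda i: math.dist(point, self.points[i]))


def nearest(index, point):
    """Reuse the existing index when it supports nearest, else build a fallback."""
    if getattr(index, "supports_nearest", False):
        return index.nearest(point), "reused"
    fallback = PointIndex(index.points)  # second construction, the cost to avoid
    return fallback.nearest(point), "rebuilt"


reused = nearest(PointIndex([(0, 0), (5, 5)]), (4, 4))
rebuilt = nearest(BoxOnlyIndex([(0, 0), (5, 5)]), (1, 1))
```

The second return value is only there to show which path was taken.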
