Integration of PyGEOS and Shapely
For context, see also #501 on vectorizing all Shapely functions. More specific context: PyGEOS (https://github.com/pygeos/pygeos/, https://caspervdw.github.io/Introducing-Pygeos/) is a new Python package that exposes geospatial operations from GEOS, similarly to Shapely, but through fast, vectorized (numpy-based) functions. It gives a significant speed-up for operations on arrays (5x to 100x, depending on the kind of operation), opens up the possibility of parallelizing calls to pygeos functions, and will provide a needed performance boost to GeoPandas and others who work with arrays of geometries.
It has a different initial focus (performance) but clearly also has a lot of overlap with Shapely. We have been discussing offline with a few people what a possible integration between both packages could look like, but thought to open an issue here as well to have a public record of it.
To quote Sean: The unanswered question is: how do we organize GeoPandas, Shapely, and PyGEOS to maximize the benefits for the projects, their developers, and their users while also distributing the costs in an equitable way?
Can we integrate the PyGEOS functionality fully into Shapely, avoiding the need for a separate library? Or would Shapely want to use PyGEOS as a dependency for its interaction with the GEOS library instead of its ctypes approach? Or, if keeping separate libraries, how do we ensure the best possible interoperability?
Below, I am trying to summarize some of the discussion points, partly from my (biased) viewpoint as GeoPandas developer.
Stumbling blocks / disadvantages for moving pygeos into Shapely:
- It introduces an additional (hard) dependency on numpy
- It will make part of the codebase more complex / less approachable for python programmers (the core GEOS wrappers are written in C in pygeos), and will require a C compiler to install from source.
- It expands the scope of Shapely (array functions instead of only scalars)
- It will require a decent effort (and resources are finite), leaving less time for other potential new features (fixed precision model?)
Advantages of integrating pygeos into Shapely:
- Expanding a familiar library for users instead of introducing a new one
- No split in the python geos community: having a single Geometry object in our ecosystem instead of two not-fully-interoperable Geometry objects
- Removing the duplicate implementation of a minor part of shapely (the faster cython functions vs the main ctypes based functionality)
- Improving the performance of direct Shapely users
And a few aspects that are less clear advantage/disadvantage:
- Getting rid of ctypes use in Shapely. On the one hand, ctypes keeps Shapely itself pure Python and less complex, but on the other hand it also has disadvantages, as far as I understood (eg related to dynamic loading of GEOS?)
- Fixing small inconsistencies / design issues in Shapely. On the one hand, this will introduce some behaviour changes for the user. But on the other hand, if done well (with proper deprecation + major version bump), it could clean up some inconsistencies in Shapely (eg mutability of geometries, inconsistencies in empty geometries, cfr https://github.com/Toblerity/Shapely/issues/742)
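To make the "dynamic loading" point in the ctypes item above concrete: with ctypes, the GEOS C library is located and loaded at runtime rather than linked at build time. The following is only a minimal sketch of that idea, not Shapely's actual loading code. The helper name `load_geos_version` is hypothetical, and the sketch assumes the library can be found under the name `geos_c`. `GEOSversion` is a function from the GEOS C API.

```python
import ctypes
import ctypes.util


def load_geos_version():
    """Return the GEOS version string, or None if GEOS cannot be loaded.

    Hypothetical helper for illustration; the library is resolved at
    runtime, so the result depends on what is installed on the machine.
    """
    # Ask the platform's lookup mechanism for the GEOS C library.
    path = ctypes.util.find_library("geos_c")
    if path is None:
        return None
    try:
        lib = ctypes.CDLL(path)
    except OSError:
        # The library was found but could not be loaded.
        return None
    # const char *GEOSversion(void) in the GEOS C API.
    lib.GEOSversion.restype = ctypes.c_char_p
    lib.GEOSversion.argtypes = []
    return lib.GEOSversion().decode("ascii", errors="replace")


if __name__ == "__main__":
    print(load_geos_version())
```

Which GEOS build gets picked up here is decided at import time on the user's machine, which is one way runtime loading can behave differently from a compiled extension linked against a known GEOS.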
Or, if not directly integrating the full functionality of PyGEOS into Shapely, are there ways to integrate the Geometry type better and ensure good interoperability / exchangeability ?
- PyGEOS functions could probably be made to work on Shapely geometries as well, with limited changes in Shapely. For PyGEOS, the pointer to the GEOS object that is held in a Python Geometry object needs to be accessible from C as a static attribute of the Python object (an attribute of the C struct that makes up a Python object). Such a basic extension type with those features could be added in Shapely, while keeping the rest of the implementation in Python. However, that would already introduce the requirement of needing a C compiler to install Shapely from source. It is also not fully clear to me if it is safe to share pointers between both packages if they use a GEOS that is potentially built with a different toolchain.
- While receiving Shapely objects in PyGEOS functions could potentially work (as described above), this still introduces the discrepancy that PyGEOS functions that return new geometries give a different kind of geometry object (this base extension type) than what they accepted as input.
- For the broader ecosystem, interoperability could use the __geo_interface__. However, not all packages will support this to recognize geometries (I suppose many packages expect Shapely objects, and actually Shapely itself also does not support it as input in the Geometry methods), and this also has a lot of overhead. I would argue that this is mainly useful for packages that explicitly do not want to depend on GEOS (through Shapely or PyGEOS).
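As an illustration of the `__geo_interface__` route and where its overhead comes from, here is a minimal pure-Python sketch. `SimplePoint`, `as_mapping` and `bounds` are hypothetical names made up for this example; they are not part of Shapely or PyGEOS. The only thing assumed from the protocol is that an object exposes a GeoJSON-like mapping with `type` and `coordinates` keys.

```python
class SimplePoint:
    """Hypothetical geometry-like object; not a Shapely or PyGEOS class."""

    def __init__(self, x, y):
        self.x = float(x)
        self.y = float(y)

    @property
    def __geo_interface__(self):
        # GeoJSON-like mapping: every access builds new Python objects.
        return {"type": "Point", "coordinates": (self.x, self.y)}


def as_mapping(obj):
    """Return a GeoJSON-like dict for an object or a plain mapping."""
    if hasattr(obj, "__geo_interface__"):
        return obj.__geo_interface__
    if isinstance(obj, dict) and "type" in obj and "coordinates" in obj:
        return obj
    raise TypeError("object does not provide a GeoJSON-like mapping")


def _flatten(coords):
    """Yield individual coordinate tuples from arbitrarily nested lists."""
    if isinstance(coords[0], (int, float)):
        yield coords
    else:
        for part in coords:
            yield from _flatten(part)


def bounds(obj):
    """Compute (minx, miny, maxx, maxy) using only the mapping."""
    points = list(_flatten(as_mapping(obj)["coordinates"]))
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))


if __name__ == "__main__":
    print(bounds(SimplePoint(1, 2)))
    triangle = {"type": "Polygon",
                "coordinates": [[(0, 0), (2, 0), (2, 3), (0, 0)]]}
    print(bounds(triangle))
```

The consumer never touches a GEOS pointer: all coordinates are materialized as Python tuples on each exchange, which is why this route suits packages that avoid a GEOS dependency more than performance-sensitive code.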
My main concern is to have a clear story for the community of both end users and dependent packages. Should they use / support shapely or pygeos geometries? What if I am using pygeos geometries but then another package I want to use only supports shapely geometries? For that reason, I am personally hoping we can find some way to integrate both projects.
And a similar viewpoint from the GeoPandas side: we are planning to use PyGEOS under the hood to store the geometries in the geometry column and have faster operations on it. However, we are not planning to “get rid of” Shapely. For now, Shapely will still be the primary user-facing access to elements of a GeoDataFrame: when accessing a single element, users get a Shapely object (the stored pygeos object gets converted) to benefit from the rich functionality of Shapely scalar objects. This means that whatever inconsistency there is between PyGEOS and Shapely will be confusing for GeoPandas users. It also means that there will be two ways of getting the geometries out of a dataframe: as an array of shapely objects or an array of pygeos objects. None of that is an ideal situation from a GeoPandas viewpoint.
Issue Analytics
- Created 4 years ago
- Reactions:29
- Comments:24 (19 by maintainers)
Top GitHub Comments
We’ve had a pair of video calls to discuss these issues. Thank you @jorisvandenbossche and @caspervdw for organizing and thank you @snowman2 @snorfalorpagus and @mwtoews for joining us. Our consensus is to go ahead with the 4th option: replacing the existing vectorized module in shapely with code from pygeos and expanding the scope for vectorized operations. It is yet to be decided if shapely’s existing scalar operations will be a special case of more general vectorized operations. Also yet to be determined is a roadmap for developing and releasing.
Next steps: let’s spread the word about this and see if anyone else sees a flaw in the plan. Then let’s work on the roadmap and try to make some progress on this after the holidays.
Hi @jorisvandenbossche, I’m currently drowning in shapely issues in the 1.7 (master) branch, we’ve got a burst of them, and haven’t been able to get to this yet. I’ll respond to that comment above later today or tomorrow.