Integrating pygeos in GeoPandas for vectorized array operations
For context, see https://github.com/geopandas/geopandas/issues/430. pygeos (https://github.com/caspervdw/pygeos/) is a new package providing all GEOS functionality as vectorized functions operating on numpy arrays.
We can use this in GeoPandas to replace the Python loops over shapely objects, which should give a considerable performance boost (similar to the timings shown with the Cython branch). See https://github.com/geopandas/geopandas/pull/1154 for a proof of concept.
Notable things:
- pygeos has its own lightweight Geometry object, and it is an array of those that we store under the hood in a GeometryArray instead of an array of shapely objects.
  - For me, the idea is that this is (for now at least) mostly hidden from the user, and the public interface dealing with scalar geometry objects (e.g. when accessing a single element from a GeoSeries) still uses the familiar, feature-rich shapely object. This means that upon access, the pygeos Geometry is converted to a shapely geometry.
- My proof of concept PR (https://github.com/geopandas/geopandas/pull/1154) passes all our existing tests. The only change I needed to make was turning an identity check into an equality check, because accessing a single element gives a new shapely object each time (see above). So in theory, this should be almost fully backwards compatible.
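To illustrate why the identity check had to become an equality check, here is a minimal pure-Python sketch. The classes `LightPoint`, `RichPoint` and `ToyGeometryArray` are hypothetical stand-ins invented for this example; they are not the pygeos, shapely or GeoPandas classes, and only mimic the "convert on access" behaviour described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LightPoint:
    """Hypothetical stand-in for the lightweight object stored in the array."""
    x: float
    y: float


@dataclass(frozen=True)
class RichPoint:
    """Hypothetical stand-in for the feature-rich scalar handed to the user."""
    x: float
    y: float


class ToyGeometryArray:
    """Stores LightPoint objects and returns a fresh RichPoint on each access."""

    def __init__(self, points):
        self._data = list(points)

    def __getitem__(self, i):
        light = self._data[i]
        # A new rich object is built on every access (assumed behaviour).
        return RichPoint(light.x, light.y)


arr = ToyGeometryArray([LightPoint(0.0, 0.0), LightPoint(1.0, 2.0)])
first, again = arr[1], arr[1]

same_object = first is again  # False: two separate conversions
same_value = first == again   # True: same coordinates
```

Any test or user code that relied on `series[i] is series[i]` would break under such a scheme, while `==` comparisons keep working.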
But some questions that we need to discuss:
- Are we OK with a hard requirement on pygeos, or do we keep the current implementation as a fallback? (e.g. only use pygeos if it is installed)
- Given the relatively small diff in https://github.com/geopandas/geopandas/pull/1154, and the fact that the behaviour is almost the same, it seems possible to do this opt-in (or at least initially). But it of course adds complexity as then there are multiple implementations to maintain, so it is not my preferred solution (long term).
- Do we already want to use it now, or do we want to wait until the situation between shapely and pygeos gets cleared up? (we are still discussing to what extent it could be integrated in shapely) Given this uncertainty, that might be a reason to go for the opt-in solution for now.
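As a rough sketch of what the opt-in variant could look like: the names `HAS_PYGEOS`, `distance_to` and the `use_pygeos` argument are assumptions made up for this illustration, not the structure of the proof-of-concept PR. The fallback here is a plain `math` loop so the example runs without any geometry library; in GeoPandas the fallback would be the existing shapely-based loop.

```python
import math

try:
    import pygeos  # optional dependency
    HAS_PYGEOS = True
except ImportError:
    pygeos = None
    HAS_PYGEOS = False


def _distance_loop(xs, ys, x0, y0):
    # Fallback: element-by-element Python loop.
    return [math.hypot(x - x0, y - y0) for x, y in zip(xs, ys)]


def _distance_vectorized(xs, ys, x0, y0):
    # Vectorized path: one call operating on the whole array of points.
    pts = pygeos.points(xs, ys)
    target = pygeos.points(x0, y0)
    return pygeos.distance(pts, target).tolist()


def distance_to(xs, ys, x0, y0, use_pygeos=None):
    """Dispatch to pygeos when available (or requested), else to the loop."""
    if use_pygeos is None:
        use_pygeos = HAS_PYGEOS
    if use_pygeos and not HAS_PYGEOS:
        raise ImportError("pygeos was requested but is not installed")
    impl = _distance_vectorized if use_pygeos else _distance_loop
    return impl(xs, ys, x0, y0)


result = distance_to([3.0, 0.0], [4.0, 1.0], 0.0, 0.0)
```

Even in this tiny case every operation needs two code paths plus a dispatcher, which is the maintenance cost mentioned above.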
Thoughts / concerns / questions about this topic?
Top GitHub Comments
I created a dev package on conda-forge, making it a little bit easier to install and test the 0.8.0.dev version with those changes (although since geopandas is pure-python, installing from git master is also not hard):
gives you pygeos and the dev version of geopandas, so that pygeos should be used by geopandas.
Full switch! The maintainer community for geopandas is small enough that I don’t think doubled implementations make sense or are feasible. No one likes dependencies, but geopandas will never be lightweight anyway – I vote for accepting the dependencies in exchange for making geopandas easier to maintain / improve.