GeoPandas performance: optimizing vectorized operations
STATUS UPDATE: work on vectorized operations has been done in the PyGEOS package (to be integrated into Shapely), and optional support for this has landed in geopandas master in the meantime. See https://github.com/geopandas/geopandas/issues/1155, and https://geopandas.readthedocs.io/en/latest/install.html#using-the-optional-pygeos-dependency for how to test this.
There have been some issues recently about the performance of geopandas (e.g. https://github.com/geopandas/geopandas/issues/396), and once you have somewhat larger datasets this can be a problem. On the one hand, there are certainly algorithmic optimizations possible (more performant algorithms, better use of spatial indices, etc.; for example the current work on overlay in https://github.com/geopandas/geopandas/pull/338 and https://github.com/geopandas/geopandas/pull/429). But even then the basic operations can still be slow, making it difficult to ever reach really performant code.
The main problem with geopandas’ performance is that all vectorized operations in geopandas are just wrappers around for-loops / list comprehensions. For example:
geoseries.distance(point)
just calls the distance method of the underlying shapely objects many times in a python loop.
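To make the pattern concrete, here is a minimal sketch of what such a loop-based "vectorized" method amounts to. It is not the actual geopandas source: `distance_loop` is a hypothetical helper name, and the example assumes shapely and numpy are installed.

```python
import numpy as np
from shapely.geometry import Point


def distance_loop(geoms, other):
    """Hypothetical stand-in for a loop-based GeoSeries.distance.

    Calls the shapely ``distance`` method once per geometry, so the
    per-element overhead is paid in the Python interpreter.
    """
    return np.array([geom.distance(other) for geom in geoms], dtype=float)


geoms = [Point(0, 0), Point(3, 4), Point(6, 8)]
origin = Point(0, 0)
result = distance_loop(geoms, origin)
print(result)
```

Each iteration here goes through a Python method call before reaching GEOS, which is the overhead the rest of this issue is about.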
So the question then is: can’t we do this for-loop at a lower level? Shapely is a wrapper around the GEOS C library using ctypes (and partly cython), so it could be possible to do the loops at this lower level, or to use some of the other tools (cython, cffi, numba, …) to push the for-loop down, the same way the performant operations in pandas work.
So I decided to just try it out to explore what would be possible, using cython (based on the cython wrappers that are already available in shapely). The result of this experiment can be seen here: http://nbviewer.jupyter.org/gist/jorisvandenbossche/2e38c8ae14d273d9d7b117318e23c7ed
In summary, I get a speed-up of ca. 25x (and 250x when using integer _geom pointers) on this specific example.
It certainly shows that large speed-ups on basic operations are possible. However, I am not an expert in either cython or GEOS, so I am not sure this is a good approach. Still, the way I did it here for ‘contains’ can certainly be extended to the other methods, and it seems worth exploring further.
Some questions
- Is this a good approach? Is the way I use Cython here sound (apart from the fact that it could be cleaned up by a more knowledgeable cython coder)? Would something else be more appropriate / easier (numba, ctypes, cffi, …)?
- Is it generally safe to use those integer pointers (the `_geom` attribute)?
- Where would this functionality belong? (cc @sgillies)
  - There is already a `shapely.vectorized` module, and I think such vectorized operations for other methods would fit there as well (with the only difference that the current methods in `shapely.vectorized` do not work with arrays of geometries, but with x and y arrays).
  - Including it in shapely would make it easier to distribute (it already has all the machinery to handle the deps and compilation etc.), but would possibly expand its scope. Including it in geopandas would make it easier to experiment/expand things as needed, but would make installation/maintenance harder and would duplicate code from shapely.
- Are there people who would like to collaborate on this?

TLDR: we are working on the `pygeos` package, which provides vectorized GEOS operations and will provide a performance boost to GeoPandas.
An update on this (after a while …): although I updated the cython branch earlier this summer (https://github.com/geopandas/geopandas/pull/1030) to bring it up to date with the refactored internals in version 0.6.0 (based on ExtensionArrays; this refactor already brought parts of the cython branch into master), effort has now switched to the `pygeos` package (https://github.com/caspervdw/pygeos/), which will provide the same speed-ups but uses a different (and more general) approach. It was a happy coincidence that @caspervdw started working on this package at more or less the same time that I started looking again at this performance issue in geopandas!
Short summary of `pygeos`: it provides vectorized GEOS operations on numpy arrays of geometries, mostly using the numpy ufunc machinery. This means that we can use it in GeoPandas for performing the operations on the 1D geometry column, but it can also work in general on numpy arrays and make use of numpy broadcasting etc. (e.g. to automatically calculate all intersections or distances between combinations of 2 arrays, giving a 2D array as result). It implements a lightweight geometry python extension type wrapping a GEOSGeometry pointer (which knows how to deallocate itself). This makes it more robust than storing the raw pointers as in the cython approach, while the python extension type’s struct with the pointer is still accessible from C/Cython without much overhead.
So, we can use pygeos in GeoPandas to achieve a performance boost on the geospatial operations.
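As a rough illustration of the ufunc and broadcasting behaviour described above, the sketch below computes all pairwise distances between two small arrays of points. It assumes Shapely 2.0 or later, where the former pygeos functions are exposed as top-level `shapely` functions; with the standalone pygeos package the calls would be spelled `pygeos.points` and `pygeos.distance` instead. The point coordinates are made up for the example.

```python
import numpy as np
import shapely  # assumption: Shapely >= 2.0, which incorporates the pygeos functions

# Two 1D numpy arrays of point geometries (illustrative coordinates)
a = shapely.points([[0.0, 0.0], [3.0, 4.0]])              # shape (2,)
b = shapely.points([[0.0, 0.0], [0.0, 4.0], [3.0, 0.0]])  # shape (3,)

# Element-wise distance on equally shaped arrays, with no Python-level loop
elementwise = shapely.distance(a, b[:2])                  # shape (2,)

# Broadcasting a column of `a` against a row of `b` yields every combination
pairwise = shapely.distance(a[:, np.newaxis], b[np.newaxis, :])  # shape (2, 3)

print(elementwise)
print(pairwise)
```

The element-wise call corresponds to an operation on a single geometry column in GeoPandas, while the broadcast call is the "all combinations of 2 arrays" case that returns a 2D array.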
Some of the questions that have been discussed above are still relevant, such as: does this belong in Shapely itself, or outside (now in pygeos instead of geopandas)? If not, how do we ensure compatibility (both at the packaging level and for users interfacing with geometry objects)? Those are aspects that still need to be discussed further.
One specific aspect for geopandas to discuss is how to integrate with pygeos initially (require it, make it optional for now, …), but I will open a specific issue for that.
This is interesting. There’d be a lot of benefit in vectorizing or compiling those geopandas loops. For instance: I had to build a lot of hack-y spatial queries into OSMnx to speed up my spatial operations for working with millions of points and complex polygons.
It would be interesting to benchmark a numba solution against your cython solution, as it might be simpler to implement. You’d presumably have to fall back on numpy x-y arrays instead of shapely geometries to get the functions to compile with numba (since it only plays nicely with scalars and arrays). But new versions of numba might be able to compile classes.
For what it’s worth, the existing shapely vectorized module alone provides an order of magnitude speed increase over the python-loops implementation in your example code:
Loops:
Vectorized:
But integrating this with a spatial index probably opens a whole new can of worms.