GeoPandas performance: optimizing vectorized operations

STATUS UPDATE: work on vectorized operations has been done in the PyGEOS package (to be integrated into Shapely), and optional support for this has landed in geopandas master in the meantime. See https://github.com/geopandas/geopandas/issues/1155, and https://geopandas.readthedocs.io/en/latest/install.html#using-the-optional-pygeos-dependency on how to test this.

There have been some issues recently about the performance of geopandas (e.g. https://github.com/geopandas/geopandas/issues/396), and once you have somewhat larger datasets this can be a problem. On the one hand, there are certainly algorithmic optimizations possible (a more performant algorithm, better use of spatial indices, etc.); see for example the current work on overlay (https://github.com/geopandas/geopandas/pull/338, https://github.com/geopandas/geopandas/pull/429). But even then the basic operations can still be slow, making it difficult to ever reach really performant code.

The main problem with geopandas’ performance is that all vectorized operations in geopandas are just wrappers around for-loops / list comprehensions. For example:

geoseries.distance(point)

just calls the distance method of the underlying shapely objects many times in a python loop.
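To make the pattern concrete, here is a minimal sketch of the difference between a per-object Python loop and the same computation pushed down into compiled array code. It does not use geopandas or shapely: `Pt`, `distance_loop` and `distance_array` are hypothetical stand-ins invented for illustration, and the array version only works here because plain points can be expressed as coordinate arrays.

```python
import math

import numpy as np


class Pt:
    """Hypothetical stand-in for a geometry object with a distance method."""

    def __init__(self, x, y):
        self.x, self.y = x, y

    def distance(self, other):
        return math.hypot(self.x - other.x, self.y - other.y)


def distance_loop(geoms, other):
    # One Python-level method call per element: the pattern described above.
    return np.array([g.distance(other) for g in geoms])


def distance_array(xs, ys, other):
    # Same result on coordinate arrays; the loop runs inside NumPy's compiled code.
    return np.hypot(xs - other.x, ys - other.y)


xs = np.arange(5, dtype=float)
ys = np.zeros(5)
geoms = [Pt(x, y) for x, y in zip(xs, ys)]
target = Pt(0.0, 0.0)

loop_result = distance_loop(geoms, target)
array_result = distance_array(xs, ys, target)
```

Both functions return the same values; only the place where the loop executes differs.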

So the question then is: can’t we do this for loop at a lower level? Shapely is a wrapper around the GEOS C library using ctypes (and partly Cython), so it could be possible to do the loops at this lower level, or to use one of the other tools (Cython, cffi, numba, …) to push the for loop down to a lower level, the same way the performant operations in pandas work.

So I thought I would just try it out to explore what is possible, and I tried it with Cython (based on the Cython wrappers available in shapely). The result of this experiment can be seen here: http://nbviewer.jupyter.org/gist/jorisvandenbossche/2e38c8ae14d273d9d7b117318e23c7ed In summary, I get a speed-up of ca. 25x (and 250x when using integer _geom pointers) on this specific example.

It certainly shows that large speed-ups on basic operations are possible. However, I am not an expert in either Cython or GEOS, so I am not sure this is a good approach. But the way I did it here for ‘contains’ can certainly be extended to the other methods, and it seems worth exploring further.

Some questions

Is this a good approach? Is the way I use Cython here sound (apart from the fact that it could be cleaned up by a more knowledgeable Cython coder)? Would something else be more appropriate / easier (numba, ctypes, cffi, …)?
Is it generally safe to use those integer pointers (the _geom attribute)?
Where would this functionality belong? (cc @sgillies)
- There is already a shapely.vectorized module, and I think such vectorized operations for other methods would suit there as well (with the only difference that the current methods in shapely.vectorized do not work with arrays of geometries, but with x and y arrays).
- Including it in shapely would make it easier to distribute (it already has all the machinery to handle the dependencies, compilation, etc.), but would possibly expand its scope. Including it in geopandas would make it easier to experiment with and expand things as needed, but would make installation and maintenance harder and would duplicate code from shapely.
Are there people who would like to collaborate on this?

cc @ozak @kuanb @gboeing

Top GitHub Comments

6 reactions

jorisvandenbossche commented, Oct 12, 2019

TLDR: we are working on the pygeos package that provides vectorized GEOS operations and will provide a performance boost to GeoPandas.

An update on this (after a while …): earlier this summer I updated the cython branch (https://github.com/geopandas/geopandas/pull/1030) to bring it up to date with the refactored internals in version 0.6.0 (based on ExtensionArrays; this refactor already brought parts of the cython branch into master). However, effort has now switched to the pygeos package (https://github.com/caspervdw/pygeos/), which will provide the same speed-ups but uses a different (and more general) approach. It was a happy coincidence that @caspervdw started working on this package at more or less the same time that I started looking again at this performance issue in geopandas!

Short summary of pygeos: it provides vectorized GEOS operations on numpy arrays of geometries, mostly using the numpy ufunc machinery. This means that we can use it in GeoPandas to perform the operations on the 1D geometry column, but it also works on numpy arrays in general and can make use of numpy broadcasting etc. (e.g. to automatically calculate all intersections or distances between combinations of 2 arrays, giving a 2D array as result). It implements a lightweight geometry Python extension type wrapping a GEOSGeometry pointer (which knows how to deallocate itself). This makes it more robust than storing the raw pointers as in the Cython approach, while the Python extension type’s struct with the pointer is still accessible from C/Cython without much overhead.
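The broadcasting idea mentioned above (all combinations of two arrays giving a 2D result) can be sketched with plain NumPy. This is only an illustration of the broadcasting mechanics on coordinate arrays; it does not call pygeos, and the point coordinates are made up for the example.

```python
import numpy as np

# Two sets of points given as (x, y) rows; values chosen for illustration.
a = np.array([[0.0, 0.0], [3.0, 4.0]])              # 2 points
b = np.array([[0.0, 0.0], [6.0, 8.0], [3.0, 0.0]])  # 3 points

# Insert length-1 axes so that subtraction broadcasts to every (a, b) pair.
diff = a[:, np.newaxis, :] - b[np.newaxis, :, :]    # shape (2, 3, 2)

# Euclidean distance for each pair: element [i, j] is the distance a[i] to b[j].
pairwise = np.hypot(diff[..., 0], diff[..., 1])     # shape (2, 3)
```

A ufunc that operates on arrays of geometries can rely on the same broadcasting rules, which is what makes the 2D "all pairs" result come for free.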

So, we can use pygeos in GeoPandas to achieve a performance boost on the geospatial operations.

Some of the questions discussed above are still relevant, such as: does this belong in Shapely itself, or outside (now in pygeos instead of geopandas)? If not, how do we ensure compatibility (both on a packaging level and for users interfacing with geometry objects)? These aspects still need to be discussed further.

One specific aspect for geopandas to discuss is how to integrate with pygeos initially (require it, make it optional for now, …), but I will open a separate issue for that.

3 reactions

gboeing commented, Apr 2, 2017

This is interesting. There’d be a lot of benefit in vectorizing or compiling those geopandas loops. For instance: I had to build a lot of hack-y spatial queries into OSMnx to speed up my spatial operations for working with millions of points and complex polygons.

It would be interesting to benchmark a numba solution against your cython solution, as it might be simpler to implement. You’d presumably have to fall back on numpy x-y arrays instead of shapely geometries to get the functions to compile with numba (since it only plays nicely with scalars and arrays). But new versions of numba might be able to compile classes.

For what it’s worth, the existing shapely vectorized module alone provides an order of magnitude speed increase over the python-loops implementation in your example code:

import numpy as np, shapely.vectorized as sv
from shapely.geometry import Point, Polygon
polygon = Polygon([(10,10), (10,100), (100,100), (100, 10)])
xy = np.array([(n, n) for n in range(10000)])
points = [Point(x, y) for x, y in xy]

Loops:

def contains_py(polygon, points):
    return np.array([polygon.contains(point) for point in points])
%timeit contains_py(polygon, points)

10 loops, best of 3: 40.1 ms per loop

Vectorized:

%timeit sv.contains(polygon, x=xy[:,0], y=xy[:,1])

100 loops, best of 3: 3.09 ms per loop

But integrating this with a spatial index probably opens a whole new can of worms.
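As a rough illustration of what "falling back on numpy x-y arrays" can look like, the sketch below answers the same containment question with array comparisons only. It is not a general point-in-polygon test: it works solely because the polygon in the example above happens to be an axis-aligned square, and the strict inequalities are chosen on the assumption that points on the boundary should not count as contained.

```python
import numpy as np

# Same diagonal points as in the example above: (0, 0), (1, 1), ..., (9999, 9999).
xy = np.array([(n, n) for n in range(10000)])
x, y = xy[:, 0], xy[:, 1]

# Bounds of the example square, from (10, 10) to (100, 100).
xmin, ymin, xmax, ymax = 10, 10, 100, 100

# Strict comparisons exclude points lying exactly on the boundary.
inside = (x > xmin) & (x < xmax) & (y > ymin) & (y < ymax)
```

Functions of this shape (scalars and arrays in, arrays out) are the kind that tools like numba are generally able to compile, which is the trade-off raised in the comment above.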
