Performance of a direct vs sequential union_all

Hi,

I have to dissolve a large number (millions) of dense polygons and am trying to figure out how to parallelize it. While doing so, I ran into strange performance behaviour in union_all (which I guess is inherited from GEOS).

It seems that calling union_all directly on the whole array is the slowest possible option. If I union the array in parts and then merge the parts together (which results in the same polygon), I can get there up to 15x faster (and maybe even more). Consider the example below, where I test different chunk sizes for the initial sequential union.

With large arrays, it may even be worth doing the chunk-based union twice in a cascaded way. What puzzles me is why this happens and why something like it is not implemented directly in GEOS, since it can provide a significant performance boost. Do you have any idea? And any idea how to come up with a heuristic for this (which could then be parallelised)? Even splitting the array into two parts can give a 2x speedup. I haven’t found anything describing this behaviour.
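The cascaded idea can be sketched as a chunk-wise reduction that repeats until a single result is left. This is only an illustration under assumptions: `cascaded_reduce`, `chunk_size` and `union_sets` are hypothetical names, and the reducer is passed in as a callable so that something like `pygeos.union_all` could be plugged in for geometries. The demo uses plain Python sets so that it runs without GEOS.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def cascaded_reduce(items: Sequence[T], chunk_size: int,
                    reduce_fn: Callable[[Sequence[T]], T]) -> T:
    """Reduce `items` chunk by chunk, repeating until one result is left."""
    if chunk_size < 2:
        # With chunk_size 1 the list would never get shorter.
        raise ValueError("chunk_size must be at least 2")
    level = list(items)
    if not level:
        raise ValueError("items must not be empty")
    while len(level) > 1:
        # Each pass merges neighbouring chunks into one partial result.
        level = [reduce_fn(level[i:i + chunk_size])
                 for i in range(0, len(level), chunk_size)]
    return level[0]


def union_sets(chunk):
    """Stand-in reducer for the demo: union of a chunk of frozensets."""
    return frozenset().union(*chunk)


parts = [frozenset({i, i + 1}) for i in range(10)]
merged = cascaded_reduce(parts, chunk_size=3, reduce_fn=union_sets)
```

For geometries, the call would presumably look like `cascaded_reduce(geoms, 100, pygeos.union_all)`; the chunk size of 100 is an arbitrary placeholder, not a measured optimum.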

If we figure out how to manage this, we could even reimplement it as a custom, performant version of union_all.
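A parallel variant might look roughly like the sketch below: reduce the chunks concurrently, then merge the partial results in one final step. `parallel_chunked_reduce`, `max_workers` and `union_sets` are hypothetical names. Threads only help if the reducer releases the GIL; whether that holds for `pygeos.union_all` is an assumption that would need checking, and a process pool is the alternative at the cost of pickling the geometries. The demo again uses sets so it runs without GEOS.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_chunked_reduce(items, chunk_size, reduce_fn, max_workers=4):
    """Reduce chunks concurrently, then merge the partial results once."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    items = list(items)
    if not items:
        raise ValueError("items must not be empty")
    chunks = [items[i:i + chunk_size]
              for i in range(0, len(items), chunk_size)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps the chunk order, so partials line up with chunks.
        partials = list(pool.map(reduce_fn, chunks))
    # Final sequential merge of the per-chunk results.
    return reduce_fn(partials)


def union_sets(chunk):
    """Stand-in reducer for the demo: union of a chunk of frozensets."""
    return frozenset().union(*chunk)


parts = [frozenset({i, i + 1}) for i in range(20)]
result = parallel_chunked_reduce(parts, chunk_size=5, reduce_fn=union_sets)
```

The final merge is still sequential, so the benefit depends on how much of the total time is spent in the per-chunk step.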

The time to do pygeos.union_all(geoms) directly is 131s.

from time import time

import geopandas as gpd
import pandas as pd
import pygeos

sample = gpd.read_parquet("https://www.dropbox.com/s/oxy2h2pb0m3no5d/sample.pq?dl=1")

# Underlying array of pygeos geometries
geoms = sample.geometry.values.data

# One row per chunk size, holding the elapsed time in seconds
times = pd.DataFrame(columns=['time'])

size = geoms.shape[0]

for split in [2, 5, 10, 25, 50, 100, 150, 200, 350, 500, 1000, 1500]:
    s = time()
    # Union each chunk of `split` geometries, then union the partial results
    unioned = pygeos.union_all(
        [pygeos.union_all(geoms[edge : edge + split]) for edge in range(0, size, split)]
    )
    times.loc[split] = time() - s

times.plot(xlabel="split size", ylabel='time')

[Figure: union time plotted against split size (screenshot from 2021-02-02)]

Issue Analytics

Created 3 years ago
Comments: 12 (10 by maintainers)

Top GitHub Comments

martinfleis commented, Feb 2, 2021

Wow! That is insane. Just got the same result here. That means a ~40x speedup between GEOS 3.8.1 and 3.9. I guess today’s effort, although interesting, is no longer relevant 😄.

martinfleis commented, Feb 3, 2021

I am going to close this issue as resolved by GEOS 3.9. There is still some difference between the direct and chunked union, but it is not stable (I get a slightly different outcome on every run) and it oscillates within a ±20% margin. Thank you, everyone, for a fascinating discussion!
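For anyone who wants to check whether a remaining difference is real or just run-to-run noise, repeating each measurement and comparing medians is one option. A minimal sketch, assuming hypothetical helpers `time_repeated` and `summarize`; the workload in the demo is a placeholder and would be replaced by the direct or chunked union call.

```python
import statistics
from time import perf_counter


def time_repeated(func, repeats=5):
    """Call `func` `repeats` times and return each wall-clock duration."""
    durations = []
    for _ in range(repeats):
        start = perf_counter()
        func()
        durations.append(perf_counter() - start)
    return durations


def summarize(durations):
    """Return the median and the (max - min) spread relative to the median."""
    median = statistics.median(durations)
    spread = (max(durations) - min(durations)) / median if median else 0.0
    return median, spread


# Placeholder workload; swap in e.g. lambda: pygeos.union_all(geoms).
durations = time_repeated(lambda: sum(range(100_000)), repeats=5)
median, spread = summarize(durations)
```

If the spread of each variant is as large as the gap between their medians, the difference is probably not meaningful.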