Performance of a direct vs sequential union_all

Hi,

I have to dissolve a large number (millions) of dense polygons and am trying to figure out how to parallelize it. While doing so, I ran into strange behaviour in union_all performance (which I guess is inherited from GEOS).

It seems that using union_all directly on the whole array is the slowest possible option. If I union the parts first and then merge those partial results together (which gives the same polygon), I can get there up to 15x faster (and maybe even more). Consider the example below, where I check different chunk sizes for the initial sequential union.

With large arrays, it may even be worth doing the chunk-based union twice in a cascaded way. What puzzles me is why this happens and why something like this is not implemented directly in GEOS, since it can provide a significant performance boost. Do you have any idea? And any idea how to come up with a heuristic for this (which could then be parallelised)? Even splitting the array into two parts can give a 2x speedup. I haven’t found anything describing this behaviour.

If we figure out how to manage this, we could even implement it as a custom, more performant version of union_all.
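To make the cascaded idea concrete, here is a minimal sketch of what such a helper might look like. The name `cascaded_union`, its parameters, and the use of a thread pool are all assumptions for illustration, not anything that exists in pygeos or GEOS. The union function is passed in, so `pygeos.union_all` could be supplied for real geometries; whether threads give any speedup depends on that function releasing the GIL, which is not verified here.

```python
from concurrent.futures import ThreadPoolExecutor


def cascaded_union(items, union_all, chunk_size, max_workers=None):
    """Reduce `items` to one result by repeatedly unioning fixed-size chunks.

    `union_all` is any callable that merges a list of items into one
    (hypothetical usage: pass `pygeos.union_all`).
    """
    if chunk_size < 2:
        raise ValueError("chunk_size must be at least 2")
    items = list(items)
    if not items:
        raise ValueError("nothing to union")

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while True:
            # Split the current level into chunks of at most `chunk_size` items
            chunks = [
                items[i : i + chunk_size] for i in range(0, len(items), chunk_size)
            ]
            # Union each chunk independently; map() preserves chunk order
            items = list(pool.map(union_all, chunks))
            if len(items) == 1:
                return items[0]


if __name__ == "__main__":
    # Stand-in for geometries: frozensets merged with set union
    def set_union(chunk):
        return frozenset().union(*chunk)

    parts = [frozenset({i}) for i in range(10)]
    merged = cascaded_union(parts, set_union, chunk_size=3)
    print(sorted(merged))
```

With real data, the call would presumably look like `cascaded_union(geoms, pygeos.union_all, chunk_size=100)`; the best chunk size is exactly the open question raised above, so any value used here is a guess to be tuned by measurement.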

For reference, calling pygeos.union_all(geoms) directly on the whole array takes 131 s.

from time import time

import geopandas as gpd
import pandas as pd  # missing in the original snippet; needed for pd.DataFrame
import pygeos

sample = gpd.read_parquet("https://www.dropbox.com/s/oxy2h2pb0m3no5d/sample.pq?dl=1")

# Underlying geometry array (assumes geopandas is using the pygeos backend)
geoms = sample.geometry.values.data

times = pd.DataFrame(columns=["time"])

size = geoms.shape[0]

# For each chunk size: union every chunk first, then union the partial results
for split in [2, 5, 10, 25, 50, 100, 150, 200, 350, 500, 1000, 1500]:
    s = time()
    unioned = pygeos.union_all(
        [pygeos.union_all(geoms[edge : edge + split]) for edge in range(0, size, split)]
    )
    times.loc[split] = time() - s

times.plot(xlabel="split size", ylabel="time (s)")

[Screenshot: plot of union time against split size]

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 12 (10 by maintainers)

Top GitHub Comments

martinfleis commented, Feb 2, 2021

Wow! That is insane. I just got the same result here. That means a ~40x speedup between GEOS 3.8.1 and 3.9. I guess today’s effort, although interesting, is no longer relevant 😄.

martinfleis commented, Feb 3, 2021

I am going to close this issue as resolved by GEOS 3.9. There is still some difference, but it is not stable (I get a slightly different outcome on every run) and it oscillates within a ±20% margin. Thank you, everyone, for a fascinating discussion!
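Since the remaining difference varies from run to run, one way to compare approaches more fairly is to repeat each measurement and look at the spread rather than a single timing. The helper below is only a sketch; `time_runs` and its parameters are hypothetical names, and the workload in the demo is a stand-in rather than a real union.

```python
from statistics import median
from time import perf_counter


def time_runs(func, repeats=5):
    """Call `func` several times and summarise the wall-clock durations."""
    durations = []
    for _ in range(repeats):
        start = perf_counter()
        func()
        durations.append(perf_counter() - start)
    # Report the spread so run-to-run noise is visible
    return {"min": min(durations), "median": median(durations), "max": max(durations)}


if __name__ == "__main__":
    # Stand-in workload; with real data this could be
    # lambda: pygeos.union_all(geoms)
    stats = time_runs(lambda: sum(range(100_000)), repeats=5)
    print(stats)
```

If the min-to-max ranges of two approaches overlap, the difference between them is probably within the noise described above.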
