Performance of a direct vs sequential union_all
Hi,
I have to dissolve a large number (millions) of dense polygons and am trying to figure out how to parallelize it. While doing that, I have encountered strange performance behaviour in union_all (which I guess is inherited from GEOS).
It seems that using union_all directly on the whole array is the slowest possible option. If I first union the array in parts and then merge the parts together (which results in the same polygon), I can get there up to 15x faster (and maybe even more). Consider the example below, where I check different chunk sizes for the initial sequential union.
With large arrays, it may even be worth doing the chunk-based union twice in a cascaded way. What puzzles me is why this happens, and why something like this is not implemented directly in GEOS, since it can provide a significant performance boost. Do you have any idea? And any idea how to come up with a heuristic for this (which could then be parallelised)? Even splitting the array into two parts can give a 2x speedup. I haven’t found anything describing this behaviour.
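The chunk-then-merge idea, and its cascaded variant, can be sketched in a library-agnostic way. This is only an illustration of the approach described above, not something GEOS or pygeos provides: the helper names `chunked_union` and `cascaded_chunked_union` are hypothetical, and the union function is passed in as a parameter so the same code works with any "union all of these" callable.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def chunked_union(
    items: Sequence[T], chunk_size: int, union_all: Callable[[Sequence[T]], T]
) -> T:
    """Union fixed-size chunks first, then union the partial results once."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be >= 1")
    partial = [
        union_all(items[i : i + chunk_size])
        for i in range(0, len(items), chunk_size)
    ]
    return union_all(partial)


def cascaded_chunked_union(
    items: Sequence[T], chunk_size: int, union_all: Callable[[Sequence[T]], T]
) -> T:
    """Repeat the chunked step until at most chunk_size parts remain."""
    if chunk_size < 2:
        # With chunks of one item the list would never shrink.
        raise ValueError("chunk_size must be >= 2")
    parts = list(items)
    while len(parts) > chunk_size:
        parts = [
            union_all(parts[i : i + chunk_size])
            for i in range(0, len(parts), chunk_size)
        ]
    return union_all(parts)
```

With pygeos this would be called as `cascaded_chunked_union(geoms, 100, pygeos.union_all)`. The chunk size of 100 is an arbitrary placeholder; the benchmark in this issue is exactly an attempt to find a good value.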
If we figure out a way to manage this, we can even reimplement it as a custom performant version of union_all.
The time to do pygeos.union_all(geoms) directly is 131s.
from time import time

import geopandas as gpd
import pandas as pd
import pygeos

sample = gpd.read_parquet("https://www.dropbox.com/s/oxy2h2pb0m3no5d/sample.pq?dl=1")
geoms = sample.geometry.values.data  # underlying array of pygeos geometries
times = pd.DataFrame(columns=["time"])
size = geoms.shape[0]

for split in [2, 5, 10, 25, 50, 100, 150, 200, 350, 500, 1000, 1500]:
    s = time()
    # Union each chunk of `split` geometries, then union the partial results.
    unioned = pygeos.union_all(
        [pygeos.union_all(geoms[edge : edge + split]) for edge in range(0, size, split)]
    )
    times.loc[split] = time() - s  # elapsed seconds for this chunk size

times.plot(xlabel="split size", ylabel="time (s)")
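Since the chunks in the loop above are independent of each other, the first pass is a natural candidate for parallelisation. The following is a minimal sketch under assumptions, not a tested recipe for pygeos: `parallel_chunked_union` is a hypothetical helper, and it uses threads on the assumption that the union function releases the GIL while it runs (the pygeos documentation says this of its functions, but it is worth verifying for your build; otherwise a process pool would be needed).

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_chunked_union(items, chunk_size, union_all, max_workers=None):
    """Union chunks concurrently, then union the partial results serially."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be >= 1")
    chunks = [
        items[i : i + chunk_size] for i in range(0, len(items), chunk_size)
    ]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves chunk order, although order does not matter
        # for a union.
        partial = list(pool.map(union_all, chunks))
    return union_all(partial)
```

A call such as `parallel_chunked_union(geoms, 100, pygeos.union_all, max_workers=4)` would mirror the benchmark above; both numbers are placeholders. The final merge still runs in a single thread, so the achievable speedup depends on how expensive that last step is.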
Issue Analytics
- Created 3 years ago
- Comments:12 (10 by maintainers)
Top GitHub Comments
Wow! That is insane. Just got the same result here. That means a ~40x speedup between GEOS 3.8.1 and 3.9. I guess today’s effort, although interesting, is no longer relevant 😄.
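Given that the slowdown depends on the GEOS version, code that must run on both old and new installations could choose its strategy at runtime. This is a hedged sketch: `pick_union_strategy` and its default chunk size are hypothetical, and the 3.9 threshold is taken only from the result reported in this thread.

```python
def pick_union_strategy(geos_version, chunk_size=100):
    """Return ("direct", None) on GEOS >= 3.9, else ("chunked", chunk_size).

    geos_version is a tuple such as (3, 8, 1).
    """
    if tuple(geos_version[:2]) >= (3, 9):
        return ("direct", None)
    return ("chunked", chunk_size)
```

With pygeos, the version tuple is available as `pygeos.geos_version`, so the call would be `pick_union_strategy(pygeos.geos_version)`.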
I am going to close this issue as resolved by GEOS 3.9. There is still some difference, but it is not stable (I get a slightly different outcome every run) and it oscillates within a ±20% margin. Thank you, everyone, for a fascinating discussion!
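When timings fluctuate from run to run like this, repeating each measurement makes it easier to tell a real difference from noise. A small sketch using only the standard library follows; the helper names `time_repeated` and `relative_spread` are hypothetical and are not part of the original benchmark.

```python
from statistics import median
from time import perf_counter


def time_repeated(func, repeats=5):
    """Call func() `repeats` times and return each duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = perf_counter()
        func()
        durations.append(perf_counter() - start)
    return durations


def relative_spread(durations):
    """Half the min-max range divided by the median (0.2 means about ±20%)."""
    mid = median(durations)
    if mid <= 0:
        return 0.0
    return (max(durations) - min(durations)) / (2 * mid)
```

For example, `relative_spread(time_repeated(lambda: pygeos.union_all(geoms)))` would give the spread for the direct union. If two strategies differ by less than that spread, the difference is probably not meaningful.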