[cython] specific case where new sjoin is much slower
@andreas-h reported a use case where the sjoin from the geopandas-cython branch is much slower than the current released version: https://gist.github.com/andreas-h/4906aea5d8ecffc9751e191cd11d00b4
I ran it locally and can confirm this. It is joining 20,000 points with 44,000 polygons (this takes only ca. 5s on master, but 30-60s on the cython branch).
I tried to profile it, but the profile seems to indicate that virtually all time is spent within the cython cysjoin function (and thus the C sjoin function). That is also strange, because the actual pandas code in the user-facing sjoin function should take some time as well. I have not yet checked that the results of both versions are the same; possibly one of the two implementations is doing something wrong.
cc @mrocklin
@andreas-h could you simplify the example a little bit, so that it does not depend on the emiprepr library? (e.g. just construct the polygons directly inside the notebook)
Issue Analytics
- Created 6 years ago
- Comments:17 (16 by maintainers)

I re-ran these tests (gist). I’m posting here as well as in #1344 to try and give some closure to this issue.
Namely, I added PyGEOS, which also uses GEOS’ STRTree but with different Python bindings and geometry data structures:

So it seems to me that most of the slowdown comes from Shapely/Python stuff, not GEOS.
@adriangb thanks for testing! Hmm, something must have been wrong in the old c/cython implementation of this (although it is using almost the same code / approach as what we have in pygeos now).
But I can confirm your findings, as I also ran my original notebook, and the sjoin on those data in master now takes around 4s, and doing the bulk query with pygeos takes less than 200ms (while this took 30s in the notebook with the old c/cython implementation):

The bulk query here still needs to index the dataframes and merge them, to be equivalent to the sjoin, but that’s not very expensive (not seconds, at least).
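
To illustrate what "index the dataframes and merge them" means, here is a minimal, stdlib-only sketch. It is not the geopandas or pygeos code: `bulk_query` is a hypothetical brute-force stand-in for a spatial-index bulk query, the polygons are assumed to be axis-aligned boxes, and the tables are plain lists of dicts rather than dataframes. The only point is that a bulk query returns pairs of integer positions, and the join itself is just a lookup on both sides followed by combining the attributes.

```python
# Hypothetical stand-in for a spatial-index bulk query (brute force, boxes only).
# A real implementation would query an STRtree instead of looping over all pairs.
def bulk_query(points, boxes):
    """Return (point_positions, box_positions) for every point inside a box."""
    left_idx, right_idx = [], []
    for i, (x, y) in enumerate(points):
        for j, (xmin, ymin, xmax, ymax) in enumerate(boxes):
            if xmin <= x <= xmax and ymin <= y <= ymax:
                left_idx.append(i)
                right_idx.append(j)
    return left_idx, right_idx


def join_from_pairs(left_rows, right_rows, left_idx, right_idx):
    """Index both tables by the matched positions and merge their attributes."""
    joined = []
    for i, j in zip(left_idx, right_idx):
        row = dict(left_rows[i])      # attributes of the matched point
        row["index_right"] = j        # position of the matched polygon
        row.update(right_rows[j])     # attributes of the matched polygon
        joined.append(row)
    return joined


# Made-up toy data: three points, three boxes (xmin, ymin, xmax, ymax).
points = [(0.5, 0.5), (2.5, 2.5), (9.0, 9.0)]
point_rows = [{"point_id": "a"}, {"point_id": "b"}, {"point_id": "c"}]
boxes = [(0, 0, 1, 1), (2, 2, 3, 3), (0, 0, 3, 3)]
box_rows = [{"zone": "north"}, {"zone": "south"}, {"zone": "all"}]

left_idx, right_idx = bulk_query(points, boxes)
result = join_from_pairs(point_rows, box_rows, left_idx, right_idx)
for row in result:
    print(row)
```

The merge step is a single pass over the matched pairs, which fits the observation that it is cheap compared with the spatial query itself.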