[cython] specific case where new sjoin is much slower

@andreas-h reported a use case where the sjoin from the geopandas-cython branch is much slower than the current released version: https://gist.github.com/andreas-h/4906aea5d8ecffc9751e191cd11d00b4

I ran it locally and I can confirm this. It is joining 20,000 points with 44,000 polygons (this only takes ca 5s on master, but 30-60s on the cython branch).

I tried to profile it, but the profile indicates that virtually all of the time is spent within the cython cysjoin function (and thus the C sjoin function). That is strange, because the actual pandas code in the user-facing sjoin function should also take some time. I have not yet checked that the results of both versions are the same; possibly one of the two implementations is doing something wrong.
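As a rough illustration of the kind of check described above, the sketch below uses the standard-library cProfile to see how time splits between a compiled call and the Python code around it. The functions `native_side`, `python_side` and `join` are hypothetical stand-ins, not geopandas code; the built-in `sorted` plays the role of the compiled routine, which the profiler reports as a single opaque entry.

```python
import cProfile
import io
import pstats


def native_side(n):
    # Hypothetical stand-in for the compiled join: sorted() is implemented in C,
    # so the profiler cannot see inside it and reports one built-in entry.
    return sorted(range(n, 0, -1))


def python_side(n):
    # Hypothetical stand-in for the Python/pandas post-processing.
    return [i * 2 for i in range(n)]


def join(n):
    # Hypothetical user-facing function that calls both parts.
    native_side(n)
    python_side(n)


profiler = cProfile.Profile()
profiler.enable()
join(200_000)
profiler.disable()

# Collect the report as text, sorted by cumulative time per function.
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(10)
report = buffer.getvalue()
print(report)
```

If the Python-side rows were missing from such a report, or showed close to zero time, that would point at the profiling setup or at the compiled call dominating the run.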

cc @mrocklin

@andreas-h could you simplify the example a little bit? (so that it does not depend on the emiprepr library, e.g. by constructing the polygons directly inside the notebook)

Issue Analytics

  • State: open
  • Created 6 years ago
  • Comments: 17 (16 by maintainers)

Top GitHub Comments

1 reaction
adriangb commented, Mar 25, 2020

I re-ran these tests (gist), I’m posting here as well as in #1344 to try and give some closure to this issue.

Namely, I added PyGEOS, which also uses GEOS’ STRtree but with different Python bindings and geometry data structures: [figures: build, query]

So it seems to me that most of the slowdown comes from Shapely/Python stuff, not GEOS.

0 reactions
jorisvandenbossche commented, Mar 26, 2020

@adriangb thanks for testing! Hmm, something must have been wrong in the old c/cython implementation of this (although it is using almost the same code / approach as what we have in pygeos now).

But I can confirm your findings, as I also ran my original notebook, and the sjoin on those data in master now takes around 4s, and doing the bulk query with pygeos takes less than 200ms (while this took 30s in the notebook with the old c/cython implementation):

In [12]: %time joined = geopandas.sjoin(rgeoms, grid, op='within')
CPU times: user 3.43 s, sys: 3.05 ms, total: 3.44 s
Wall time: 3.45 s

In [13]: %%timeit
    ...: tree = pygeos.STRtree(array_grid)
    ...: idx1, idx2 = tree.query_bulk(array_rgeoms, predicate="within")
179 ms ± 29.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The bulk query here still needs to index the dataframes and merge them, to be equivalent to the sjoin, but that’s not very expensive (not seconds, at least).
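The index-and-merge step mentioned above could look roughly like the sketch below. It is only a sketch: the two small frames and the index arrays are made-up stand-ins (geometry columns omitted), `pairs_to_join` is a hypothetical helper rather than a geopandas function, and it assumes `idx1` holds positions into `rgeoms` and `idx2` positions into `grid`, as the call in the timing above suggests.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for the two frames in the thread (geometry columns omitted).
rgeoms = pd.DataFrame({"point_id": [10, 11, 12]}, index=["a", "b", "c"])
grid = pd.DataFrame({"cell": ["x", "y"]}, index=[100, 101])

# Made-up result of a bulk query: one (left position, right position) pair per match.
idx1 = np.array([0, 2, 2])
idx2 = np.array([1, 0, 1])


def pairs_to_join(left, right, left_pos, right_pos):
    """Hypothetical helper: turn positional match pairs into an inner-join frame.

    Assumes the two frames share no column names.
    """
    # Take the matching left rows; the left index is kept, repeated per match.
    out = left.iloc[left_pos].copy()
    # Record which right row each left row matched.
    out["index_right"] = right.index.to_numpy()[right_pos]
    # Copy the right-hand columns positionally, bypassing index alignment.
    for col in right.columns:
        out[col] = right[col].to_numpy()[right_pos]
    return out


joined = pairs_to_join(rgeoms, grid, idx1, idx2)
print(joined)
```

This only involves positional indexing and column assignment, which fits the remark that the step is cheap compared with the query itself.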
