Update setting data pointers for Cython 3


The following locations need to be updated:

  • _libs/window/aggregations.pyx: bufarr.data
  • _libs/reduction.pyx: chunk.data

https://dev.azure.com/pandas-dev/pandas/_build/results?buildId=34888&view=logs&j=33ccdd52-c922-5ef2-8209-78215e36d994&t=046ffbec-62c2-54e3-e88a-7745f4292bb6&l=457

just started failing.

cc @pandas-dev/pandas-core @pandas-dev/pandas-triage

if anyone has insights

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 34 (33 by maintainers)

Top GitHub Comments

tacaswell commented, Sep 11, 2021

Well, “rip it all out” is one way to fix it!

tacaswell commented, Nov 14, 2020

I spent some more time today looking at what this is doing and came up with the following notes (I apologize if I am saying really obvious things).

The Slider and BlockSlider classes are implementing views into a numpy array by:

  • taking in two arrays
  • stashing the pointer to one of them
  • on demand mutating the second array to point at the (offset) pointer to the first and adjusting the strides to give the window
  • on the way out the old guts of the second array are put back

The mutated array is then used to update “cached objects” in their calling class by updating the pandas-side block manager details. I suspect that this is the source of the statsmodels issues mentioned above, as the code is aggressively changing things underneath the eventual user-exposed objects.

The change that has broken things is that Cython now disallows replacing the guts of a NumPy array (which seems fair!). My guess is that the performance gains come from both not thrashing memory and not falling back to the Python layer. The Cython docs say that when you do [] on a NumPy array it falls back to Python (I assume because the inputs are too variable), which is probably the source of the major performance regressions.

I am not super clear on how the NumPy nditer interface works, but it looks like it is focused on getting an iterator over single elements, or at least fixed steps through the array, whereas for this code we need iteration over variable-size windows.
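To make "iteration over variable-size windows" concrete, here is a minimal sketch in plain Python. It uses the standard library's `memoryview` as a stand-in for NumPy array views, and the helper name `iter_windows` and the window bounds are hypothetical; it is not the pandas implementation, only an illustration of handing out windows of differing lengths without copying.

```python
from array import array


def iter_windows(values, starts, ends):
    """Yield one zero-copy view per (start, end) pair.

    Hypothetical helper: windows may differ in length, unlike a
    fixed-step iterator.
    """
    buf = memoryview(values)
    for start, end in zip(starts, ends):
        # Slicing a memoryview does not copy the underlying data.
        yield buf[start:end]


data = array("d", [1.0, 2.0, 3.0, 4.0, 5.0])
starts = [0, 0, 1, 3]  # window start offsets (made-up example values)
ends = [1, 3, 4, 5]    # window end offsets, exclusive

# Reduce each window; here the reduction is a plain sum.
sums = [sum(window) for window in iter_windows(data, starts, ends)]
```

Each yielded window is a fresh view object, so nothing about an existing array has to be mutated to move from one window to the next.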

It looks like the way to do this is with memoryviews ( https://cython.readthedocs.io/en/latest/src/userguide/numpy_tutorial.html#efficient-indexing-with-memoryviews ), but those seem to require knowing what the type is up front.

My suspicion is that the right solution here is to do something like what @mattip suggested above: in def move, use the pointers we have to the underlying data and fabricate new NumPy arrays of just the sub-section that is needed.
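As a rough illustration of "fabricate a new array over just the needed sub-section" (as opposed to re-pointing an existing array), the sketch below uses only the Python standard library. `ctypes` arrays created with `from_buffer` stand in for NumPy arrays built from a data pointer, and the function name `window_view` is hypothetical. The real fix would live in Cython and use NumPy's own APIs, which this does not attempt to show.

```python
import ctypes
from array import array


def window_view(buf, start, end):
    """Build a new array object that shares memory with buf[start:end].

    Hypothetical helper: the source buffer is left untouched; only a new,
    smaller array object is created at a byte offset into it.
    """
    length = end - start
    itemsize = ctypes.sizeof(ctypes.c_double)
    # from_buffer(source, offset) shares the source's memory, no copy is made.
    return (ctypes.c_double * length).from_buffer(buf, start * itemsize)


data = array("d", [0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
window = window_view(data, 2, 5)
window_values = list(window)

# Writes through the window are visible in the original buffer.
window[0] = 20.0
```

Because each call returns a new object, there is no "put the old guts back" step on the way out; the trade-off is one small object allocation per window.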

These classes appear to be used only internally by the reduction module, so I do not think there are any back-compatibility concerns with completely rewriting them.

attn @scoder for guidance on which of these methods (or one I do not see) is the best path.


