Update setting data pointers for Cython 3

Need to update the following locations

_libs/window/aggregations.pyx: bufarr.data
_libs/reduction.pyx: chunk.data

https://dev.azure.com/pandas-dev/pandas/_build/results?buildId=34888&view=logs&j=33ccdd52-c922-5ef2-8209-78215e36d994&t=046ffbec-62c2-54e3-e88a-7745f4292bb6&l=457 just started failing.

cc @pandas-dev/pandas-core @pandas-dev/pandas-triage if anyone has insights

Issue Analytics

State:
Created 3 years ago
Comments: 34 (33 by maintainers)

Top GitHub Comments

tacaswell commented, Sep 11, 2021

Well, “rip it all out” is one way to fix it!

tacaswell commented, Nov 14, 2020

I spent some more time today looking at what this is doing and came up with the following notes (I apologize if I am saying really obvious things).

The Slider and BlockSlider classes are implementing views into a numpy array by:

taking in two arrays
stashing the pointer to one of them
on demand mutating the second array to point at the (offset) pointer to the first and adjusting the strides to give the window
on the way out the old guts of the second array are put back
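As a rough illustration of the steps above, here is a minimal pure-Python sketch of the same "movable window" idea. It is not the pandas implementation: the class name WindowSlider and its move/reset methods are hypothetical, and a built-in memoryview over an array.array stands in for the NumPy array whose data pointer the real code rewrites.

```python
from array import array


class WindowSlider:
    """Hypothetical stand-in for the Slider idea: a movable window over one buffer."""

    def __init__(self, values):
        # Stash a handle on the full buffer (the real code stashes a raw pointer).
        self._base = memoryview(values)
        # The "second array" that callers look at; initially the whole buffer.
        self.window = self._base

    def move(self, start, end):
        # Re-point the window at an offset sub-range; no data is copied.
        self.window = self._base[start:end]

    def reset(self):
        # On the way out, put the original view back.
        self.window = self._base


values = array("d", [1.0, 2.0, 3.0, 4.0, 5.0])
slider = WindowSlider(values)

slider.move(1, 4)
window_sum = sum(slider.window)          # sums elements 1..3 of the buffer

values[2] = 30.0                         # a write to the base buffer...
window_sum_after_write = sum(slider.window)  # ...is visible through the window

slider.reset()
full_length = len(slider.window)
```

The difference from the original classes is that this sketch creates a new view object on every move instead of mutating an existing array in place, which is the part Cython no longer permits.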

The mutated array is then used to update “cached objects” in their calling class by updating the pandas side block manager details. I suspect that this is the source of the stats model issues mentioned above as the code is aggressively changing things underneath the eventual user-exposed objects.

The change that has broken things is that Cython now disallows replacing the guts of a numpy array (which seems fair!). My guess is that the performance gains come from both not thrashing memory and not falling back to the Python layer. The Cython docs say that when you do [] on a numpy array it falls back to Python (I assume because the inputs are too variable), which is probably the source of the major performance regressions.

I am not super clear on how the numpy nditer interface works, but it looks like it is focused on getting an iterator over single elements, or at least fixed steps through the array, whereas for this code we need iteration over variable-size windows.

It looks like the way to do this is with memoryviews ( https://cython.readthedocs.io/en/latest/src/userguide/numpy_tutorial.html#efficient-indexing-with-memoryviews ), but those seem to require knowing what the type is up front.

My suspicion is that the right solution here is to do something like what @mattip suggested above and, in def move, use the pointers we have to the underlying data to fabricate new numpy arrays of just the sub-section that is needed.
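At the Python level, the idea of handing out a fresh array for each sub-section can be sketched with plain slicing, which returns a view on the same memory rather than a copy. This is only an illustration under assumptions: apply_over_windows, starts and ends are hypothetical names, and the loop runs in Python, so it says nothing about the performance a Cython implementation would need.

```python
import numpy as np


def apply_over_windows(values, starts, ends, func):
    """Apply func to values[start:end] for each window (hypothetical helper)."""
    out = np.empty(len(starts), dtype=np.float64)
    for i, (start, end) in enumerate(zip(starts, ends)):
        # Basic slicing builds a new array object that views the same memory.
        window = values[start:end]
        out[i] = func(window)
    return out


values = np.arange(10, dtype=np.float64)

# Variable-size windows: lengths 2, 3 and 5.
starts = np.array([0, 2, 5])
ends = np.array([2, 5, 10])

sums = apply_over_windows(values, starts, ends, np.sum)

# Confirm that a window shares memory with the original array.
window_is_view = np.shares_memory(values, values[2:5])
```

Each window here is a new array object, so nothing underneath an existing, user-visible array is changed.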

These classes appear to be used only internally by the reduction module, so I do not think there are any back-compatibility concerns with completely rewriting them.

attn @scoder for guidance on which of these methods (or one I do not see) is the best path.