bug: invalid values returned by .first().over(w) or .last().over(w) when using the pandas backend
Minimal reproducer:
import ibis
import pandas as pd

df = pd.DataFrame(
    {
        "g": ["a", "a", "a", "a", "a"],
        "x": [0, 1, 2, 3, 4],
        "y": [3, 2, 0, 1, 1],
    }
)
df.to_parquet("test.parquet")

t_pandas = ibis.pandas.connect({"t": df}).table("t")
t_duckdb = ibis.duckdb.connect().register("test.parquet", table_name="t")


def simple_window_ops(t):
    # Row-based window: the previous row and the current row, per group
    w = ibis.window(
        group_by=t.g,
        order_by=[t.x, t.y],
        preceding=1,
        following=0,
    )
    return t.mutate(
        x_first=t.x.first().over(w),
        x_last=t.x.last().over(w),
        y_first=t.y.first().over(w),
        y_last=t.y.last().over(w),
    )
With this window, the pandas backend does not seem to take the preceding and following window boundaries into account:
>>> print(simple_window_ops(t_pandas).execute())
   g  x  y  x_first  x_last  y_first  y_last
0  a  0  3        0       4        3       1
1  a  1  2        0       4        3       1
2  a  2  0        0       4        3       1
3  a  3  1        0       4        3       1
4  a  4  1        0       4        3       1
while duckdb works as expected:
>>> print(simple_window_ops(t_duckdb).execute())
   g  x  y  x_first  x_last  y_first  y_last
0  a  0  3        0       0        3       3
1  a  1  2        0       1        3       2
2  a  2  0        1       2        2       0
3  a  3  1        2       3        0       1
4  a  4  1        3       4        1       1
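As a cross-check that does not depend on ibis, the expected values can be sketched in plain pandas. This assumes the window is equivalent to ROWS BETWEEN 1 PRECEDING AND CURRENT ROW within each group, ordered by (x, y). The helper name `expected_window_ops` and its `preceding` parameter are hypothetical; this is only an illustration of the expected semantics, not how either backend is implemented.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "g": ["a", "a", "a", "a", "a"],
        "x": [0, 1, 2, 3, 4],
        "y": [3, 2, 0, 1, 1],
    }
)


def expected_window_ops(df, preceding=1):
    """Hypothetical helper: first/last over a row-based window per group.

    Assumes a frame of `preceding` rows before the current row plus the
    current row itself (following=0), ordered by (x, y) within each g.
    """
    out = df.sort_values(["g", "x", "y"]).copy()
    for col in ["x", "y"]:
        # min_periods=1 lets the window shrink at the start of each group
        rolled = out.groupby("g")[col].rolling(window=preceding + 1, min_periods=1)
        first = rolled.apply(lambda w: w[0], raw=True)
        last = rolled.apply(lambda w: w[-1], raw=True)
        # Drop the group level so the result aligns with the original index
        out[f"{col}_first"] = first.reset_index(level=0, drop=True).astype(out[col].dtype)
        out[f"{col}_last"] = last.reset_index(level=0, drop=True).astype(out[col].dtype)
    return out


result = expected_window_ops(df)
print(result)
```

For this input the sketch yields the same x_first, x_last, y_first and y_last columns as the duckdb output above.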
Issue Analytics
- Created a year ago
- Comments:8 (7 by maintainers)
Top GitHub Comments
Thanks for reporting this, @ogrisel. This is definitely two bugs. We’ll try to get pandas to return valid results, or at the very least, raise a meaningful error if it can’t.
BTW, what about the second problem documented in https://github.com/ibis-project/ibis/issues/4676#issuecomment-1283756388?
Should I open a dedicated issue, or are both problems likely to be solved by the same PR?