[QUESTION] Pandas mutability causes different results compared to Spark and DuckDB
The following code:
import pandas as pd
df1 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [ 2, 3, 4]})
df2 = pd.DataFrame({'col1': [1, 2, 3], 'col4': [11, 12, 13]})
# schema: *, col3:int
def make_new_col(df: pd.DataFrame) -> pd.DataFrame:
    ''''''
    df['col3'] = df['col1'] + df['col2']
    return df
from fugue_sql import fsql
res = fsql(
'''
transformed = TRANSFORM df1 USING make_new_col
YIELD DATAFRAME AS partial_result
SELECT transformed.*, df2.col4
FROM transformed
INNER JOIN df2 ON transformed.col1 = df2.col1
YIELD DATAFRAME AS result
'''
).run('duckdb')
works the same way using DuckDB, pandas, or Spark as the engine, returning:
res['partial_result'].as_pandas()
col1 | col2 | col3 |
---|---|---|
1 | 2 | 3 |
2 | 3 | 5 |
3 | 4 | 7 |
res['result'].as_pandas()
col1 | col2 | col3 | col4 |
---|---|---|---|
1 | 2 | 3 | 11 |
2 | 3 | 5 | 12 |
3 | 4 | 7 | 13 |
But if I change the first line of the SQL from:
transformed = TRANSFORM df1 USING make_new_col
To:
transformed = SELECT * FROM df1 WHERE 1=1 TRANSFORM USING make_new_col
I get two different results, one for pandas and another for DuckDB and Spark. With pandas the results remain the same as above, while for the other engines res['partial_result'] is still the same but res['result'] is different:
col1 | col2 | col4 |
---|---|---|
1 | 2 | 11 |
2 | 3 | 12 |
3 | 4 | 13 |
It seems that in the JOIN operation, transformed was missing the col3 generated by the make_new_col function.
After adding PRINT transformed after the first yield (YIELD DATAFRAME AS partial_result), I see that, for both pandas and Spark/DuckDB, transformed does not contain the new col3.
I don’t understand two things at this point:
- Why does transformed not contain col3? What is wrong with transformed = SELECT * FROM df1 WHERE 1=1 TRANSFORM USING make_new_col?
- If (for pandas) transformed does not contain col3, how is it possible that after the JOIN I obtain a result that also has col3?
Top GitHub Comments
@goodwanghan I understand that it is a difficult choice.
(Assuming that deepcopy can be done conditionally, only when Pandas has been chosen as the engine)
I would probably go with deepcopy, keeping in mind that: it becomes a problem with large datasets, but at that point, it would probably not make sense to use pandas as the engine anyway.
I understand that you place emphasis on performance, but on the other hand, Fugue is an interface, and perhaps it is even more important that it be 100% consistent across all engines…
@lukeb88 I think it is very well said, and it is also aligned with Fugue’s priority: consistency is more important than performance. I will create a PR to make the change, or if you are interested you can create the first PR for Fugue 😃
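The trade-off discussed in these comments can be sketched in plain pandas. This is an illustration only, not the change that was made in Fugue: call_with_copy is a hypothetical helper that deep-copies the input before calling a mutating function, and make_new_col_pure is a hypothetical user-side variant that avoids mutation with DataFrame.assign.

```python
import pandas as pd


def make_new_col(df: pd.DataFrame) -> pd.DataFrame:
    # Mutating version from the question: writes col3 into its argument.
    df['col3'] = df['col1'] + df['col2']
    return df


def call_with_copy(func, df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical helper (not a Fugue API): protect the caller's frame
    # by passing a deep copy, at the cost of extra memory and time.
    return func(df.copy(deep=True))


def make_new_col_pure(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical non-mutating variant: assign returns a new DataFrame.
    return df.assign(col3=df['col1'] + df['col2'])


df1 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [2, 3, 4]})
safe = call_with_copy(make_new_col, df1)
pure = make_new_col_pure(df1)

print(list(df1.columns))    # ['col1', 'col2']
print(list(safe.columns))   # ['col1', 'col2', 'col3']
print(safe.equals(pure))    # True
```

The copy makes the result independent of whether the function mutates its input, which is the consistency argument above. Its cost grows with the size of the data, which is the performance concern.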