[QUESTION] Pandas mutability causes different results compared to Spark and DuckDB

The following code:

import pandas as pd
df1 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [ 2,  3,  4]})
df2 = pd.DataFrame({'col1': [1, 2, 3], 'col4': [11, 12, 13]})

# schema: *, col3:int
def make_new_col(df: pd.DataFrame) -> pd.DataFrame:
    ''''''
    df['col3'] = df['col1'] + df['col2']
    return df 

from fugue_sql import fsql
res = fsql(
    '''
    transformed = TRANSFORM df1 USING make_new_col
    YIELD DATAFRAME AS partial_result

    SELECT transformed.*, df2.col4
    FROM transformed
    INNER JOIN df2 ON transformed.col1 = df2.col1
    YIELD DATAFRAME AS result
    '''
).run('duckdb')

works the same way with DuckDB, pandas, or Spark as the engine, returning:

res['partial_result'].as_pandas()

col1	col2	col3
1	2	3
2	3	5
3	4	7

res['result'].as_pandas()

col1	col2	col3	col4
1	2	3	11
2	3	5	12
3	4	7	13

But if I change the first line of the SQL from:

transformed = TRANSFORM df1 USING make_new_col

to:

transformed = SELECT * FROM df1 WHERE 1=1 TRANSFORM USING make_new_col

I get two different results, one for pandas and another for DuckDB and Spark. With pandas the results remain the same as above, while for the other engines res['partial_result'] is still the same but res['result'] is different:

col1	col2	col4
1	2	11
2	3	12
3	4	13

It seems that in the JOIN operation, transformed was missing the col3 generated by the make_new_col function.

Adding a PRINT transformed after the first yield (YIELD DATAFRAME AS partial_result), I see that, for both pandas and Spark/DuckDB, transformed does not contain the new col3.

I don’t understand 2 things at this point:

1. Why does transformed not contain col3? What is wrong with transformed = SELECT * FROM df1 WHERE 1=1 TRANSFORM USING make_new_col?
2. If (for pandas) transformed does not contain col3, how is it possible that after the JOIN I obtain a result that also has col3?
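The second question is consistent with plain pandas in-place mutation, as the issue title suggests. The sketch below is independent of Fugue and only illustrates the pandas behavior: it assumes that the pandas engine hands the transformer the original DataFrame object rather than a copy, which is not verified against Fugue's source here. `make_new_col_pure` is a hypothetical non-mutating variant added for comparison.

```python
import pandas as pd


def make_new_col(df: pd.DataFrame) -> pd.DataFrame:
    # Same body as in the question: writes col3 into the frame it receives.
    df['col3'] = df['col1'] + df['col2']
    return df


def make_new_col_pure(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical variant: assign() builds a new frame and leaves the input alone.
    return df.assign(col3=df['col1'] + df['col2'])


df1 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [2, 3, 4]})

# Non-mutating call first: df1 keeps its two original columns.
out_pure = make_new_col_pure(df1)
cols_after_pure = list(df1.columns)

# Mutating call: the caller's df1 gains col3, and the return value is df1 itself.
out = make_new_col(df1)
cols_after_mutating = list(df1.columns)
same_object = out is df1

print(cols_after_pure)
print(cols_after_mutating)
print(same_object)
```

If df1 is mutated this way, any later step that reads df1 directly would see col3 even when the intermediate transformed result does not carry it. Engines that convert the input into their own storage before calling the function would not show this side effect.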

Top GitHub Comments

lukeb88 commented, Sep 8, 2022

@goodwanghan I understand that it is a difficult choice.

(Assuming that deepcopy can be done conditionally, only when Pandas has been chosen as the engine)

I would probably go with deepcopy, keeping in mind that: it becomes a problem with large datasets, but at that point, it would probably not make sense to use pandas as the engine anyway.

I understand that you place emphasis on performance, but on the other hand, Fugue is an interface, and perhaps it’s even more important that it is 100% consistent across all engines…

goodwanghan commented, Sep 9, 2022

@lukeb88 I think it is very well said, and it also aligns with Fugue’s priority: consistency is more important than performance. I will create a PR to make the change, or if you are interested you can create the first PR for Fugue 😃
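The copy-before-transform idea discussed in these comments can be sketched in plain pandas. This is only an illustration under assumptions, not the change made in Fugue: `with_defensive_copy` is a hypothetical helper, and it uses `DataFrame.copy(deep=True)` as the copy mechanism.

```python
import functools
from typing import Callable

import pandas as pd

FrameFunc = Callable[[pd.DataFrame], pd.DataFrame]


def with_defensive_copy(func: FrameFunc) -> FrameFunc:
    # Hypothetical wrapper: the wrapped function only ever sees a deep copy,
    # so in-place edits inside it cannot reach the caller's frame.
    @functools.wraps(func)
    def wrapper(df: pd.DataFrame) -> pd.DataFrame:
        return func(df.copy(deep=True))

    return wrapper


def make_new_col(df: pd.DataFrame) -> pd.DataFrame:
    # Mutating transformer from the question.
    df['col3'] = df['col1'] + df['col2']
    return df


df1 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [2, 3, 4]})
safe_make_new_col = with_defensive_copy(make_new_col)
result = safe_make_new_col(df1)

print(list(df1.columns))
print(list(result.columns))
```

The trade-off is the one raised above: every call pays for a full copy of the input, which matters mostly for large frames.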
