
[QUESTION] Pandas mutability causes different results compared to Spark and DuckDB


The following code:

import pandas as pd
df1 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [ 2,  3,  4]})
df2 = pd.DataFrame({'col1': [1, 2, 3], 'col4': [11, 12, 13]})

# schema: *, col3:int
def make_new_col(df: pd.DataFrame) -> pd.DataFrame:
    df['col3'] = df['col1'] + df['col2']
    return df

from fugue_sql import fsql
res = fsql(
    '''
    transformed = TRANSFORM df1 USING make_new_col
    YIELD DATAFRAME AS partial_result

    SELECT transformed.*, df2.col4
    FROM transformed
    INNER JOIN df2 ON transformed.col1 = df2.col1
    YIELD DATAFRAME AS result
    '''
).run('duckdb')

behaves the same way whether DuckDB, pandas, or Spark is used as the engine, returning:

res['partial_result'].as_pandas()

    col1  col2  col3
       1     2     3
       2     3     5
       3     4     7

res['result'].as_pandas()

    col1  col2  col3  col4
       1     2     3    11
       2     3     5    12
       3     4     7    13
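The underlying behavior can be reproduced with plain pandas, outside of Fugue: a transformer that assigns a new column and returns its input is mutating the caller's object in place. A minimal sketch:

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [2, 3, 4]})

def make_new_col(df: pd.DataFrame) -> pd.DataFrame:
    # Column assignment mutates df in place; the return value is the
    # same object the caller passed in, not a copy.
    df['col3'] = df['col1'] + df['col2']
    return df

out = make_new_col(df1)
print(out is df1)             # True: no copy was made
print('col3' in df1.columns)  # True: the original frame was modified
```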

But if I change the first line of the SQL from:

transformed = TRANSFORM df1 USING make_new_col

to:

transformed = SELECT * FROM df1 WHERE 1=1 TRANSFORM USING make_new_col

I obtain two different results: one for pandas and another for DuckDB and Spark. With pandas the results remain the same as above, while for the other engines res['partial_result'] stays the same but res['result'] is different:

    col1  col2  col4
       1     2    11
       2     3    12
       3     4    13

It seems that in the JOIN operation, transformed was missing the col3 generated by the make_new_col function.

Adding a PRINT transformed after the first yield (YIELD DATAFRAME AS partial_result), I see that, for pandas as well as Spark and DuckDB, transformed does not contain the new col3.

I don’t understand two things at this point:

  1. Why does transformed not contain col3? What is wrong with transformed = SELECT * FROM df1 WHERE 1=1 TRANSFORM USING make_new_col?
  2. If (for pandas) transformed does not contain col3, how is it possible that after the JOIN I obtain a result that also includes col3?
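A plausible explanation for point 2, sketched with plain pandas (assuming the pandas engine hands the same object around rather than serializing it the way Spark and DuckDB do):

```python
import pandas as pd

original = pd.DataFrame({'col1': [1, 2, 3], 'col2': [2, 3, 4]})
transformed = original  # assignment copies the reference, not the data

transformed['col3'] = transformed['col1'] + transformed['col2']

# The mutation is visible through every reference to the object, which is
# why a later JOIN can "see" col3 even though the declared schema of
# transformed does not include it.
print('col3' in original.columns)  # True
```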

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 9

Top GitHub Comments

1 reaction
lukeb88 commented, Sep 8, 2022

@goodwanghan I understand that it is a difficult choice.

(Assuming that deepcopy can be done conditionally, only when Pandas has been chosen as the engine)

I would probably go with deepcopy, keeping in mind that it becomes a problem with large datasets; but at that point, it would probably not make sense to use pandas as the engine anyway.

I understand that you place emphasis on performance, but on the other hand, Fugue is an interface, and perhaps it’s even more important that it is 100% consistent across all engines…
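One way to sidestep the trade-off on the user side (a sketch, not Fugue's implementation) is to write the transformer so it never mutates its input:

```python
import pandas as pd

# schema: *, col3:int
def make_new_col(df: pd.DataFrame) -> pd.DataFrame:
    # assign() returns a new DataFrame and leaves the input untouched,
    # so the function behaves identically on every engine.
    return df.assign(col3=df['col1'] + df['col2'])

df1 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [2, 3, 4]})
out = make_new_col(df1)
print('col3' in df1.columns)  # False: the original is unchanged
```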

0 reactions
goodwanghan commented, Sep 9, 2022

@lukeb88 I think it is very well said, and it also aligns with Fugue’s priority: consistency is more important than performance. I will create a PR to make the change, or if you are interested you can create the first PR for Fugue 😃
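The engine-level fix being discussed could look roughly like the following sketch. Note that run_transformer is a hypothetical wrapper, not a Fugue API; the idea is to copy only on the pandas engine, where frames are passed by reference, since Spark and DuckDB already serialize data across the boundary:

```python
import copy
import pandas as pd

def run_transformer(func, df: pd.DataFrame, engine: str) -> pd.DataFrame:
    # Hypothetical wrapper: deep-copy the input only for the pandas
    # engine, so an in-place mutation inside func cannot leak back
    # into frames the engine still holds references to.
    if engine == 'pandas':
        df = copy.deepcopy(df)
    return func(df)

def make_new_col(df: pd.DataFrame) -> pd.DataFrame:
    df['col3'] = df['col1'] + df['col2']  # still mutates its argument
    return df

df1 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [2, 3, 4]})
out = run_transformer(make_new_col, df1, engine='pandas')
print('col3' in df1.columns)  # False: the caller's frame is protected
print('col3' in out.columns)  # True
```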
