Use encapsulation instead of monkey patching / inheritance
After spending some time thinking about the issue, I think it would be better for the project to create its own entry point and its own DataFrame / Series class, rather than doing it through monkey patching or inheritance to change the existing Spark DataFrame API.
The main reason I am leaning this way now is that Spark and Pandas DataFrames already have quite a few functions with the same name but different semantics, e.g. sample, head. Keeping existing Spark behavior does not accomplish the goal of the project, while changing existing Spark behavior based on an import statement is a Python anti-pattern. Doing it through encapsulation also avoids the issue of Koalas polluting the internal state of Spark’s DataFrame through monkey patching.
So here’s an alternative design:
- koalas: parallel to the pandas namespace. Includes I/O functions (read_csv, read_parquet) as well as general functions (e.g. concat, get_dummies). Functions here only work if there is an active SparkSession.
- koalas.DataFrame: a completely new interface based on the pandas DataFrame API. A Koalas DataFrame wraps a Spark DataFrame along with additional internal state.
- koalas.Series: based on the pandas Series API.
The only thing I’d monkey patch into Spark’s DataFrame is a toKoalas method, which turns an existing Spark DataFrame into a Koalas DataFrame.
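As a rough pure-Python sketch of this design (all class and method names here are stand-ins for illustration, not the actual PySpark or Koalas API), the wrapper holds a Spark DataFrame internally and exposes pandas semantics, while the single monkey patch only adds a conversion method:

```python
# Hypothetical sketch of the encapsulation design. "SparkDataFrame" is a
# stand-in for pyspark.sql.DataFrame; KoalasDataFrame / to_koalas are
# illustrative names, not the real API.

class SparkDataFrame:
    """Stand-in for pyspark.sql.DataFrame."""
    def __init__(self, rows):
        self._rows = rows

    def head(self, n=1):
        # Spark-like semantics: head() with no argument returns one row.
        return self._rows[0] if n == 1 else self._rows[:n]


class KoalasDataFrame:
    """Wraps a Spark DataFrame instead of inheriting from it."""
    def __init__(self, sdf):
        self._sdf = sdf  # internal Spark DataFrame, plus any extra state

    def head(self, n=5):
        # Pandas semantics: head() defaults to the first 5 rows.
        return self._sdf._rows[:n]


# The one monkey patch: a conversion method on the Spark class.
def to_koalas(self):
    return KoalasDataFrame(self)

SparkDataFrame.to_koalas = to_koalas

sdf = SparkDataFrame(list(range(10)))
kdf = sdf.to_koalas()
sdf.head()  # Spark behavior is untouched: a single row
kdf.head()  # pandas behavior lives only on the wrapper: five rows
```

Because the wrapper does not subclass the Spark class, neither API can accidentally shadow the other, and the conversion point makes the semantic switch explicit.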
While this design requires more code, we get the following benefits:
- It is very clear when users get Pandas behavior vs. Spark behavior.
- Documentation also becomes clearer, as we only need to document Koalas.
The main tradeoff is:
If we go with encapsulation, there is a separate class hierarchy in addition to the existing Spark one, and the new hierarchy strictly follows Pandas semantics. The downside is that users cannot directly use a Spark DataFrame with Pandas code; they need to call “spark_ds.toKoalas”.
If we go with the current monkey patching approach, it’s the opposite of the above. Users can directly use a Spark DataFrame with Pandas code, but existing Spark code’s behavior might change as soon as they import the koalas package. For example, “head” now returns 5 rows, rather than 1 row; “columns” returns a Pandas Index, rather than a Python list; “sample” might also change in the future.
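The hazard described above can be shown with a minimal pure-Python sketch (the class and method names are hypothetical stand-ins, not the real PySpark API): once a module patches a method onto an existing class at import time, the change is global, and code that never opted in silently gets the new behavior.

```python
# Illustrates why patching methods onto an existing class on import is
# risky: the patch is global and affects code that never opted in.

class DataFrame:
    """Stand-in for pyspark.sql.DataFrame."""
    def __init__(self, rows):
        self._rows = rows

    def head(self, n=1):
        return self._rows[:n]


df = DataFrame([1, 2, 3, 4, 5, 6])
before = df.head()   # original single-row default

# Equivalent of what "import koalas" would do under monkey patching:
def pandas_style_head(self, n=5):
    return self._rows[:n]

DataFrame.head = pandas_style_head

after = df.head()    # the same call on the same object now returns 5 rows
```

The pre-existing `df` object changes behavior without being touched, which is exactly the kind of action-at-a-distance the encapsulation design avoids.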
Issue Analytics
- Created: 4 years ago
- Comments: 9 (7 by maintainers)
Top GitHub Comments
We have switched to the encapsulation approach already, but haven’t added the toKoalas method yet. Will add that in the next release.
BTW - to be more Pythonic, should it be called to_koalas, to_ks, or toKoalas?
I’m looking to port an existing Python library to use Koalas in place of Pandas, and pass it DataFrames created from the toKoalas method mentioned above. The encapsulation approach seems better for this use case. Are there any open issues on the toKoalas implementation?