Use encapsulation instead of monkey patching / inheritance
After spending some time thinking about the issue, I think it would be better for the project to create its own entry point and its own DataFrame / Series class, rather than doing it through monkey patching or inheritance to change the existing Spark DataFrame API.
The main reason I am leaning this way now is that Spark and Pandas DataFrames already have quite a few functions with the same name but different semantics, e.g. sample, head. Keeping existing Spark behavior does not accomplish the goal of the project, while changing existing Spark behavior based on an import statement is a Python anti-pattern. Doing it through encapsulation also avoids the issue of Koalas polluting the internal state of Spark’s DataFrame through monkey patching.
So here’s an alternative design:
- koalas: parallel to the pandas namespace. Includes I/O functions (read_csv, read_parquet) as well as general functions (e.g. concat, get_dummies). Functions here only work if there is an active SparkSession.
- koalas.DataFrame: a completely new interface based on the pandas DataFrame API. A Koalas DataFrame wraps a Spark DataFrame along with additional internal state.
- koalas.Series: based on the pandas Series API.
The only thing I’d monkey patch into Spark’s DataFrame is a toKoalas method, which turns an existing Spark DataFrame into a Koalas DataFrame.
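As a rough pure-Python sketch of this design (all class and method names here are stand-ins for illustration, not the actual PySpark or Koalas API), the wrapper holds a Spark DataFrame internally and exposes pandas semantics, while the single monkey patch only adds a conversion method:

```python
# Hypothetical sketch of the encapsulation design. "SparkDataFrame" is a
# stand-in for pyspark.sql.DataFrame; KoalasDataFrame / to_koalas are
# illustrative names, not the real API.

class SparkDataFrame:
    """Stand-in for pyspark.sql.DataFrame."""
    def __init__(self, rows):
        self._rows = rows

    def head(self, n=1):
        # Spark-like semantics: head() with no argument returns one row.
        return self._rows[0] if n == 1 else self._rows[:n]


class KoalasDataFrame:
    """Wraps a Spark DataFrame instead of inheriting from it."""
    def __init__(self, sdf):
        self._sdf = sdf  # internal Spark DataFrame, plus any extra state

    def head(self, n=5):
        # Pandas semantics: head() defaults to the first 5 rows.
        return self._sdf._rows[:n]


# The one monkey patch: a conversion method on the Spark class.
def to_koalas(self):
    return KoalasDataFrame(self)

SparkDataFrame.to_koalas = to_koalas

sdf = SparkDataFrame(list(range(10)))
kdf = sdf.to_koalas()
sdf.head()  # Spark behavior is untouched: a single row
kdf.head()  # pandas behavior lives only on the wrapper: five rows
```

Because the wrapper does not subclass the Spark class, neither API can accidentally shadow the other, and the conversion point makes the semantic switch explicit.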
While this design requires more code, we get the following benefits:
- It is very clear when users get Pandas behavior vs. Spark behavior.
- Documentation also becomes clearer, as we only need to document Koalas.
The main tradeoff is:
If we go with encapsulation, there is a separate class hierarchy in addition to the existing Spark one, and the new hierarchy strictly follows Pandas semantics. The downside is that users cannot directly use a Spark DataFrame with Pandas code; they need to call “spark_ds.toKoalas”.
If we go with the current monkey patching approach, it’s the opposite of the above. Users can directly use a Spark DataFrame with Pandas code, but existing Spark code’s behavior might change as soon as they import the koalas package. For example, “head” now returns 5 rows, rather than 1 row; “columns” returns a Pandas Index, rather than a Python list; “sample” might also change in the future.
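The hazard described above can be shown with a minimal pure-Python sketch (the class and method names are hypothetical stand-ins, not the real PySpark API): once a module patches a method onto an existing class at import time, the change is global, and code that never opted in silently gets the new behavior.

```python
# Illustrates why patching methods onto an existing class on import is
# risky: the patch is global and affects code that never opted in.

class DataFrame:
    """Stand-in for pyspark.sql.DataFrame."""
    def __init__(self, rows):
        self._rows = rows

    def head(self, n=1):
        return self._rows[:n]


df = DataFrame([1, 2, 3, 4, 5, 6])
before = df.head()   # original single-row default

# Equivalent of what "import koalas" would do under monkey patching:
def pandas_style_head(self, n=5):
    return self._rows[:n]

DataFrame.head = pandas_style_head

after = df.head()    # the same call on the same object now returns 5 rows
```

The pre-existing `df` object changes behavior without being touched, which is exactly the kind of action-at-a-distance the encapsulation design avoids.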
Issue Analytics
- Created: 4 years ago
- Comments: 9 (7 by maintainers)
Top GitHub Comments
We have switched to the encapsulation approach already, but haven’t added the toKoalas method yet. Will add that in the next release.
BTW - to be more Pythonic, should it be called to_koalas, to_ks, or toKoalas?
I’m looking to port an existing Python library to use Koalas in place of Pandas, and pass it DataFrames created from the toKoalas method mentioned above. The encapsulation approach seems better for this use case. Are there any open issues on the toKoalas implementation?