Use encapsulation instead of monkey patching / inheritance

After spending some time thinking about the issue, I think it would be better for the project to create its own entry point and its own DataFrame / Series classes, rather than changing the existing Spark DataFrame API through monkey patching or inheritance.

The main reason I am leaning this way now is that Spark and Pandas DataFrames already have quite a few functions with the same name but different semantics, e.g. sample and head. Keeping the existing Spark behavior does not accomplish the goal of the project, while changing existing Spark behavior based on an import statement is a Python anti-pattern. Doing it through encapsulation also avoids the issue of Koalas polluting the internal state of Spark’s DataFrame through monkey patching.

So here’s an alternative design:

koalas: parallel to pandas namespace. Includes I/O functions (read_csv, read_parquet) as well as general functions (e.g. concat, get_dummies). Functions here would only work if there is an active SparkSession.

koalas.DataFrame: A completely new interface based on Pandas’ DataFrame API. A koalas.DataFrame wraps a Spark DataFrame along with additional internal state.

koalas.Series: Based on Pandas’ Series API.

The only thing I’d monkey patch into Spark’s DataFrame is a toKoalas method, which turns an existing Spark DataFrame into a Koalas DataFrame.
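The encapsulation design above can be sketched in plain Python. This is only an illustration under stated assumptions: `SparkLikeFrame` and `KoalasLikeFrame` are hypothetical stand-ins written for this example, not the real Spark or Koalas classes, and a tuple stands in for a Pandas Index so the sketch has no dependencies.

```python
class SparkLikeFrame:
    """Hypothetical stand-in for a Spark DataFrame (not the real pyspark class)."""

    def __init__(self, rows, columns):
        self._rows = list(rows)
        self.columns = list(columns)  # Spark-style: columns is a plain list

    def head(self, n=None):
        # Spark-style: a single row by default, a list of rows when n is given.
        if n is None:
            return self._rows[0] if self._rows else None
        return self._rows[:n]


class KoalasLikeFrame:
    """Hypothetical wrapper exposing Pandas-style semantics."""

    def __init__(self, sdf):
        self._sdf = sdf  # the wrapped frame; its class keeps its own behavior

    @property
    def columns(self):
        return tuple(self._sdf.columns)  # stand-in for a Pandas Index

    def head(self, n=5):
        # Pandas-style: the first five rows by default, returned as a frame.
        return KoalasLikeFrame(SparkLikeFrame(self._sdf.head(n), self._sdf.columns))

    def __len__(self):
        return len(self._sdf._rows)


# The single patch the proposal allows: an explicit conversion method.
SparkLikeFrame.toKoalas = lambda self: KoalasLikeFrame(self)

sdf = SparkLikeFrame([(i, i * i) for i in range(10)], ["x", "x_squared"])
kdf = sdf.toKoalas()
print(sdf.head())       # Spark-style call still returns one row
print(len(kdf.head()))  # Pandas-style call returns a five-row frame
```

Because the wrapper holds a reference instead of altering the wrapped class, code that only ever sees `sdf` keeps its original behavior, and the Pandas-style behavior is opt-in through the explicit conversion.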

While this change will bring more code, we will get the following benefits:

  • It is very clear when users get Pandas behavior versus Spark behavior.
  • Documentation also becomes clearer, as we only need to document Koalas.

The main tradeoff is:

If we go with encapsulation, there is a separate class hierarchy in addition to the existing Spark one, and the new hierarchy strictly follows Pandas semantics. The downside is that users cannot directly use a Spark DataFrame with Pandas code, and they need to call “spark_ds.toKoalas”.

If we go with the current monkey patching approach, it’s the opposite of the above. Users can directly use a Spark DataFrame with Pandas code, but the behavior of existing Spark code might change as soon as they import the koalas package. For example, “head” now returns 5 rows, rather than 1 row. “columns” returns a Pandas Index, rather than a Python list. “sample” might also change in the future.
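The hazard of the monkey patching approach can be shown with a small, dependency-free sketch. The class and functions below are hypothetical stand-ins invented for this example; the patch is applied explicitly here, whereas in the scenario described above it would happen as a side effect of an import.

```python
class SparkLikeFrame:
    """Hypothetical stand-in for a Spark DataFrame (not the real pyspark class)."""

    def __init__(self, rows):
        self._rows = list(rows)

    def head(self, n=None):
        # Spark-style: a single row by default, a list of rows when n is given.
        if n is None:
            return self._rows[0] if self._rows else None
        return self._rows[:n]


def first_row_id(frame):
    # Existing "Spark" code that relies on head() returning a single row.
    return frame.head()[0]


frame = SparkLikeFrame([(7, "a"), (8, "b"), (9, "c")])
before = first_row_id(frame)


def pandas_style_head(self, n=5):
    # Pandas-style default: up to five rows, returned as a collection.
    return self._rows[:n]


# Replacing the method on the class changes every existing instance and caller.
SparkLikeFrame.head = pandas_style_head
after = first_row_id(frame)

print(before)  # the id of the first row
print(after)   # now the whole first row, because head() returns a list
```

The caller `first_row_id` was not edited, yet its result changes from an id to a whole row once the patch is in place, which is the kind of silent behavior change the issue warns about.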

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 9 (7 by maintainers)

Top GitHub Comments

1 reaction
rxin commented, Apr 25, 2019

We have switched to the encapsulation approach already, but haven’t added the toKoalas method yet. Will add that in the next release.

BTW - should it be more pythonic and called to_koalas, to_ks, or toKoalas?

1 reaction
bgweber commented, Apr 25, 2019

I’m looking to port an existing Python library to use Koalas in place of Pandas, and pass it DataFrames created from the toKoalas method mentioned above. The encapsulation approach seems better for this use case. Are there any open issues on the toKoalas implementation?
