Koalas DataFrame should only have Pandas corresponding APIs (not Spark APIs)
This is partially related to https://github.com/databricks/koalas/issues/119
I was thinking the Koalas DataFrame should strictly have pandas-corresponding APIs, although there might be a few exceptions for strong reasons.
This means the Koalas DataFrame should not have Spark DataFrame-specific APIs like explain() or selectExpr(). My current thought is that we have an API like koalas_df.to_spark() (borrowed from @ueshin's idea via offline discussion) so that users can still use Spark APIs.
Koalas API usages
koalas_df.loc(...)
koalas_df.drop(...) # works as pandas API
Spark API usages
koalas_df.to_spark().explain()
koalas_df.to_spark().selectExpr()
koalas_df.to_spark().drop(...) # works as Spark API
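To illustrate why letting one DataFrame expose both API families is confusing, here is a minimal pandas-only sketch (pyspark is deliberately not assumed): pandas drop() requires the columns keyword or an axis, whereas Spark's drop() takes column names positionally, so the same call shape means different things.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# pandas semantics: drop needs the `columns` keyword (or axis=1);
# a bare positional label is interpreted as an index label on axis 0.
dropped = df.drop(columns=["b"])
print(list(dropped.columns))  # ['a']

# Spark semantics, for contrast (not runnable here without pyspark):
#   spark_df.drop("b")  -- column names are passed positionally.
# Separating the two behind to_spark() avoids this ambiguity.
```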
This clearly defines what users can expect from Koalas APIs versus Spark APIs.
Issue Analytics
- Created 4 years ago
- Comments:8 (7 by maintainers)
I think users should be able to pick what they need, in particular things like .cache() or .repartition(), which can be necessary. Doing
koala_df.to_spark().cache().to_koalas()
loses the index and metadata, so it is a no-go. Other users should also chime in, though.
Closing this as it will be part of the design philosophy doc.