Koalas DataFrame should only have Pandas corresponding APIs (not Spark APIs)
This is partially related to https://github.com/databricks/koalas/issues/119
I was thinking the Koalas DataFrame should strictly have pandas-corresponding APIs, although there might be a few exceptions for strong reasons.
This means the Koalas DataFrame should not have Spark DataFrame-specific APIs like explain() or selectExpr(). My current thought is that we have an API like koalas_df.to_spark() (borrowed from @ueshin's idea via offline discussion) so that users can still use Spark APIs.
Koalas API usages
koalas_df.loc(...)
koalas_df.drop(...) # works as pandas API
Spark API usages
koalas_df.to_spark().explain()
koalas_df.to_spark().selectExpr()
koalas_df.to_spark().drop(...) # works as Spark API
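To illustrate why letting one DataFrame expose both API families is confusing, here is a minimal pandas-only sketch (pyspark is deliberately not assumed): pandas drop() requires the columns keyword or an axis, whereas Spark's drop() takes column names positionally, so the same call shape means different things.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# pandas semantics: drop needs the `columns` keyword (or axis=1);
# a bare positional label is interpreted as an index label on axis 0.
dropped = df.drop(columns=["b"])
print(list(dropped.columns))  # ['a']

# Spark semantics, for contrast (not runnable here without pyspark):
#   spark_df.drop("b")  -- column names are passed positionally.
# Separating the two behind to_spark() avoids this ambiguity.
```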
This clearly defines what users can expect from Koalas APIs versus Spark APIs.
Issue Analytics
- Created 4 years ago
- Comments:8 (7 by maintainers)
I think users should be able to pick what they need, in particular things like .cache() or .repartition(), which can be necessary. Doing
koala_df.to_spark().cache().to_koalas()
loses the index and metadata, so it is a no-go. Other users should also chime in, though.
Closing this as it will be part of the design philosophy doc.