Document that we don't support the compatibility with non-Koalas APIs yet.

It seems that people want to convert their code directly from pandas to Koalas. One case I often observe is code that works together with Python built-in functions such as max and min, or with list/generator comprehensions, e.g.:

import pandas as pd
data = []
for a in pd.Series([1, 2, 3]):
    data.append(a)

pd.DataFrame(data)

In Koalas, such an example does not work. We should preemptively document this and guide users to stick to Koalas APIs only.

Issue Analytics

State:
Created 3 years ago
Comments:6 (6 by maintainers)

Top GitHub Comments

HyukjinKwon commented, Apr 12, 2020

I think we had better move them to Best Practice. We could phrase it, for example, as below. Feel free to reword or rephrase.

Title: Use Koalas APIs directly whenever possible

Contents: Although Koalas offers APIs similar to those of pandas, some usage patterns are not explicitly supported. For example, Python built-in functions such as min, max, and sum require the given argument to be iterable. Koalas does not implement __iter__() yet, to prevent users from collecting all data from the cluster onto the client (driver) side. See the example below:

>>> import pandas as pd
>>> max(pd.Series([1, 2, 3]))
3
>>> min(pd.Series([1, 2, 3]))
1
>>> sum(pd.Series([1, 2, 3]))
6

pandas datasets live locally and are iterable … blah blah …

>>> import databricks.koalas as ks
>>> ks.Series([1, 2, 3]).max()
3
>>> ks.Series([1, 2, 3]).min()
1
>>> ks.Series([1, 2, 3]).sum()
6

Koalas performs it in a distributed manner… blah blah
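To make the point about __iter__() concrete, here is a minimal pure-Python sketch. DistributedSeries is a hypothetical stand-in invented for illustration, not the Koalas class, and the "partitions" are plain lists rather than data on a cluster. It only shows why a built-in such as max fails on an object that does not implement __iter__(), while a method on the object itself can still compute the result.

```python
class DistributedSeries:
    """Hypothetical stand-in for a cluster-backed series (not the Koalas class)."""

    def __init__(self, partitions):
        # Each inner list plays the role of one partition held by a worker.
        self._partitions = partitions

    # Deliberately no __iter__ (and no __getitem__): iterating would pull
    # every element to the client side.

    def max(self):
        # Reduce each partition separately, then combine the partial results.
        return max(max(part) for part in self._partitions)


series = DistributedSeries([[1, 2], [3]])

try:
    max(series)  # the built-in needs an iterable argument
    builtin_error = None
except TypeError as exc:
    builtin_error = exc

print(type(builtin_error).__name__)
print(series.max())
```

The built-in call raises TypeError because the object is not iterable, whereas the method works on the partitions and only returns the final value.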

… Another common pattern from pandas users is to rely on list or generator comprehensions …:

>>> import pandas as pd
>>> data = []
>>> pser = pd.Series([1, 2, 3])
>>> for v in pser:
...     data.append(v + 1)
>>> pd.Series(data)

In Koalas, you can do it via:

>>> import databricks.koalas as ks
>>> kser = ks.Series([1, 2, 3])
>>> kser + 1  # or kser.apply example? or kdf.map_in_pandas example?
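As a rough check of why the rewrite is safe, the sketch below uses pandas only (running Koalas needs a Spark session, so it is left out here) and compares the element-by-element loop with the vectorized expression. The variable names looped and vectorized are just illustrative.

```python
import pandas as pd

pser = pd.Series([1, 2, 3])

# Element-by-element loop: relies on the Series being iterable.
data = []
for v in pser:
    data.append(v + 1)
looped = pd.Series(data)

# Vectorized expression: no iteration over the Series in user code.
vectorized = pser + 1

# Both approaches should hold the same values.
print(looped.equals(vectorized))
```

Because the vectorized form never iterates over the Series in user code, it is the form that carries over to kser + 1 without depending on __iter__().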

NumPy universal functions are supported and can be used naturally in most cases. Support was added in https://github.com/databricks/koalas/pull/1096, https://github.com/databricks/koalas/pull/1106, and https://github.com/databricks/koalas/pull/1128, FYI.

Using to_numpy should still be discouraged and treated as a last resort.

beobest2 commented, Apr 12, 2020

okay~ I’ll open a PR. thank you
