Document that we don't support the compatibility with non-Koalas APIs yet.

It seems that people want to convert their code directly from pandas to Koalas. One case I often observe is that they want to convert code that works together with Python built-in functions such as max and min, or with list/generator comprehensions, e.g.:

import pandas as pd
data = []
for a in pd.Series([1, 2, 3]):
    data.append(a)

pd.DataFrame(data)

In Koalas, such an example does not work. We should preemptively document this and guide users to stick to Koalas APIs only.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

HyukjinKwon commented, Apr 12, 2020

I think we had better move them to Best Practice. We could phrase it, for example, as below. Feel free to reword or rephrase.

Title: Use Koalas APIs directly whenever possible

Contents: While Koalas has APIs similar to those of pandas, some APIs are not explicitly supported. For example, Python built-in functions such as min, max, etc. require the given argument to be iterable. Koalas does not implement __iter__() yet, in order to prevent users from collecting all data from the cluster onto the client (driver) side. See the example below:

>>> import pandas as pd
>>> max(pd.Series([1, 2, 3]))
3
>>> min(pd.Series([1, 2, 3]))
1
>>> sum(pd.Series([1, 2, 3]))
6

A pandas dataset lives locally and is iterable … blah blah …

>>> import databricks.koalas as ks
>>> ks.Series([1, 2, 3]).max()
3
>>> ks.Series([1, 2, 3]).min()
1
>>> ks.Series([1, 2, 3]).sum()
6

Koalas performs it in a distributed manner … blah blah
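
To make the reasoning about __iter__() concrete, here is a minimal sketch in plain Python. The class ColumnWithoutIter is a hypothetical stand-in written for this illustration; it is not a Koalas class and does not reflect how Koalas is implemented. It only shows that a built-in such as max needs an iterable argument, while a method on the object can compute the result however the library chooses.

```python
class ColumnWithoutIter:
    """Hypothetical stand-in for a distributed column; NOT a Koalas class."""

    def __init__(self, values):
        # Assumption for the sketch: values are kept locally. In a real
        # distributed engine they would live on the cluster.
        self._values = list(values)

    def max(self):
        # The method form leaves the "how" to the library.
        return max(self._values)


col = ColumnWithoutIter([1, 2, 3])

try:
    max(col)  # the built-in requires an iterable argument
    builtin_error = None
except TypeError as exc:
    builtin_error = exc

method_result = col.max()
print(type(builtin_error).__name__, method_result)
```

The built-in call fails with a TypeError because the object defines no __iter__(), whereas the method call succeeds. That is the situation the draft text describes.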

… Another common pattern from pandas users is to rely on list or generator comprehensions …:

>>> import pandas as pd
>>> data = []
>>> pser = pd.Series([1, 2, 3])
>>> for v in pser:
...     data.append(v + 1)
>>> pd.Series(data)

In Koalas, you can do it via:

>>> import databricks.koalas as ks
>>> kser = ks.Series([1, 2, 3])
>>> kser + 1  # or kser.apply example? or kdf.map_in_pandas example? 
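
As a rough check that the element-wise operator form is a drop-in for the loop, the sketch below compares the two in plain pandas, so it runs without a Spark cluster. The variable names looped and vectorized are made up for this illustration. The kser + 1 expression from the draft is the same operator applied to a Koalas Series, which is not exercised here.

```python
import pandas as pd

pser = pd.Series([1, 2, 3])

# Loop form from the pandas example: depends on iterating the Series.
looped = pd.Series([v + 1 for v in pser])

# Operator form: no Python-level iteration over the elements.
vectorized = pser + 1

# Both forms should hold the same values with the same index.
same = looped.equals(vectorized)
print(same)
```

Because the operator form never iterates over the Series in Python, it does not depend on __iter__() at all.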

NumPy universal functions are supported and can be used naturally in most cases. -> Support was added in https://github.com/databricks/koalas/pull/1096 https://github.com/databricks/koalas/pull/1106 https://github.com/databricks/koalas/pull/1128 FYI

Using to_numpy should still be discouraged and treated as a last resort.

beobest2 commented, Apr 12, 2020

okay~ I’ll open a PR. thank you
