Apply Spark Column operations directly on a Series
See original GitHub issue.
In issue #1492, it was noted that the following operation is possible on a column of a Koalas DataFrame:
>>> import databricks.koalas as ks
>>> df = ks.DataFrame(["example"], columns=["column"])
>>> from pyspark.sql import functions as F
>>> df["column"] = F.trim(F.upper(F.col("column")))
>>> df
    column
0  EXAMPLE
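For reference, the same transformation expressed directly against a plain Spark DataFrame is a withColumn call; this is only an illustrative sketch of the equivalent Spark-side operation, not a claim about what Koalas does internally:
>>> from pyspark.sql import SparkSession, functions as F
>>> spark = SparkSession.builder.getOrCreate()
>>> # Build a one-row Spark DataFrame and rewrite the column with trim(upper(...))
>>> sdf = spark.createDataFrame([("example",)], ["column"])
>>> sdf.withColumn("column", F.trim(F.upper(F.col("column")))).show()
+-------+
| column|
+-------+
|EXAMPLE|
+-------+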
By any chance, is there currently a way to apply Spark SQL/Column functions to a Koalas Series? I imagine it looking something like this:
>>> import databricks.koalas as ks
>>> import pyspark.sql.functions as F
>>> kss = ks.Series(["example"])
>>> kss.apply(F.trim(F.upper(kss.spark_column)))
0    EXAMPLE
Name: 0, dtype: object
Issue Analytics
- Created 3 years ago
- Comments: 14 (8 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
🎉 I opened a PR. The idea is basically to collect everything related to Spark itself into a .spark namespace, and this specific issue can be resolved through it (see the sketch below). Given the current pandas APIs such as Series.transform, Series.apply, DataFrame.apply, DataFrame.transform, etc., I named it Series.spark.transform and made the usage similar. It also shares the same limitation.
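As an illustration, a minimal sketch of how the original example could be written once the accessor is available, assuming Series.spark.transform takes a function from Spark Column to Spark Column; exact repr details may differ between Koalas versions:
>>> import databricks.koalas as ks
>>> from pyspark.sql import functions as F
>>> kss = ks.Series(["example"])
>>> # The function receives the Series' underlying Spark Column and must
>>> # return a Column of the same length.
>>> kss.spark.transform(lambda c: F.trim(F.upper(c)))
0    EXAMPLE
dtype: object
This keeps the computation entirely on the Spark side, so any pyspark.sql.functions expression that preserves the column length can be applied without converting the data to pandas.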