
Apply Spark Column operations directly on a Series

See original GitHub issue

In issue #1492, it was noted that the following operation is possible on a column of a Koalas DataFrame:

>>> import databricks.koalas as ks
>>> df = ks.DataFrame(["example"], columns=["column"])
>>> from pyspark.sql import functions as F
>>> df["column"] = F.trim(F.upper(F.col("column")))
>>> df
    column
0  EXAMPLE
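
For reference, the same transformation expressed directly in PySpark, without Koalas, looks like this (a minimal sketch; it assumes an active SparkSession):

>>> from pyspark.sql import SparkSession, functions as F
>>> spark = SparkSession.builder.getOrCreate()
>>> sdf = spark.createDataFrame([("example",)], ["column"])
>>> # withColumn replaces the column with the evaluated expression
>>> sdf = sdf.withColumn("column", F.trim(F.upper(F.col("column"))))
>>> sdf.show()
+-------+
| column|
+-------+
|EXAMPLE|
+-------+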

By any chance, is there currently a way to apply Spark SQL/Column functions to a Koalas Series? I imagine it looking something like this:

>>> import databricks.koalas as ks
>>> import pyspark.sql.functions as F
>>> kss = ks.Series(["example"])
>>> kss.apply(F.trim(F.upper(kss.spark_column)))
0    EXAMPLE
Name: 0, dtype: object

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 14 (8 by maintainers)

Top GitHub Comments

2 reactions
HyukjinKwon commented, May 24, 2020

🎉

>>> # `tokenize` is assumed to be a function from Spark Column to Spark
>>> # Column, presumably defined earlier in the thread (not shown here)
>>> kss.spark.transform(lambda s: tokenize(s))
0            [1, example]
1    [2, string, example]
Name: 0, dtype: object
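
The snippet above relies on a series and a tokenize function from earlier in the thread. A minimal sketch that reproduces the same output, assuming tokenize simply splits the column on whitespace:

>>> import databricks.koalas as ks
>>> from pyspark.sql import functions as F
>>> def tokenize(scol):
...     # split the string column on spaces, yielding an array column
...     return F.split(scol, " ")
...
>>> kss = ks.Series(["1 example", "2 string example"])
>>> kss.spark.transform(tokenize)
0            [1, example]
1    [2, string, example]
Name: 0, dtype: object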

2 reactions
HyukjinKwon commented, May 22, 2020

I opened a PR. The idea is basically to collect everything related to Spark itself under the .spark namespace. This specific issue could then be resolved via:

>>> import databricks.koalas as ks
>>> import pyspark.sql.functions as F
>>> kss = ks.Series(["example"])
>>> kss.spark.transform(lambda spark_column: F.trim(F.upper(spark_column)))
0    EXAMPLE
Name: 0, dtype: object

Given the existing pandas APIs such as Series.transform, Series.apply, DataFrame.apply, and DataFrame.transform, I named it Series.spark.transform and made the usage similar. It also shares the same limitation.
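
To illustrate the distinction (a sketch, not taken from the thread): Series.apply runs a Python function element-wise, while Series.spark.transform receives the underlying Spark Column, so the expression is evaluated natively by Spark without a Python round-trip:

>>> import databricks.koalas as ks
>>> import pyspark.sql.functions as F
>>> kss = ks.Series(["  example  "])
>>> kss.apply(lambda v: v.strip().upper())  # element-wise Python function
0    EXAMPLE
Name: 0, dtype: object
>>> kss.spark.transform(lambda scol: F.trim(F.upper(scol)))  # Spark Column expression
0    EXAMPLE
Name: 0, dtype: object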


Top Results From Across the Web

  • Operations on One Column - Spark for Data Scientists - GitBook: Select a subset of columns to show, use select("col1","col2". ... columnar means you can operate on columns only and directly with Spark native...
  • Spark DataFrame withColumn: Spark withColumn() is a DataFrame function that is used to add a new column to a DataFrame, change the value of an existing column,...
  • Value and column operations in scala spark, how to use a ...: Use a literal Column: import org.apache.spark.sql.functions.lit; lit(1) / col("col2").
  • Essential PySpark DataFrame Column Operations for Data ...: PySpark Column Operations play a key role in manipulating and displaying desired results of a PySpark DataFrame. Let's understand them here.
  • Column - Apache Spark: col("columnName.field") // Extracting a struct field; col("`a.column.with.dots`") // Escape `.` in column names ...
