df.apply(func_without_typehint, axis=1) is not running in parallel

I was hoping that each per-row operation would be executed in parallel. However, this code (and many other experiments I did, including logging to MLflow) shows that it is executed sequentially, one row after another.

import time
import databricks.koalas as ks

kdf = ks.DataFrame({"col": list(range(12))})

def do_job(value) -> float:
    time.sleep(5)  # simulate 5 seconds of work per row
    return value

kdf.apply(lambda x: do_job(x["col"]), axis=1)

This takes 60 seconds with more than 12 Spark workers. I was expecting every row to be processed in parallel, which would take around 5 seconds.

Am I doing something wrong?

My workaround:

kdf.to_spark().rdd.map(lambda x: do_job(x["col"])).collect()
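
The expectation in the question (12 rows at 5 seconds each: about 60 seconds sequentially, about 5 seconds with 12 parallel workers) can be illustrated without Spark. The following is only a sketch under stated assumptions: a thread pool stands in for the Spark workers, the sleep is scaled down to 0.05 seconds so it runs quickly, and `run_sequentially` and `run_in_parallel` are hypothetical helper names, not part of Koalas or PySpark.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Assumption: scaled down from the 5 s in the question so the sketch runs fast.
SLEEP_SECONDS = 0.05
ROWS = list(range(12))

def do_job(value: int) -> float:
    time.sleep(SLEEP_SECONDS)  # simulate per-row work
    return float(value)

def run_sequentially(rows):
    """Process rows one after another and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = [do_job(v) for v in rows]
    return results, time.perf_counter() - start

def run_in_parallel(rows, workers=12):
    """Process rows with a thread pool (a stand-in for Spark workers)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(do_job, rows))
    return results, time.perf_counter() - start

sequential_results, sequential_elapsed = run_sequentially(ROWS)
parallel_results, parallel_elapsed = run_in_parallel(ROWS)
print(f"sequential: {sequential_elapsed:.2f}s, parallel: {parallel_elapsed:.2f}s")
```

The sequential time grows with the number of rows, while the parallel time stays close to the duration of a single job as long as there are at least as many workers as rows.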

Issue Analytics

Created 4 years ago
Comments: 7 (6 by maintainers)

Top GitHub Comments

HyukjinKwon commented, Feb 14, 2020

Yes … we should definitely make this logic more visible and easier to use. Thanks for your feedback. I will think about how to make it easier to use and more visible.

HyukjinKwon commented, Feb 14, 2020

Can you try this?

import time
import databricks.koalas as ks

kdf = ks.DataFrame({"col": list(range(12))})

# The function passed to kdf.apply carries the return type hint itself.
def do_job(df) -> float:
    value = df["col"]
    time.sleep(5)
    return value

start = time.time()
kdf.apply(do_job, axis=1)
time.time() - start

When kdf.apply is given a function with a return type hint, it does not try the shortcut path, because it knows the return type right away from the hint.

It tried the shortcut in your case because the lambda, lambda x: do_job(x["col"]), had no type hint. The version above takes about 7~8 seconds on my local machine.
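
To see why the lambda counts as untyped even though do_job itself is annotated, here is a minimal sketch using only the standard library. It is not how Koalas implements the check; `has_return_type_hint` is a hypothetical helper introduced purely for illustration.

```python
import inspect

def do_job(row) -> float:
    # Annotated function, as in the suggested fix.
    return row["col"]

# The wrapper lambda from the question: a lambda cannot carry a
# return annotation, even though it calls an annotated function.
untyped = lambda x: do_job(x["col"])

def has_return_type_hint(func) -> bool:
    """Hypothetical helper: True if func declares a return annotation."""
    return inspect.signature(func).return_annotation is not inspect.Signature.empty

print(has_return_type_hint(do_job))   # True
print(has_return_type_hint(untyped))  # False
```

The hint on do_job is invisible once it is wrapped in a lambda, which is consistent with the advice above to pass the annotated function to kdf.apply directly.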
