df.apply(func_without_typehint, axis=1) is not running in parallel

I was hoping that the operation would be executed in parallel for each row. However, this code (and many other experiments I ran, including logging to MLflow) shows that it is executed sequentially, one row at a time.

import time
import databricks.koalas as ks

kdf = ks.DataFrame({"col": list(range(12))})

def do_job(value) -> float:
    time.sleep(5)
    return value

kdf.apply(lambda x: do_job(x["col"]), axis=1)

This takes 60 seconds with more than 12 Spark workers. I expected every row to be processed in parallel, which would take around 5 seconds in total.

Am I doing something wrong?

My workaround:

kdf.to_spark().rdd.map(lambda x: do_job(x["col"])).collect()
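The timing expectation in the question (12 rows at 5 seconds each is about 60 seconds sequentially, versus about 5 seconds if all rows run at once) can be reproduced at small scale without Spark. The sketch below is only an illustration: it uses a thread pool as a stand-in for Spark workers and a shortened delay (`DELAY`, a hypothetical name) instead of the 5-second sleep.

```python
import time
from concurrent.futures import ThreadPoolExecutor

DELAY = 0.05            # assumption: shortened stand-in for the 5-second sleep
ROWS = list(range(12))  # same 12 values as the "col" column


def do_job(value):
    time.sleep(DELAY)
    return value


# Sequential: one row after another, roughly 12 * DELAY in total.
start = time.perf_counter()
sequential = [do_job(v) for v in ROWS]
sequential_elapsed = time.perf_counter() - start

# Parallel: one thread per row (threads stand in for Spark workers here),
# so the total is roughly a single DELAY plus scheduling overhead.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(ROWS)) as pool:
    parallel = list(pool.map(do_job, ROWS))
parallel_elapsed = time.perf_counter() - start

print(f"sequential: {sequential_elapsed:.2f}s, parallel: {parallel_elapsed:.2f}s")
```

Both runs return the same values; only the elapsed time differs, which matches the gap between the observed 60 seconds and the expected 5 seconds.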

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 7 (6 by maintainers)

Top GitHub Comments

HyukjinKwon commented, Feb 14, 2020

Yes … we should definitely make this logic more visible and easier to use. Thanks for your feedback. I will think about how to make it easier to use and more visible.

HyukjinKwon commented, Feb 14, 2020

Can you try this?

import time
import databricks.koalas as ks

kdf = ks.DataFrame({"col":[i for i in range(12)]})

def do_job(df) -> float:
    value = df["col"]
    time.sleep(5)
    return value

start = time.time()
kdf.apply(do_job, axis=1)
time.time() - start

When kdf.apply is given a function with a type hint, it does not try the shortcut path, because it can determine the return type right away from the hint.

It tried the shortcut in your example because the lambda, lambda x: do_job(x["col"]), has no type hint. The version above takes about 7 to 8 seconds on my local machine.
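The difference between the two calls comes down to whether the function handed to apply carries a return type hint. The sketch below shows one way to check that with the standard library. It is not Koalas's actual implementation, and `has_return_type_hint` is a hypothetical helper named only for this example. A lambda cannot declare annotations, so wrapping an annotated function in a lambda hides the hint.

```python
import inspect
from typing import Any, Callable


def has_return_type_hint(func: Callable[..., Any]) -> bool:
    """Hypothetical helper: True if `func` declares a return annotation."""
    return inspect.signature(func).return_annotation is not inspect.Signature.empty


def do_job(row) -> float:
    return row["col"]


# The lambda calls an annotated function, but the lambda itself has no
# annotation, so a caller inspecting it sees no return type.
wrapped = lambda x: do_job(x["col"])

print(has_return_type_hint(do_job))   # True
print(has_return_type_hint(wrapped))  # False
```

This is why passing do_job directly, with the row as its argument, behaves differently from passing the lambda.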
