df.apply(func_without_typehint, axis=1) is not running in parallel
I was hoping that the operation would be executed in parallel for each row. However, this code (and many other experiments I did, including logging to MLflow) shows that it is executed sequentially, one row at a time.
import time
import databricks.koalas as ks
kdf = ks.DataFrame({"col": list(range(12))})

def do_job(value) -> float:
    time.sleep(5)
    return value

kdf.apply(lambda x: do_job(x["col"]), axis=1)
This takes 60 seconds with more than 12 Spark workers. I was expecting every row to be processed in parallel, which would take around 5 seconds.
Am I doing something wrong?
My workaround:
kdf.to_spark().rdd.map(lambda x: do_job(x["col"])).collect()
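To make the timing expectation concrete, here is a small sketch that uses plain Python threads as a stand-in for Spark workers. It does not use Koalas or Spark at all; the thread pool, the `timed` helper, and the scaled-down `DELAY` are assumptions for illustration only.

```python
import time
from concurrent.futures import ThreadPoolExecutor

DELAY = 0.05  # scaled down from the 5 s sleep in the question
values = list(range(12))


def do_job(value) -> float:
    time.sleep(DELAY)
    return value


def timed(run):
    """Hypothetical helper: return (result, elapsed seconds) for a callable."""
    start = time.perf_counter()
    result = run()
    return result, time.perf_counter() - start


# One row after another: roughly 12 * DELAY in total.
sequential, t_seq = timed(lambda: [do_job(v) for v in values])

# One thread per row: roughly DELAY in total.
with ThreadPoolExecutor(max_workers=len(values)) as pool:
    parallel, t_par = timed(lambda: list(pool.map(do_job, values)))

print(f"sequential: {t_seq:.2f}s, parallel: {t_par:.2f}s")
```

With the original 5 second sleep, the same arithmetic gives about 60 seconds for the sequential run and about 5 seconds when all 12 rows run at once.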
Issue Analytics
- State:
- Created 4 years ago
- Comments:7 (6 by maintainers)
Top GitHub Comments
Yes … we should definitely make such logic more visible and easy to use. Thanks for your feedback. I will think about how to make it easier to use and more visible.
Can you try this?
When kdf.apply takes a function with a type hint, it does not try the shortcut path, because it knows the return type right away from the hint. The reason it tried the shortcut was that the lambda, lambda x: do_job(x["col"]), did not have a type hint. This seems to take 7~8 seconds on my local machine.
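A minimal sketch of the distinction described above: a lambda cannot declare a return type, while a named function can. The `has_return_hint` helper and the names `untyped` and `typed` are hypothetical and only illustrate how a return annotation can be detected with the standard library; this is not Koalas's actual implementation, and `do_job` here skips the long sleep by default.

```python
import inspect
import time


def do_job(value, delay: float = 0.0) -> float:
    """Stand-in for the slow per-row job from the question."""
    time.sleep(delay)
    return float(value)


# The lambda from the question: it has no way to carry a return annotation.
untyped = lambda row: do_job(row["col"])


def typed(row) -> float:
    """Same logic as the lambda, but with a return type hint."""
    return do_job(row["col"])


def has_return_hint(func) -> bool:
    """Hypothetical helper: True if `func` declares a return annotation."""
    annotation = inspect.signature(func).return_annotation
    return annotation is not inspect.Signature.empty


print(has_return_hint(untyped))  # False
print(has_return_hint(typed))    # True
```

Under that reading, passing a function like `typed` instead of the lambda to `kdf.apply(..., axis=1)` is what gives Koalas the return type up front, so it has no reason to take the shortcut path.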