latest dask `map_partitions` doesn't pass list as expected.
Reproducible example
I used the HIGGS dataset from the UCI Machine Learning Repository for demonstration purposes, but any other dataset should suffice.
The failure happens only on dask 2021.05.0; older versions work fine. Also, passing a tuple instead of a list works fine.
```python
from distributed import LocalCluster, Client, wait
from dask import dataframe as dd


def map_partition_fn(block, meta_list):
    # Works with dask-2.22.0, assertion error with 2021.05.0:
    #
    # distributed.worker - WARNING - Compute Failed
    # Function: execute_task
    # args: ((subgraph_callable, [(<function read_block_from_file at 0x7f84259231f0>, <OpenFile 'higgs/HIGGS.csv'>, 640000000, 64000000, b'\n'), None, False]))
    # kwargs: {}
    # Exception: AssertionError(<class 'str'>)
    assert isinstance(meta_list, list), type(meta_list)
    return block


def main(client):
    df: dd.DataFrame = dd.read_csv("higgs/HIGGS.csv")
    meta = [i for i in range(len(df.columns))]
    mapped = df.map_partitions(map_partition_fn, meta_list=meta)
    mapped = client.persist(mapped)
    wait(mapped)


if __name__ == "__main__":
    with LocalCluster() as cluster:
        with Client(cluster) as client:
            main(client)
```
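As noted above, passing the metadata as a tuple avoids the assertion failure. A minimal sketch of that workaround, assuming the `df` and `meta` objects from the snippet above; the function and keyword names (`map_partition_fn_tuple`, `meta_tuple`) are hypothetical and only for illustration:

```python
def map_partition_fn_tuple(block, meta_tuple):
    # The tuple arrives intact on dask 2021.05.0; convert back to a list
    # if the downstream code expects one.
    meta_list = list(meta_tuple)
    assert isinstance(meta_list, list), type(meta_list)
    return block


mapped = df.map_partitions(map_partition_fn_tuple, meta_tuple=tuple(meta))
```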
Environment:
- Dask version: 2021.05.0
- Python version: 3.8.5
- Operating System: Ubuntu 20.04.2
- Install method (conda, pip, source): pip
cc @jakirkham

Thanks for raising this @trivialfis! As James mentioned, the task fusion default was recently changed. I'll look into this asap.
Thanks @trivialfis, I'm able to reproduce using minor.csv. I think this is related to Dask recently disabling low-level task fusion by default for DataFrames (xref https://github.com/dask/dask/pull/7620). If I turn task fusion back on with `dask.config.set({"optimization.fuse.active": True})`, then `meta_list` is a list as expected and the code snippet runs successfully.

cc'ing @rjzamora in case you have any thoughts on this
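For reference, a minimal sketch of the workaround described above, assuming `df`, `meta`, `map_partition_fn`, and `client` from the reproducer in the issue body. Using `dask.config.set` as a context manager keeps the change scoped to this graph rather than the whole session:

```python
import dask

# Re-enable low-level task fusion (disabled by default for DataFrames in
# dask 2021.05.0, xref dask/dask#7620) so `meta_list` reaches the mapped
# function as a plain list.
with dask.config.set({"optimization.fuse.active": True}):
    mapped = df.map_partitions(map_partition_fn, meta_list=meta)
    mapped = client.persist(mapped)
```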