latest dask `map_partitions` doesn't pass list as expected.

Reproducible example

I used the HIGGS dataset from the UCI Machine Learning Repository for demonstration purposes, but any other dataset should suffice.

The failure happens only on dask 2021.05.0; older versions work fine. Passing a tuple instead of a list also works fine.

from distributed import LocalCluster, Client, wait
from dask import dataframe as dd


def map_partition_fn(block, meta_list):
    # Works with dask 2.22.0; fails the assertion below with 2021.05.0:
    #
    # distributed.worker - WARNING - Compute Failed
    # Function:  execute_task
    # args:      ((subgraph_callable, [(<function read_block_from_file at 0x7f84259231f0>, <OpenFile 'higgs/HIGGS.csv'>, 640000000, 64000000, b'\n'), None, False]))
    # kwargs:    {}
    # Exception: AssertionError(<class 'str'>)
    assert isinstance(meta_list, list), type(meta_list)
    return block


def main(client):
    df: dd.DataFrame = dd.read_csv("higgs/HIGGS.csv")
    meta = list(range(len(df.columns)))

    mapped = df.map_partitions(map_partition_fn, meta_list=meta)
    mapped = client.persist(mapped)
    wait(mapped)


if __name__ == "__main__":
    with LocalCluster() as cluster:
        with Client(cluster) as client:
            main(client)

  • Dask version: 2021.05.0
  • Python version: 3.8.5
  • Operating System: Ubuntu 20.04.2
  • Install method (conda, pip, source): pip
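The tuple workaround mentioned above might look roughly like the sketch below. It assumes the keyword only needs to be some sequence; `map_with_tuple` is a hypothetical helper name, not part of dask, and `df` is assumed to be a dask DataFrame.

```python
def map_partition_fn(block, meta_list):
    # Accept any list or tuple rather than insisting on a list.
    assert isinstance(meta_list, (list, tuple)), type(meta_list)
    # Convert back in case downstream code really needs a list.
    meta = list(meta_list)
    assert len(meta) == len(block.columns)
    return block


def map_with_tuple(df, meta):
    # Hypothetical helper: pass the keyword as a tuple, which the report
    # above says still arrives intact on dask 2021.05.0.
    return df.map_partitions(map_partition_fn, meta_list=tuple(meta))
```

In the reproducer, this would replace the direct `df.map_partitions(map_partition_fn, meta_list=meta)` call with `map_with_tuple(df, meta)`.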

cc @jakirkham

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

3 reactions
rjzamora commented, May 20, 2021

Thanks for raising this, @trivialfis! As James mentioned, the task fusion default was recently changed. I’ll look into this asap.

1 reaction
jrbourbeau commented, May 20, 2021

Thanks @trivialfis, I’m able to reproduce using minor.csv. I think this is related to Dask recently disabling low-level task fusion by default for DataFrames (xref https://github.com/dask/dask/pull/7620). If I turn task fusion back on with `dask.config.set({"optimization.fuse.active": True})`, then `meta_list` is a list as expected and the code snippet runs successfully.

cc’ing @rjzamora in case you have any thoughts on this
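As a rough sketch of the workaround described in this comment, the config override can be scoped to the call that submits the graph. This is an illustration under assumptions: `persist_with_fusion` is a hypothetical helper name, `client` is assumed to be a `distributed.Client`, and it assumes graph optimization happens while `client.persist` runs, so the override is active at that point.

```python
import dask


def persist_with_fusion(client, collection):
    # Temporarily re-enable low-level task fusion; the previous setting
    # is restored when the `with` block exits.
    with dask.config.set({"optimization.fuse.active": True}):
        return client.persist(collection)
```

In the reproducer, `mapped = client.persist(mapped)` would become `mapped = persist_with_fusion(client, mapped)`. Calling `dask.config.set(...)` once at startup, as in the comment, applies the setting globally instead.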
