latest dask `map_partitions` doesn't pass list as expected.
Reproducible example
I used the HIGGS dataset from the UCI Machine Learning Repository for demonstration purposes, but any other dataset should suffice.
The failure happens only on dask 2021.05.0; older versions work fine. Also, passing a tuple instead of a list works fine.
```python
from distributed import LocalCluster, Client, wait
from dask import dataframe as dd


def map_partition_fn(block, meta_list):
    # Works with dask-2.22.0, assertion error with 2021.05.0:
    #
    # distributed.worker - WARNING - Compute Failed
    # Function: execute_task
    # args: ((subgraph_callable, [(<function read_block_from_file at 0x7f84259231f0>, <OpenFile 'higgs/HIGGS.csv'>, 640000000, 64000000, b'\n'), None, False]))
    # kwargs: {}
    # Exception: AssertionError(<class 'str'>)
    assert isinstance(meta_list, list), type(meta_list)
    return block


def main(client):
    df: dd.DataFrame = dd.read_csv("higgs/HIGGS.csv")
    meta = [i for i in range(len(df.columns))]
    mapped = df.map_partitions(map_partition_fn, meta_list=meta)
    mapped = client.persist(mapped)
    wait(mapped)


if __name__ == "__main__":
    with LocalCluster() as cluster:
        with Client(cluster) as client:
            main(client)
```
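As noted above, passing the metadata as a tuple avoids the assertion failure. A minimal sketch of that workaround, assuming the `df` and `meta` objects from the snippet above; the function and keyword names (`map_partition_fn_tuple`, `meta_tuple`) are hypothetical and only for illustration:

```python
def map_partition_fn_tuple(block, meta_tuple):
    # The tuple arrives intact on dask 2021.05.0; convert back to a list
    # if the downstream code expects one.
    meta_list = list(meta_tuple)
    assert isinstance(meta_list, list), type(meta_list)
    return block


mapped = df.map_partitions(map_partition_fn_tuple, meta_tuple=tuple(meta))
```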
Environment:
- Dask version: 2021.05.0
- Python version: 3.8.5
- Operating System: Ubuntu 20.04.2
- Install method (conda, pip, source): pip
cc @jakirkham

Thanks for raising this @trivialfis! As James mentioned, the task fusion default was recently changed. I'll look into this asap.
Thanks @trivialfis, I'm able to reproduce using minor.csv. I think this is related to Dask recently disabling low-level task fusion by default for DataFrames (xref https://github.com/dask/dask/pull/7620). If I turn task fusion back on with `dask.config.set({"optimization.fuse.active": True})`, then `meta_list` is a list as expected and the code snippet runs successfully.

cc'ing @rjzamora in case you have any thoughts on this
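For reference, a minimal sketch of the workaround described above, assuming `df`, `meta`, `map_partition_fn`, and `client` from the reproducer in the issue body. Using `dask.config.set` as a context manager keeps the change scoped to this graph rather than the whole session:

```python
import dask

# Re-enable low-level task fusion (disabled by default for DataFrames in
# dask 2021.05.0, xref dask/dask#7620) so `meta_list` reaches the mapped
# function as a plain list.
with dask.config.set({"optimization.fuse.active": True}):
    mapped = df.map_partitions(map_partition_fn, meta_list=meta)
    mapped = client.persist(mapped)
```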