
map_partitions unexpected behavior

See original GitHub issue

The Dask dataframe has 4 partitions, and I mapped the write_and_run function to each partition.

Expected:
1. A total of 4 subprocesses are started.
2. I am using 2 GPUs, and I also expect the partitions to be distributed evenly across the two GPUs.

Actual: more than 4 subprocesses are started, because I can see more than 4 prints of "starting subprocess".

# `command`, `stdout_filename`, `err_filename`, and `ddf` are defined elsewhere in the original script
import shlex
import subprocess

import dask
from dask.distributed import wait

def write_and_run(df):
    # do write stuff
    print("starting subprocess")
    # launch the external command, appending its stdout/stderr to log files
    process = subprocess.run(
        shlex.split(command),
        stdout=open(stdout_filename, mode='a'),
        stderr=open(err_filename, mode='a'),
    )
    # do extra stuff

out = ddf.map_partitions(write_and_run)
# out.compute()
persisted_values = dask.persist(*out)
for pv in persisted_values:
    try:
        wait(pv)
    except Exception:
        print("Error encountered")

Environment:

  • Dask version: latest
  • Python version: 3.7
  • Operating System: Linux
  • Install method (conda, pip, source): pip

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 9 (5 by maintainers)

Top GitHub Comments

1 reaction
jrbourbeau commented, May 19, 2021

It sounds like you want to explicitly not utilize all the available resources, since you don't want to run write_and_run in parallel on the same node. If setting the worker threadpool to a single thread is too restrictive (for example, if you have other tasks in your workflow that you do want to run in parallel), then you can use a SerializableLock inside write_and_run to ensure that only one of the worker's threads can run write_and_run at a time.
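A minimal sketch of the SerializableLock suggestion, reusing the names from the snippet above (command, stdout_filename, err_filename, ddf) and assuming the lock only needs to serialize the threads of a single worker process; it does not coordinate across separate worker processes:

import shlex
import subprocess

from dask.utils import SerializableLock

# One lock shared by all tasks on a worker; with a multi-threaded worker,
# only one thread can hold it at a time.
subprocess_lock = SerializableLock()

def write_and_run(df):
    # do write stuff
    with subprocess_lock:
        # only one thread per worker enters this block at a time
        print("starting subprocess")
        subprocess.run(
            shlex.split(command),
            stdout=open(stdout_filename, mode='a'),
            stderr=open(err_filename, mode='a'),
        )
    # do extra stuff

out = ddf.map_partitions(write_and_run)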

0 reactions
fredms commented, May 18, 2021

That’s great, thanks @jrbourbeau.

In that case, how can I fully utilize the resource capacity without starting multiple subprocesses at the same time?

For example, I have a compute node with 6 cores, 56 GB RAM, and 380 GB of disk. If I limit the worker to --nprocs 1 --nthreads 1, only 1/6 of the node's capacity is used.
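For reference, a minimal sketch of expressing the same worker configuration programmatically with distributed.LocalCluster; the worker counts below mirror the hypothetical 6-core node from this comment and illustrate the trade-off being discussed, rather than resolving the capacity question:

from dask.distributed import Client, LocalCluster

# Mirrors `--nprocs 1 --nthreads 1`: one worker process with a single thread,
# so only one write_and_run task (and hence one subprocess) runs at a time,
# while the other five cores sit idle.
cluster = LocalCluster(n_workers=1, threads_per_worker=1)
client = Client(cluster)

# Several single-threaded workers would use more of the node, but at the cost
# of allowing one subprocess per worker process to run in parallel:
# cluster = LocalCluster(n_workers=6, threads_per_worker=1)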

Read more comments on GitHub >

Top Results From Across the Web

Apache Spark collect Exceptions or messages, which describe unexpected behavior

Change RDD.aggregate() to do reduce(mapPartitions ...
This way, we remove the unnecessary initial comboOp on each partition and also correct the unexpected behavior for mutable zeroValues.

pyspark-tutorial/README.md at master - map-partitions - GitHub
The mapPartitions() transformation should be used when you want to extract some condensed information (such as finding the minimum and maximum of numbers)...

DataFrame.map_partitions - Dask documentation
If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended (see the sketch after these results). For more...

PySpark mapPartitions() Examples
Similar to map(), PySpark mapPartitions() is a narrow transformation operation that applies a function to each partition of the RDD, ...
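The Dask documentation result above recommends passing meta to map_partitions: without it, Dask infers the output schema by calling the mapped function on an empty sample partition, which can produce extra prints like the ones described in this issue. A minimal sketch of passing meta for the snippet above; the returned status frame and its dtype are hypothetical, since the original write_and_run shows no return value:

import pandas as pd

def write_and_run(df):
    print("starting subprocess")
    # ... run the subprocess as in the original snippet ...
    # hypothetical: return a one-row status frame for this partition
    return pd.DataFrame({"returncode": [0]})

# Declaring meta up front tells Dask the output schema, so it does not need to
# call write_and_run on an empty sample partition just to infer it.
out = ddf.map_partitions(write_and_run, meta={"returncode": "int64"})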
