
map_partitions unexpected behavior

See original GitHub issue

The Dask dataframe has 4 partitions, and I mapped the write_and_run function to each partition.

Expected:
1. A total of 4 subprocesses are started.
2. I am using 2 GPUs, and I also expect the partitions to be distributed evenly across the two GPUs.

Actual: more than 4 subprocesses are started, because I can see more than 4 prints of "starting subprocess".

# `command`, `stdout_filename`, `err_filename`, and `ddf` are defined elsewhere in the original script
import shlex
import subprocess

import dask
from dask.distributed import wait

def write_and_run(df):
    # do write stuff
    print("starting subprocess")
    # launch the external command, appending its stdout/stderr to log files
    process = subprocess.run(
        shlex.split(command),
        stdout=open(stdout_filename, mode='a'),
        stderr=open(err_filename, mode='a'),
    )
    # do extra stuff

out = ddf.map_partitions(write_and_run)
# out.compute()
persisted_values = dask.persist(*out)
for pv in persisted_values:
    try:
        wait(pv)
    except Exception:
        print("Error encountered")

Environment:

  • Dask version: latest
  • Python version: 3.7
  • Operating System: Linux
  • Install method (conda, pip, source): pip

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 9 (5 by maintainers)

Top GitHub Comments

1 reaction
jrbourbeau commented, May 19, 2021

It sounds like you want to explicitly not utilize all the available resources, since you don't want to run write_and_run in parallel on the same node. If setting the worker threadpool to a single thread is too restrictive (for example, if you have other tasks in your workflow that you do want to run in parallel), then you can use a SerializableLock inside write_and_run to ensure that only one of the worker's threads can run write_and_run at a time.
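A minimal sketch of the SerializableLock suggestion, reusing the names from the snippet above (command, stdout_filename, err_filename, ddf) and assuming the lock only needs to serialize the threads of a single worker process; it does not coordinate across separate worker processes:

import shlex
import subprocess

from dask.utils import SerializableLock

# One lock shared by all tasks on a worker; with a multi-threaded worker,
# only one thread can hold it at a time.
subprocess_lock = SerializableLock()

def write_and_run(df):
    # do write stuff
    with subprocess_lock:
        # only one thread per worker enters this block at a time
        print("starting subprocess")
        subprocess.run(
            shlex.split(command),
            stdout=open(stdout_filename, mode='a'),
            stderr=open(err_filename, mode='a'),
        )
    # do extra stuff

out = ddf.map_partitions(write_and_run)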

0 reactions
fredms commented, May 18, 2021

That’s great, thanks @jrbourbeau.

In that case, how can I fully utilize the resource capacity without starting multiple subprocesses at the same time?

For example, I have a compute node with 6 cores, 56 GB RAM, and 380 GB of disk. If I limit the worker to --nprocs 1 --nthreads 1, only 1/6 of the node's capacity is used.
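For reference, a minimal sketch of expressing the same worker configuration programmatically with distributed.LocalCluster; the worker counts below mirror the hypothetical 6-core node from this comment and illustrate the trade-off being discussed, rather than resolving the capacity question:

from dask.distributed import Client, LocalCluster

# Mirrors `--nprocs 1 --nthreads 1`: one worker process with a single thread,
# so only one write_and_run task (and hence one subprocess) runs at a time,
# while the other five cores sit idle.
cluster = LocalCluster(n_workers=1, threads_per_worker=1)
client = Client(cluster)

# Several single-threaded workers would use more of the node, but at the cost
# of allowing one subprocess per worker process to run in parallel:
# cluster = LocalCluster(n_workers=6, threads_per_worker=1)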

Read more comments on GitHub >

Top Results From Across the Web

Apache Spark collect Exceptions or messages, which describe unexpected behavior

Change RDD.aggregate() to do reduce(mapPartitions ...
This way, we remove the unnecessary initial comboOp on each partition and also correct the unexpected behavior for mutable zeroValues.

pyspark-tutorial/README.md at master - map-partitions - GitHub
The mapPartitions() transformation should be used when you want to extract some condensed information (such as finding the minimum and maximum of numbers)...

DataFrame.map_partitions - Dask documentation
If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended (see the sketch after these results). For more...

PySpark mapPartitions() Examples
Similar to map(), PySpark mapPartitions() is a narrow transformation operation that applies a function to each partition of the RDD, ...
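The Dask documentation result above recommends passing meta to map_partitions: without it, Dask infers the output schema by calling the mapped function on an empty sample partition, which can produce extra prints like the ones described in this issue. A minimal sketch of passing meta for the snippet above; the returned status frame and its dtype are hypothetical, since the original write_and_run shows no return value:

import pandas as pd

def write_and_run(df):
    print("starting subprocess")
    # ... run the subprocess as in the original snippet ...
    # hypothetical: return a one-row status frame for this partition
    return pd.DataFrame({"returncode": [0]})

# Declaring meta up front tells Dask the output schema, so it does not need to
# call write_and_run on an empty sample partition just to infer it.
out = ddf.map_partitions(write_and_run, meta={"returncode": "int64"})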
