map_partitions unexpected behavior
The Dask dataframe has 4 partitions, and I mapped the "write_and_run" function onto each partition.
Expected: 1. A total of 4 subprocesses are started. 2. I am using 2 GPUs, and I also expect the partitions to be distributed evenly across the GPUs.
Actual: More than 4 subprocesses are started, because I can see more than 4 prints of "starting subprocess".
import shlex
import subprocess

import dask
from dask.distributed import wait

def write_and_run(df):
    # do write stuff
    print("starting subprocess")
    # command, stdout_filename and err_filename are defined elsewhere in the real script
    process = subprocess.run(
        shlex.split(command),
        stdout=open(stdout_filename, mode='a'),
        stderr=open(err_filename, mode='a'),
    )
    # do extra stuff

out = ddf.map_partitions(write_and_run)
# out.compute()
persisted_values = dask.persist(*out)
for pv in persisted_values:
    try:
        wait(pv)
    except Exception:
        print("Error encountered")
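One likely source of the extra prints (consistent with the Dask documentation result quoted further down): when meta is not supplied, map_partitions calls the function once on an empty dummy partition to infer the output metadata, which produces an additional "starting subprocess" on the client. A minimal sketch of supplying meta up front, assuming write_and_run is changed to return its input partition unchanged:

# Sketch: assumes write_and_run ends with "return df". ddf._meta is the empty
# pandas DataFrame Dask already keeps to describe the schema (an internal but
# commonly used attribute), so the output metadata matches the input and no
# dummy call is needed for inference.
out = ddf.map_partitions(write_and_run, meta=ddf._meta)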
Environment:
- Dask version: latest
- Python version: 3.7
- Operating System: Linux
- Install method (conda, pip, source): pip
Issue Analytics
- Created 2 years ago
- Comments: 9 (5 by maintainers)
Top Results From Across the Web
- Apache Spark collect Exceptions or messages, which describe unexpected behavior
- Change RDD.aggregate() to do reduce(mapPartitions ...): This way, we remove the unnecessary initial comboOp on each partition and also correct the unexpected behavior for mutable zeroValues.
- pyspark-tutorial/README.md at master - map-partitions - GitHub: The mapPartitions() transformation should be used when you want to extract some condensed information (such as finding the minimum and maximum of numbers)...
- DataFrame.map_partitions - Dask documentation: If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended.
- PySpark mapPartitions() Examples: Similar to map(), PySpark mapPartitions() is a narrow transformation operation that applies a function to each partition of the RDD, ...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
It sounds like you want to explicitly not utilize all the resources available, since you don't want to run write_and_run in parallel on the same node. If setting the worker threadpool to only have a single thread is too restrictive (for example, if you have other tasks in your workflow that you do want to run in parallel), then you can use a SerializableLock inside write_and_run to ensure that only one of the worker's threads can run write_and_run at a time.
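A minimal sketch of that suggestion (an illustration, not code from the thread): create one SerializableLock at module level so every thread in a worker process shares it, and hold it only around the subprocess launch:

from dask.utils import SerializableLock

# Shared by all of a worker's threads; SerializableLock can be pickled along with
# the task graph and maps back to the same underlying lock inside each worker process.
subprocess_lock = SerializableLock("write_and_run")

def write_and_run(df):
    # do write stuff
    with subprocess_lock:
        # only one thread per worker can be inside this block at a time
        print("starting subprocess")
        subprocess.run(
            shlex.split(command),
            stdout=open(stdout_filename, mode='a'),
            stderr=open(err_filename, mode='a'),
        )
    # do extra stuff
    return df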
That's great, thanks @jrbourbeau.
In that case, how can I fully utilize the node's capacity without starting multiple subprocesses at the same time? For example, I have a compute node with (6 cores, 56 GB RAM, 380 GB disk). If I limit the worker to --nprocs 1 --nthreads 1, only 1/6 of the node's capacity is used.
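One possibility (a sketch building on the SerializableLock idea above, not an answer from the thread): give the single worker process all 6 threads so other tasks can run in parallel, and let the lock inside write_and_run keep the subprocess launches serialized. The LocalCluster setup below is the rough local equivalent of dask-worker --nprocs 1 --nthreads 6:

from dask.distributed import Client, LocalCluster

# One worker process, six threads: the rest of the workflow can use all six
# threads, while the SerializableLock inside write_and_run ensures that at most
# one subprocess is launched per worker at any moment.
cluster = LocalCluster(n_workers=1, threads_per_worker=6)
client = Client(cluster)

out = ddf.map_partitions(write_and_run)
persisted = client.persist(out)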