
Distributed scheduler does not obey dask.order.order for num_workers=1, num_threads=1


What happened:

The distributed scheduler appears to ignore the priorities set by dask.order.order and instead executes tasks in a breadth-first order. This problem is particularly prominent for simple per-partition operation chains, where we read one input partition, do some work on it, then write the result to an output partition.

For this particular use case, the end result is that we load ALL parquet partitions first, then apply the function to each one, then write each one to file.

While the issue is about num_workers=1, num_threads=1, I think the suboptimal scheduling also occurs in situations involving concurrency, and hopefully addressing the serial case will help with the concurrent case.

dask.order.order execution order:

[Screenshot: task graph colored by dask.order.order priority]

Note how we process the left chain as far as possible before moving to the right chain, as expected.

Actual execution:

[Screenshot: dashboard graph view during actual execution]

I made the function pause mid-execution on the top green node. Note how both chains are in memory. I would’ve expected the bottom chain to be either completely grayed out (not started), or the bottom-right node to be the only thing in memory.

What you expected to happen:

The scheduler should instead follow the depth-first execution order that dask.order.order suggests.
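
As a quick illustration (not from the original report), the following sketch runs dask.order.order on a toy graph with two independent read -> work -> write chains; the task names are made up, lower priority numbers run first, and the output should keep each chain together rather than interleaving them breadth-first.

from dask.order import order

def read(i):
    return i

def work(x):
    return x

def write(x):
    return x

# Two independent read -> work -> write chains, standing in for two partitions
dsk = {
    ("read", 0): (read, 0),
    ("work", 0): (work, ("read", 0)),
    ("write", 0): (write, ("work", 0)),
    ("read", 1): (read, 1),
    ("work", 1): (work, ("read", 1)),
    ("write", 1): (write, ("work", 1)),
}

# order() maps each key to an integer priority; lower runs first
priorities = order(dsk)
for key in sorted(priorities, key=priorities.get):
    print(priorities[key], key)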

Minimal Complete Verifiable Example:

import pandas as pd
import numpy as np
import dask.dataframe as dd
import dask
from dask.distributed import Client

# Write data to parquet as a setup, to demonstrate unnecessary memory usage
df = pd.DataFrame(np.random.randn(100, 100), columns=[str(x) for x in range(100)])

# Can change npartitions to 10 to show that this problem becomes severe at higher partition count
ddf = dd.from_pandas(df, npartitions=2)
ddf.to_parquet("input")

# You may need to set host and dashboard_address to expose dashboard
# We set n_workers=1 and threads_per_worker=1 to get rid of concurrency for simplicity
client = Client(n_workers=1, threads_per_worker=1)

ddf = dd.read_parquet("input")

def f(df, partition_info=None):
    # Pause execution on second partition
    if partition_info["number"] == 1:
        import time
        while True:
            time.sleep(5)
    return df

ddf = ddf.map_partitions(f, meta=ddf)
ddf = ddf.to_parquet("test", compute=False)

# Run this to show that the dask.order.order output is reasonable
# ddf.visualize(color="order")
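
# Optional check (not part of the original report): print the static priorities
# that dask.order.order assigns to this graph. Lower numbers are intended to run
# first, and a depth-first order keeps each read -> f -> write chain together.
# from dask.order import order
# priorities = order(dict(ddf.__dask_graph__()))
# for key in sorted(priorities, key=priorities.get):
#     print(priorities[key], key)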

# This will hang (on purpose). Go to dashboard graph view to see the graph screenshotted in this issue.
dask.compute(ddf, scheduler=client)

Anything else we need to know?:

Environment:

  • Dask version: 2021.11.2
  • Python version: 3.8.x
  • Operating System:
  • Install method (conda, pip, source):

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Reactions: 3
  • Comments: 11 (3 by maintainers)

Top GitHub Comments

1 reaction
mrocklin commented, Jan 11, 2022

That could be true, yes. Mostly I think I’m acknowledging the problem you’ve raised, but rejecting the specific example (while commending you for taking the time to make a minimal example). If you can help to construct a better example then that would be welcome.

1 reaction
davidhao3300 commented, Dec 3, 2021

Thank you for the detailed response. I think we have enough info to proceed with optimizing our own graphs; I’ll let you know if anything else weird pops up!
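
For anyone in a similar spot, one way to hand-optimize such graphs (an illustration under assumptions, not a fix discussed in this thread) is to split the frame into per-partition delayed chains and submit each chain with an explicit priority, so the distributed scheduler prefers to finish one partition before starting the next. The "input" path matches the example above, while the "manual_output" directory and the process_and_write helper are made up for this sketch.

import os

import dask
import dask.dataframe as dd
from dask.distributed import Client

client = Client(n_workers=1, threads_per_worker=1)

ddf = dd.read_parquet("input")
os.makedirs("manual_output", exist_ok=True)

# One Delayed object per partition; each becomes its own read -> process -> write chain
parts = ddf.to_delayed()

@dask.delayed
def process_and_write(df, path):
    # Stand-in for the real per-partition work
    df.to_parquet(path)
    return path

futures = []
for i, part in enumerate(parts):
    task = process_and_write(part, os.path.join("manual_output", f"part.{i}.parquet"))
    # Higher priority values run first, so earlier partitions finish before later ones start
    futures.append(client.compute(task, priority=len(parts) - i))

client.gather(futures)

This trades the convenience of calling to_parquet on the whole frame for explicit control over ordering, which is roughly the direction hinted at by “optimizing our own graphs” above.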
