Managing large per-worker objects loaded into distributed worker memory?
First of all, great job by the Dask community. I got into Dask a while ago for systematically parallelizing data workloads and fell in love with it right away. Since I just posted another issue tonight, I thought I might as well also post this one, which has bothered me for a while now.
I want to start with a more general question and see if it evolves into “an actual feature request” (so if it doesn’t fit as a feature request, please move or close it): are objects loaded by code like the following always stored in each worker’s own memory, so that whenever get_worker() is called each worker knows to read the object from its local memory, or is there a risk that the object ends up included in the scheduler’s graph?
```python
from dask.distributed import Client, get_worker

client = Client()

def init_custom_local_memory():
    # Attach a plain dict to this worker for later per-worker lookups.
    me = get_worker()
    me.custom_local_memory = {}
    return True

client.register_worker_callbacks(init_custom_local_memory)

def preload_local_data():
    # Each worker loads its own copy directly from shared storage,
    # bypassing the scheduler entirely.
    df_vocabs = load_df_vocabs_from_shared_storage()
    me = get_worker()
    me.custom_local_memory["df_vocabs"] = df_vocabs

client.register_worker_callbacks(preload_local_data)

def filter_partition(part):
    # Read the preloaded object back from this worker's memory.
    me = get_worker()
    df_vocabs = me.custom_local_memory["df_vocabs"]
    return do_stuff_on_read_only(part, df_vocabs)

result = df_1gram.map_partitions(filter_partition)
```
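For completeness, the same pattern can also be written as a worker plugin (`distributed.WorkerPlugin`), which as far as I understand is the documented way to run per-worker setup code. A minimal sketch, with `load_df_vocabs_from_shared_storage()` being the same placeholder as above:

```python
from dask.distributed import WorkerPlugin

class PreloadVocabs(WorkerPlugin):
    """Load df_vocabs from shared storage once per worker at startup."""

    def setup(self, worker):
        # Runs on every current and future worker.
        worker.custom_local_memory = {
            "df_vocabs": load_df_vocabs_from_shared_storage()
        }

    def teardown(self, worker):
        # Drop the reference so the memory can be reclaimed at shutdown.
        worker.custom_local_memory.clear()

client.register_worker_plugin(PreloadVocabs())
```

Either way, as far as I can tell, the object lives outside Dask's managed memory store, so at best it shows up as unmanaged memory on the dashboard.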
It all started because I needed to broadcast (i.e. have every worker own one copy of) some very large and complex objects (dicts of dicts of objects, total size varying from 100 MB to a few GB). I tried every solution I could find, but none were effective (they either never finished or used too much RAM). Some of the things I tried (rough sketches follow the list):
- direct send as an argument (so it gets included in the graph, I think?)
- scatter with broadcast (never finishes; maybe something weird happens when scattering a dict of dicts of variously sized objects)
- a delayed load function (blew up the RAM?)
- a delayed object (about the same as the above)
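Rough sketches of the first three attempts, for context (these are reconstructions from memory rather than the exact code I ran; `big_obj` stands in for the dict of dicts, and `use_vocab()`/`do_stuff_on_read_only()` for the read-only work):

```python
import dask
from dask.distributed import Client

client = Client()
big_obj = load_df_vocabs_from_shared_storage()  # the dict of dict of objects

def use_vocab(part, vocab):
    # read-only work against vocab, returning the filtered partition
    return do_stuff_on_read_only(part, vocab)

# 1) direct argument: the object gets serialized into every task of the graph
out1 = df_1gram.map_partitions(use_vocab, big_obj)

# 2) scatter with broadcast: one managed copy per worker, pass the future instead
fut = client.scatter(big_obj, broadcast=True)
out2 = df_1gram.map_partitions(use_vocab, fut)

# 3) delayed load: every task depends on a single lazy load of the object
vocab_delayed = dask.delayed(load_df_vocabs_from_shared_storage)()
out3 = df_1gram.map_partitions(use_vocab, vocab_delayed)
```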
Then I found this post, tried something like the code above, and it worked for most medium-sized examples; I really like the approach: https://stackoverflow.com/questions/54432928/setting-up-dask-worker-with-variable
But today, while lazily filtering about 20k CSV files, memory got bloated during the first filtering pass and the scheduler used up all the RAM before any task had even started (possibly due to #8289, i.e. the graph itself being too big). I can't tell for sure, because these objects are unmanaged and therefore invisible on the dashboard, but it did give me some fear that the objects themselves get copied into the graph.
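One rough way to sanity-check this might be to compare serialized sizes with cloudpickle (which is what Dask uses to serialize tasks): a function that captures the big object in its closure versus one that looks it up via `get_worker()` at run time. This is only a heuristic, and the column name `token` is made up:

```python
import cloudpickle
from dask.distributed import get_worker

big_vocab = {i: str(i) for i in range(1_000_000)}  # stand-in for df_vocabs

def make_filter(vocab):
    # vocab ends up in the returned function's closure, so it travels
    # with the function every time the function is serialized
    return lambda part: part[part["token"].isin(vocab.keys())]

def filter_with_lookup(part):
    # only the code is shipped; the data stays on the worker
    vocab = get_worker().custom_local_memory["df_vocabs"]
    return part[part["token"].isin(vocab.keys())]

print(len(cloudpickle.dumps(make_filter(big_vocab))))  # roughly the size of big_vocab
print(len(cloudpickle.dumps(filter_with_lookup)))      # a few hundred bytes
```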
So would it be possible to have something like the following, which puts a per-worker object into Dask-managed memory?
```python
...
me.custom_local_memory["df_vocabs"] = df_vocabs
# hypothetical API: tell Dask to track this object as managed worker memory
client.track_object(me.custom_local_memory["df_vocabs"])
...
```
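In the meantime, the closest thing I can think of for at least inspecting these objects is polling the workers myself; a rough sketch using `client.run` and `dask.sizeof` (a point-in-time estimate, not real tracking):

```python
from dask.sizeof import sizeof

def report_local_memory(dask_worker):
    # client.run passes the Worker instance via the `dask_worker` keyword
    store = getattr(dask_worker, "custom_local_memory", {})
    return {key: sizeof(value) for key, value in store.items()}

# Maps each worker address to {"df_vocabs": <approximate bytes>, ...}
print(client.run(report_local_memory))
```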
Again, this is more of a sanity check; I'd just like to know your thoughts about it.
Sure, this can be closed; the solution for me was to use the (Dask on) Ray object store.
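For anyone landing here later, this is roughly what that looks like (a sketch from memory, untested as written, reusing the placeholder helpers from the issue; `ray.put` stores one copy per node in Ray's shared-memory object store):

```python
import ray
from ray.util.dask import enable_dask_on_ray

ray.init()
enable_dask_on_ray()  # route dask .compute() calls through the Ray scheduler

# One copy per node in the object store instead of one copy per task.
vocab_ref = ray.put(load_df_vocabs_from_shared_storage())

def filter_partition(part):
    # fetch from the node-local object store instead of the task graph
    df_vocabs = ray.get(vocab_ref)
    return do_stuff_on_read_only(part, df_vocabs)

result = df_1gram.map_partitions(filter_partition).compute()
```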
@y-he2 Thanks for opening this! Based on your latest comment, it sounds like this issue is mostly resolved, and tracking #8289 would be enough? If so, please feel free to close this!