Managing large per-worker objects loaded into distributed worker memory?
First of all, great job by the Dask community. I got into Dask a while ago for systematically parallelizing data workloads and fell in love with it right away. Since I just posted another issue tonight, I thought I might as well also post this one, which has bothered me for a while now.
I want to start with a more general question and see if it evolves into “an actual feature request” (so if it doesn’t fit as a feature request, please move or close it): are objects loaded by code like the following always stored in each worker’s own memory, so that whenever get_worker() is called each worker knows to read the object from its local memory, or is there a risk that the object ends up included in the scheduler’s graph?
```python
from dask.distributed import Client, get_worker

client = Client()

def init_custom_local_memory():
    # Attach a plain dict to this worker for later per-worker lookups.
    me = get_worker()
    me.custom_local_memory = {}
    return True

client.register_worker_callbacks(init_custom_local_memory)

def preload_local_data():
    # Each worker loads its own copy directly from shared storage,
    # bypassing the scheduler entirely.
    df_vocabs = load_df_vocabs_from_shared_storage()
    me = get_worker()
    me.custom_local_memory["df_vocabs"] = df_vocabs

client.register_worker_callbacks(preload_local_data)

def filter_partition(part):
    # Read the preloaded object back from this worker's memory.
    me = get_worker()
    df_vocabs = me.custom_local_memory["df_vocabs"]
    return do_stuff_on_read_only(part, df_vocabs)

result = df_1gram.map_partitions(filter_partition)
```
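For completeness, the same pattern can also be written as a worker plugin (`distributed.WorkerPlugin`), which as far as I understand is the documented way to run per-worker setup code. A minimal sketch, with `load_df_vocabs_from_shared_storage()` being the same placeholder as above:

```python
from dask.distributed import WorkerPlugin

class PreloadVocabs(WorkerPlugin):
    """Load df_vocabs from shared storage once per worker at startup."""

    def setup(self, worker):
        # Runs on every current and future worker.
        worker.custom_local_memory = {
            "df_vocabs": load_df_vocabs_from_shared_storage()
        }

    def teardown(self, worker):
        # Drop the reference so the memory can be reclaimed at shutdown.
        worker.custom_local_memory.clear()

client.register_worker_plugin(PreloadVocabs())
```

Either way, as far as I can tell, the object lives outside Dask's managed memory store, so at best it shows up as unmanaged memory on the dashboard.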
It all started because I needed to broadcast (i.e. have every worker own one copy of) some very large and complex objects (dicts of dicts of objects, total size varying from 100 MB to a few GB). I tried every solution I could find, but none were effective (they either never finished or used too much RAM). Some of the things I tried (rough sketches follow the list):
- direct send as an argument (so it gets included in the graph, I think?)
- scatter with broadcast (never finishes; maybe something weird happens when scattering a dict of dicts of variously sized objects)
- a delayed load function (blew up the RAM?)
- a delayed object (about the same as the above)
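Rough sketches of the first three attempts, for context (these are reconstructions from memory rather than the exact code I ran; `big_obj` stands in for the dict of dicts, and `use_vocab()`/`do_stuff_on_read_only()` for the read-only work):

```python
import dask
from dask.distributed import Client

client = Client()
big_obj = load_df_vocabs_from_shared_storage()  # the dict of dict of objects

def use_vocab(part, vocab):
    # read-only work against vocab, returning the filtered partition
    return do_stuff_on_read_only(part, vocab)

# 1) direct argument: the object gets serialized into every task of the graph
out1 = df_1gram.map_partitions(use_vocab, big_obj)

# 2) scatter with broadcast: one managed copy per worker, pass the future instead
fut = client.scatter(big_obj, broadcast=True)
out2 = df_1gram.map_partitions(use_vocab, fut)

# 3) delayed load: every task depends on a single lazy load of the object
vocab_delayed = dask.delayed(load_df_vocabs_from_shared_storage)()
out3 = df_1gram.map_partitions(use_vocab, vocab_delayed)
```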
Then I found this post, tried something like the code above, and it worked for most medium-sized examples; I really like the approach: https://stackoverflow.com/questions/54432928/setting-up-dask-worker-with-variable
But today, while lazily filtering about 20k CSV files, memory got bloated during the first filtering pass and the scheduler used up all the RAM before any task had even started (possibly due to #8289, i.e. the graph itself being too big). I can't tell for sure, because these objects are unmanaged and therefore invisible on the dashboard, but it did give me some fear that the objects themselves get copied into the graph.
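One rough way to sanity-check this might be to compare serialized sizes with cloudpickle (which is what Dask uses to serialize tasks): a function that captures the big object in its closure versus one that looks it up via `get_worker()` at run time. This is only a heuristic, and the column name `token` is made up:

```python
import cloudpickle
from dask.distributed import get_worker

big_vocab = {i: str(i) for i in range(1_000_000)}  # stand-in for df_vocabs

def make_filter(vocab):
    # vocab ends up in the returned function's closure, so it travels
    # with the function every time the function is serialized
    return lambda part: part[part["token"].isin(vocab.keys())]

def filter_with_lookup(part):
    # only the code is shipped; the data stays on the worker
    vocab = get_worker().custom_local_memory["df_vocabs"]
    return part[part["token"].isin(vocab.keys())]

print(len(cloudpickle.dumps(make_filter(big_vocab))))  # roughly the size of big_vocab
print(len(cloudpickle.dumps(filter_with_lookup)))      # a few hundred bytes
```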
So would it be possible to have something like the following, which puts a per-worker object into Dask-managed memory?
```python
...
me.custom_local_memory["df_vocabs"] = df_vocabs
# hypothetical API: tell Dask to track this object as managed worker memory
client.track_object(me.custom_local_memory["df_vocabs"])
...
```
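In the meantime, the closest thing I can think of for at least inspecting these objects is polling the workers myself; a rough sketch using `client.run` and `dask.sizeof` (a point-in-time estimate, not real tracking):

```python
from dask.sizeof import sizeof

def report_local_memory(dask_worker):
    # client.run passes the Worker instance via the `dask_worker` keyword
    store = getattr(dask_worker, "custom_local_memory", {})
    return {key: sizeof(value) for key, value in store.items()}

# Maps each worker address to {"df_vocabs": <approximate bytes>, ...}
print(client.run(report_local_memory))
```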
Again, this is more of a sanity check; I'd just like to know your thoughts about it.
Sure, this can be closed; the solution for me was to use the (Dask on) Ray object store.
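For anyone landing here later, this is roughly what that looks like (a sketch from memory, untested as written, reusing the placeholder helpers from the issue; `ray.put` stores one copy per node in Ray's shared-memory object store):

```python
import ray
from ray.util.dask import enable_dask_on_ray

ray.init()
enable_dask_on_ray()  # route dask .compute() calls through the Ray scheduler

# One copy per node in the object store instead of one copy per task.
vocab_ref = ray.put(load_df_vocabs_from_shared_storage())

def filter_partition(part):
    # fetch from the node-local object store instead of the task graph
    df_vocabs = ray.get(vocab_ref)
    return do_stuff_on_read_only(part, df_vocabs)

result = df_1gram.map_partitions(filter_partition).compute()
```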
@y-he2 Thanks for opening this! Based on your latest comment, it sounds like this issue is mostly resolved, and tracking #8289 would be enough? If so, please feel free to close this!