
[FEA] API to write dask dataframes to local storage of each node in multi-node cluster

Example requested API:

df.to_parquet(xxx, write_locally_per_node=True) 

Please feel free to suggest a better API to do this.
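
For illustration only, a rough sketch of how this might look from user code; write_locally_per_node is the hypothetical flag requested here, and "/raid/df_parts" is just a placeholder for a directory that exists on every node's local disk:

import dask.dataframe as dd
import pandas as pd

df = dd.from_pandas(pd.DataFrame({"x": range(10)}), npartitions=2)

## hypothetical flag: write the partitions to each node's local disk
## instead of to a shared filesystem ("/raid/df_parts" is a placeholder)
df.to_parquet("/raid/df_parts", write_locally_per_node=True)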

Use case details:

When working on a multi-node Dask cluster, users often don’t have shared storage available in the system. This problem gets worse on multi-node cloud systems, where inter-node communication is often a bottleneck.

We need to write these dataframes locally in order to run a set of downstream tasks (think running multiple machine learning models on each node).

Having an API for this in Dask would simplify the process.
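
As a concrete (hypothetical) illustration of the downstream step: once every node holds the data locally, one task per node can be pinned to that node's workers and operate on the local files with plain pandas. The path is the same placeholder as above, the per-node "model" is just a stand-in computation, and client / ip_dict are the objects from the workaround below:

import pandas as pd

def work_on_local_copy(local_dir):
    ## runs on one worker of a node; reads the parquet files that were
    ## written to this node's local disk
    df = pd.read_parquet(local_dir)
    return df.mean(numeric_only=True)  ## stand-in for the per-node ML step

## one future per node, each restricted to that node's workers
futures = [
    client.submit(work_on_local_copy, '/raid/df_parts',
                  workers=node_workers, allow_other_workers=False)
    for node_workers in ip_dict.values()
]
results = client.gather(futures)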

Workaround:

We currently use the workaround below to achieve this, but having it natively in Dask would be very helpful.


import os

import dask
from dask.distributed import get_worker

## writes one dataframe partition to the local disk of the node it runs on
def writing_func(df, node_ip, out_dir, part_num):
    ## IP of the worker this task was scheduled on
    worker_ip = get_worker().name.split('//')[1].split(':')[0]

    ### ensure we are writing using a worker that belongs to node_ip
    assert worker_ip == node_ip

    out_fn = str(part_num) + '.parquet'
    out_path = os.path.join(out_dir, out_fn)
    df.to_parquet(out_path)
    return len(df)

## group worker addresses by the IP (i.e. node) they run on;
## `client` is an existing dask.distributed Client
def get_node_ip_dict():
    workers = client.scheduler_info()['workers'].keys()
    ips = set([x.split('//')[1].split(':')[0] for x in workers])
    ip_d = dict.fromkeys(ips)
    for worker in workers:
        ip = worker.split('//')[1].split(':')[0]
        if ip_d[ip] is None:
            ip_d[ip] = []
        ip_d[ip].append(worker)
    return ip_d
    
    
ip_dict = get_node_ip_dict()
 
output_task_ls = []
for node_ip, node_workers in ip_dict.items():

    ## create one writing task per partition of the dataframe
    task_ls = [dask.delayed(writing_func)(df, node_ip, out_dir, part_num)
               for part_num, df in enumerate(dask_df.to_delayed())]

    ## submit the tasks, restricted to the workers on this node
    o_ls = client.compute(task_ls, workers=node_workers,
                          allow_other_workers=False, sync=False)
    output_task_ls.append(o_ls)

## total number of rows written on each node
len_list = [sum([o.result() for o in o_ls]) for o_ls in output_task_ls]
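
As a quick sanity check on the workaround (a sketch, assuming the goal is a full copy of the data on every node): because the loop above submits every partition to every node, each per-node total in len_list should match the overall row count.

## each node writes all partitions, so each per-node total
## should equal the total number of rows in the dataframe
total_rows = len(dask_df)
assert all(node_total == total_rows for node_total in len_list)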

CC: @quasiben, @randerzander

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Reactions: 1
  • Comments: 10 (9 by maintainers)

Top GitHub Comments

martindurant commented, Jul 16, 2021 (1 reaction)

Er yeah, read “will definitely NOT work”

VibhuJawa commented, Jul 14, 2021 (1 reaction)

Happy to take this on if the community thinks such an API will be useful broadly.
