[FEA] API to write dask dataframes to local storage of each node in multi-node cluster
Example requested API:
df.to_parquet(xxx, write_locally_per_node=True)
Please feel free to suggest a better API to do this.
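For illustration, the requested keyword could be used like the sketch below. Note that `write_locally_per_node` is the proposed (not yet existing) argument, and the scheduler address and paths are placeholders:

# Illustrative sketch only: `write_locally_per_node` is the *proposed* keyword
# and does not exist in dask.dataframe today; the addresses/paths are placeholders.
import dask.dataframe as dd
from dask.distributed import Client

client = Client("tcp://scheduler:8786")      # existing multi-node cluster
ddf = dd.read_parquet("s3://bucket/input/")  # any Dask DataFrame

# Intended behaviour: each worker writes its partitions to a path on the local
# disk of its own node (e.g. /tmp/local_out), instead of to shared storage.
ddf.to_parquet("/tmp/local_out", write_locally_per_node=True)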
Use case details:
When working on a multi-node Dask cluster, users often do not have shared storage available. The problem is even worse on multi-node cloud systems, where inter-node communication is often a bottleneck.
We need to write these dataframes to local storage so that we can run a set of downstream tasks on each node (think running multiple machine learning models per node).
Having an API for this would simplify that workflow.
Workaround:
We currently use the workaround below to achieve this, but having it natively in Dask would be very helpful.
# Assumes an existing distributed Client (`client`), a Dask DataFrame (`dask_df`),
# and an output directory (`out_dir`) that already exists on every node.
import os

import dask
from dask.distributed import get_worker


def writing_func(df, node_ip, out_dir, part_num):
    """Write one pandas partition to the local disk of the node it runs on."""
    # Worker address looks like 'tcp://10.0.0.1:40557'; extract the IP part.
    worker_ip = get_worker().address.split('//')[1].split(':')[0]
    # Ensure we are writing from a worker that belongs to node_ip.
    assert worker_ip == node_ip
    out_fn = str(part_num) + '.parquet'
    out_path = os.path.join(out_dir, out_fn)
    df.to_parquet(out_path)
    return len(df)


def get_node_ip_dict():
    """Map each node IP to the list of worker addresses running on that node."""
    workers = client.scheduler_info()['workers'].keys()
    ips = {w.split('//')[1].split(':')[0] for w in workers}
    ip_d = dict.fromkeys(ips)
    for worker in workers:
        ip = worker.split('//')[1].split(':')[0]
        if ip_d[ip] is None:
            ip_d[ip] = []
        ip_d[ip].append(worker)
    return ip_d


ip_dict = get_node_ip_dict()
output_task_ls = []
for node_ip, node_workers in ip_dict.items():
    # One write task per partition, pinned to this node's workers, so every
    # node ends up with a full local copy of the dataframe.
    task_ls = [
        dask.delayed(writing_func)(df, node_ip, out_dir, part_num)
        for part_num, df in enumerate(dask_df.to_delayed())
    ]
    o_ls = client.compute(task_ls, workers=node_workers,
                          allow_other_workers=False, sync=False)
    output_task_ls.append(o_ls)

# Block until all writes finish and collect the per-node row counts.
len_list = [sum(o.result() for o in o_ls) for o_ls in output_task_ls]
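For completeness, here is a sketch of how the downstream, node-pinned tasks could read the locally written files back. It reuses `client`, `ip_dict`, and `out_dir` from the workaround above; `read_local_parquet` is a made-up helper name, and the final `client.run` line is just an optional sanity check that files landed on every node:

import os

import pandas as pd
from dask.distributed import get_worker


def read_local_parquet(out_dir, node_ip):
    # Runs on a worker pinned to `node_ip`; reads back the parquet parts that
    # the workaround above wrote to this node's local disk.
    worker_ip = get_worker().address.split('//')[1].split(':')[0]
    assert worker_ip == node_ip
    parts = sorted(f for f in os.listdir(out_dir) if f.endswith('.parquet'))
    return pd.concat(pd.read_parquet(os.path.join(out_dir, p)) for p in parts)


# One read (or model-training) task per node, pinned to that node's workers.
futures = {
    node_ip: client.submit(read_local_parquet, out_dir, node_ip,
                           workers=node_workers, allow_other_workers=False)
    for node_ip, node_workers in ip_dict.items()
}
results = {ip: fut.result() for ip, fut in futures.items()}

# Optional sanity check: list what each worker sees in out_dir on its node.
print(client.run(os.listdir, out_dir))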
CC: @quasiben, @randerzander

Er yeah, read “will definitely NOT work”
Happy to take this on if the community thinks such an API will be useful broadly.