[FEA] API to write dask dataframes to local storage of each node in multi-node cluster
Example requested API:
df.to_parquet(xxx, write_locally_per_node=True)
Please feel free to suggest a better API to do this.
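For illustration, the requested keyword could be used like the sketch below. Note that `write_locally_per_node` is the proposed (not yet existing) argument, and the scheduler address and paths are placeholders:

# Illustrative sketch only: `write_locally_per_node` is the *proposed* keyword
# and does not exist in dask.dataframe today; the addresses/paths are placeholders.
import dask.dataframe as dd
from dask.distributed import Client

client = Client("tcp://scheduler:8786")      # existing multi-node cluster
ddf = dd.read_parquet("s3://bucket/input/")  # any Dask DataFrame

# Intended behaviour: each worker writes its partitions to a path on the local
# disk of its own node (e.g. /tmp/local_out), instead of to shared storage.
ddf.to_parquet("/tmp/local_out", write_locally_per_node=True)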
Use case details:
When working on a multi-node Dask cluster, users often do not have shared storage available. The problem is even worse on multi-node cloud systems, where inter-node communication is often a bottleneck.
We need to write these dataframes to local storage so that we can run a set of downstream tasks on each node (think running multiple machine learning models per node).
Having an API for this would simplify that workflow.
Workaround:
We currently use the workaround below to achieve this, but having it natively in Dask would be very helpful.
# Assumes an existing distributed Client (`client`), a Dask DataFrame (`dask_df`),
# and an output directory (`out_dir`) that already exists on every node.
import os

import dask
from dask.distributed import get_worker


def writing_func(df, node_ip, out_dir, part_num):
    """Write one pandas partition to the local disk of the node it runs on."""
    # Worker address looks like 'tcp://10.0.0.1:40557'; extract the IP part.
    worker_ip = get_worker().address.split('//')[1].split(':')[0]
    # Ensure we are writing from a worker that belongs to node_ip.
    assert worker_ip == node_ip
    out_fn = str(part_num) + '.parquet'
    out_path = os.path.join(out_dir, out_fn)
    df.to_parquet(out_path)
    return len(df)


def get_node_ip_dict():
    """Map each node IP to the list of worker addresses running on that node."""
    workers = client.scheduler_info()['workers'].keys()
    ips = {w.split('//')[1].split(':')[0] for w in workers}
    ip_d = dict.fromkeys(ips)
    for worker in workers:
        ip = worker.split('//')[1].split(':')[0]
        if ip_d[ip] is None:
            ip_d[ip] = []
        ip_d[ip].append(worker)
    return ip_d


ip_dict = get_node_ip_dict()
output_task_ls = []
for node_ip, node_workers in ip_dict.items():
    # One write task per partition, pinned to this node's workers, so every
    # node ends up with a full local copy of the dataframe.
    task_ls = [
        dask.delayed(writing_func)(df, node_ip, out_dir, part_num)
        for part_num, df in enumerate(dask_df.to_delayed())
    ]
    o_ls = client.compute(task_ls, workers=node_workers,
                          allow_other_workers=False, sync=False)
    output_task_ls.append(o_ls)

# Block until all writes finish and collect the per-node row counts.
len_list = [sum(o.result() for o in o_ls) for o_ls in output_task_ls]
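For completeness, here is a sketch of how the downstream, node-pinned tasks could read the locally written files back. It reuses `client`, `ip_dict`, and `out_dir` from the workaround above; `read_local_parquet` is a made-up helper name, and the final `client.run` line is just an optional sanity check that files landed on every node:

import os

import pandas as pd
from dask.distributed import get_worker


def read_local_parquet(out_dir, node_ip):
    # Runs on a worker pinned to `node_ip`; reads back the parquet parts that
    # the workaround above wrote to this node's local disk.
    worker_ip = get_worker().address.split('//')[1].split(':')[0]
    assert worker_ip == node_ip
    parts = sorted(f for f in os.listdir(out_dir) if f.endswith('.parquet'))
    return pd.concat(pd.read_parquet(os.path.join(out_dir, p)) for p in parts)


# One read (or model-training) task per node, pinned to that node's workers.
futures = {
    node_ip: client.submit(read_local_parquet, out_dir, node_ip,
                           workers=node_workers, allow_other_workers=False)
    for node_ip, node_workers in ip_dict.items()
}
results = {ip: fut.result() for ip, fut in futures.items()}

# Optional sanity check: list what each worker sees in out_dir on its node.
print(client.run(os.listdir, out_dir))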
CC: @quasiben, @randerzander

Er yeah, read “will definitely NOT work”
Happy to take this on if the community thinks such an API will be useful broadly.